Nextclade variant calling info

Hi Katie! That’s a great question.

There is no single answer for what proportion of the genome needs to be covered for accurate variant calling. In general, the more of the genome you cover, the more confident and accurate the call.

It all depends on what you’re trying to achieve. If you give some more details about what you’re planning to do, I will happily share some tailored advice.

In general, as most of the important antigenic evolution happens in Spike, it’s very useful to cover Spike, in particular everything between S:330-550. If you can sequence more, there are also relevant changes happening across S:0-900.
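If it helps with amplicon design, here is a minimal sketch for turning a Spike amino acid range into genome coordinates. It assumes the Wuhan-Hu-1 reference (MN908947.3), where the S coding sequence starts at nucleotide 21563; the function names are just mine for illustration and are not part of Nextclade.

```python
# Start of the Spike (S) coding sequence in Wuhan-Hu-1 (MN908947.3), 1-based.
# Assumption: your pipeline uses this reference's coordinates.
S_CDS_START = 21563


def spike_aa_range_to_nt(aa_start, aa_end, cds_start=S_CDS_START):
    """Return 1-based inclusive genome coordinates spanning codons aa_start..aa_end."""
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("expected 1 <= aa_start <= aa_end")
    nt_start = cds_start + (aa_start - 1) * 3  # first base of the first codon
    nt_end = cds_start + aa_end * 3 - 1        # last base of the last codon
    return nt_start, nt_end


def region_is_covered(seq_start, seq_end, aa_start, aa_end):
    """True if a sequenced interval (1-based, inclusive) spans the whole codon range."""
    nt_start, nt_end = spike_aa_range_to_nt(aa_start, aa_end)
    return seq_start <= nt_start and seq_end >= nt_end


if __name__ == "__main__":
    print(spike_aa_range_to_nt(330, 550))
```

So a fragment would need to span the returned interval in reference coordinates to fully cover S:330-550.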

Due to lots of homoplasy in Spike, you won’t necessarily be able to pinpoint the exact lineage if you just have S:330-550, but for most purposes, I’d assume the RBD haplotype is the most important insight. If there’s a little bit of full genome sequencing in your region, you should be able to fairly confidently identify the exact lineage that haplotype belongs to.

Here is a plot of where mutations are happening in XBB. The y axis is the number of mutations per nucleotide for all designated XBB sublineages. You can see the clear pattern around Spike (though designation also focuses on that part, so this is exaggerated). Another pattern is that there generally seem to be more mutations towards the 3’ end, downstream of Spike.
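To give a rough idea of the kind of tally behind a plot like this, here is a small sketch that counts mutation events per nucleotide position across a set of lineages and groups them into bins. The lineage names and mutations below are made-up toy data, not real lineage definitions, and this is not how Nextclade itself computes the graphic.

```python
import re
from collections import Counter

# Nucleotide substitutions written as <ref><position><alt>, e.g. "C22995A".
MUTATION_RE = re.compile(r"^[ACGT](\d+)[ACGT]$")


def count_events_per_position(lineage_mutations):
    """Count how many lineages carry a substitution at each genome position."""
    counts = Counter()
    for mutations in lineage_mutations.values():
        for mutation in mutations:
            match = MUTATION_RE.match(mutation)
            if match is None:
                continue  # skip anything that is not a simple substitution
            counts[int(match.group(1))] += 1
    return counts


def bin_counts(counts, bin_size=1000):
    """Sum per-position counts into bins keyed by the bin's first position (1-based)."""
    binned = Counter()
    for position, n in counts.items():
        bin_start = (position - 1) // bin_size * bin_size + 1
        binned[bin_start] += n
    return dict(sorted(binned.items()))


# Toy input: hypothetical lineages with hypothetical mutations.
toy_lineages = {
    "lineage_A": ["C22995A", "T23018C"],
    "lineage_B": ["C22995A", "G28000T"],
    "lineage_C": ["A5000G"],
}

if __name__ == "__main__":
    per_position = count_events_per_position(toy_lineages)
    print(bin_counts(per_position))
```

Plotting the binned counts against genome position gives the same kind of picture, with peaks wherever many lineages pick up mutations.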

You can get that graphic by scrolling down on this page: https://next.nextstrain.org/staging/nextclade/sars-cov-2/21L?c=clade_display&label=clade:22F and selecting “EVENTS” and “NT”.

Regarding confidence: we’re working on giving a confidence estimate for lineage calls, and on reporting the most likely and next most likely lineages. This could be helpful for your use case. It would be great to hear what you’re planning to do. Thanks!