NextClade Variant Calling info

I was also wondering what are the recommended % genome recoveries necessary for accurate SARS-CoV-2 variant calling using NextClade? I was thinking you could get a low % genome recovery from the consensus alignment, but maybe your primers sequenced key regions for accurate variant calling. What key regions on the SARS-CoV-2 genome should be sequenced in order for NextClade to accurately call a variant? Thanks! Katie

Hi Katie! That’s a great question.

There is no single answer for what proportion of the genome allows accurate variant calling. In general, the higher the coverage, the more confident and accurate the call.

It all depends on what you’re trying to achieve. If you give some more details about what you’re planning to do I will happily share some tailored advice.

In general, since most of the important antigenic evolution happens in Spike, it’s very useful to cover Spike, in particular everything between S:330-550. If you can sequence more, there are also relevant changes happening across S:0-900.

Due to lots of homoplasy in spike, you won’t necessarily be able to pinpoint the exact lineage if you just have S:330-550, but for most purposes, I’d assume the RBD haplotype is the most important insight. If there’s a little bit of full genome sequencing in your region you should be able to fairly confidently identify the exact lineage that haplotype belongs to.
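To check whether a partial genome actually covers the S:330-550 window mentioned above, something like the following Python sketch could be used. It is only an illustration under stated assumptions: the Spike CDS is taken to start at nucleotide 21563 (1-based) in the Wuhan-Hu-1 reference (MN908947.3), the input sequence is assumed to be already aligned to reference coordinates (insertions stripped, gaps kept), and the function names `spike_codons_to_nt` and `region_coverage` are hypothetical helpers, not part of Nextclade.

```python
# Assumption: Spike CDS starts at nucleotide 21563 (1-based) in the
# Wuhan-Hu-1 reference genome (MN908947.3).
SPIKE_START = 21563


def spike_codons_to_nt(first_codon, last_codon, cds_start=SPIKE_START):
    """Return the 1-based, inclusive nucleotide range spanning the given codons."""
    start = cds_start + (first_codon - 1) * 3
    end = cds_start + last_codon * 3 - 1
    return start, end


def region_coverage(aligned_seq, start, end):
    """Fraction of unambiguous A/C/G/T bases in a 1-based, inclusive range.

    aligned_seq must be in reference coordinates. Gaps ('-') and N are
    counted as not covered, so genuine deletions are undercounted.
    """
    region = aligned_seq[start - 1:end].upper()
    if not region:
        return 0.0
    called = sum(base in "ACGT" for base in region)
    # Divide by the full window length so a truncated sequence is penalised.
    return called / (end - start + 1)


# Nucleotide window for S:330-550 under the assumption above.
rbd_start, rbd_end = spike_codons_to_nt(330, 550)
```

With an aligned sequence in hand, `region_coverage(seq, rbd_start, rbd_end)` gives a number between 0 and 1 for that window. Any threshold applied to it would be a judgment call for the analysis at hand.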

Here is a plot of where mutations are happening in XBB. The y axis is the number of mutations per nucleotide across all designated XBB sublineages. You can see the clear pattern around Spike (though designation also focuses on that part, so this is exaggerated). Another pattern is that there generally seem to be more mutations towards the 3’ end, downstream of Spike.

You can get that graphic by scrolling down on this page: https://next.nextstrain.org/staging/nextclade/sars-cov-2/21L?c=clade_display&label=clade:22F and selecting “EVENTS” and “NT”.

Regarding confidence: we’re working on giving a confidence estimate for lineage calls, and on providing info about what the most likely and next most likely lineages are. This could be helpful for your use case. It would be great to hear what you’re planning to do. Thanks!

Hi @corneliusroemer ,

I have a follow-up question for this post. I am doing a retrospective analysis on SC2 data to look at variant severity. In the model, I am grouping sequences by clade and looking at yes/no for hospitalizations. I am interested in understanding more about how the ‘good’ vs ‘mediocre’ vs ‘bad’ qc.overallStatus relates to the accuracy of lineage calls, and whether there has been any further work on a confidence estimate for lineage calls. Essentially, which sequences should be excluded because the lineage call can’t be trusted, and is qc.overallStatus the best indicator of this?

Thank you in advance!

Hi @laurenfrisbie,

If you’re looking at classifications at (Nextstrain) clade level, which is fairly broad, I don’t think there’s much uncertainty for whole genome sequences. Even if it’s spike only, there’s usually enough information in sequences to place them confidently at a clade level.

It’s true that we don’t quantify placement uncertainty at the moment.

Even if a sequence gets a bad QC status, the clade assignment should usually still be reliable. The only reason a clade call could be wrong is if you have a coinfection or a recombinant that hasn’t been designated, in which case there isn’t a “correct” clade.
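For a retrospective analysis like the one described, one way to see how much the QC status matters in practice is to tabulate clade calls by qc.overallStatus and then run the model with and without the ‘bad’ rows. The Python sketch below is only an illustration: it assumes a tab-separated Nextclade results file with `clade` and `qc.overallStatus` columns, and `clade_counts_by_qc` is a hypothetical helper name.

```python
import csv
from collections import Counter


def clade_counts_by_qc(lines):
    """Count sequences per (clade, qc.overallStatus) pair.

    `lines` is any iterable of text lines from a tab-separated results
    table, for example an open file handle. The column names 'clade' and
    'qc.overallStatus' are assumed to be present in the header.
    """
    counts = Counter()
    for row in csv.DictReader(lines, delimiter="\t"):
        counts[(row["clade"], row["qc.overallStatus"])] += 1
    return counts


def clade_totals(counts, exclude_status=()):
    """Collapse to per-clade totals, optionally dropping some QC statuses."""
    totals = Counter()
    for (clade, status), n in counts.items():
        if status not in exclude_status:
            totals[clade] += n
    return totals
```

A possible usage, with `nextclade.tsv` as a placeholder path: open the file with `open("nextclade.tsv", newline="")`, pass the handle to `clade_counts_by_qc`, then compare `clade_totals(counts)` against `clade_totals(counts, exclude_status=("bad",))`. If the per-clade totals and the downstream severity estimates barely move, that is consistent with the point above that clade calls are usually reliable even for sequences with a bad QC status.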

I’m sorry I can’t give a more precise answer here, I hope the above helps nonetheless.

Best,

Cornelius

Hi @corneliusroemer,

Thank you for your insight! It is helpful to know that a sequence with a bad overall QC status should usually still have a reliable clade assignment.

Regards,
Lauren