NextClade Variant Calling info

I was also wondering what are the recommended % genome recoveries necessary for accurate SARS-CoV-2 variant calling using NextClade? I was thinking you could get a low % genome recovery from the consensus alignment, but maybe your primers sequenced key regions for accurate variant calling. What key regions on the SARS-CoV-2 genome should be sequenced in order for NextClade to accurately call a variant? Thanks! Katie

1 Like

Hi Katie! That’s a great question.

There is no single answer for what proportion of the genome allows accurate variant calling. In general, the higher the coverage, the more confident/accurate the answer.

It all depends on what you’re trying to achieve. If you give some more details about what you’re planning to do I will happily share some tailored advice.

In general, as most of the important antigenic evolution happens in Spike, it’s very useful to cover Spike, in particular everything between S:330-550, if you can sequence more, there’s also stuff happening in S:0-900.

Due to lots of homoplasy in spike, you won’t necessarily be able to pinpoint the exact lineage if you just have S:330-550, but for most purposes, I’d assume the RBD haplotype is the most important insight. If there’s a little bit of full genome sequencing in your region you should be able to fairly confidently identify the exact lineage that haplotype belongs to.

Here is a plot of where mutations are happening in XBB, y axis is the number of mutations per nucleotide for all designated XBB sublineages. You can see the clear pattern around Spike (though designation also focuses on that part so this is exaggerated). Other patterns are that generally there seems to be more mutations towards the 3’ end downstream of Spike.

You can get that graphic by scrolling down on this page: https://next.nextstrain.org/staging/nextclade/sars-cov-2/21L?c=clade_display&label=clade:22F and selecting “EVENTS” and “NT”.

Regarding confidence: we’re working on giving a confidence estimate for lineage calls, and also provide info about what the most and next most likely lineage is. This could be helpful for your use case. Would be great to hear what you’re planning to do. Thanks!

Hi @corneliusroemer ,

I have a follow up question for this post. I am doing a retrospective analysis on SC2 data to look at variant severity. In the model, I am grouping sequences by clade and looking at yes/no for hospitalizations. I am interested in understanding more about how the ‘good’ vs ‘mediocre’ vs’ bad’ qc.overallStatus is related to accuracy of lineage calls and if there has been any further work done for having a confidence estimate for lineage calls? Essentially, which sequences should be excluded because the lineage call can’t be trusted and is the qc.overallStatus the best indicator of this?

Thank you in advance!

Hi @laurenfrisbie,

If you’re looking at classifications at (Nextstrain) clade level, which is fairly broad, I don’t think there’s much uncertainty for whole genome sequences. Even if it’s spike only, there’s usually enough information in sequences to place them confidently at a clade level.

It’s true that we don’t quantify placement uncertainty at the moment.

Even if a sequence gets a bad QC status the clade assignment should usually still be reliable. The only reason that clade calls could be wrong is if you have a coinfection or a recombinant that hasn’t been designated, in which case there isn’t a “correct” clade.

I’m sorry I can’t give a more precise answer here, I hope the above helps nonetheless.

Best,

Cornelius

Hi @corneliusroemer,

Thank you for your insight! It is helpful to know that a bad QC overall status should usually have a reliable clade assignment.

Regards,
Lauren

I hope I am allowed to add my questions here. We are using NextClade to assign RSV clades. As I understand it, the clade assignments come without a confidence level. As long as full-genome sequences are being used, the clade assignments should be reliable. Is there a way to know how reliable the clade assignments are if I have only sequencing data for some of the genes. I could restrict available full-genome sequences to individual genes to get a general idea, but I was wondering if there is a way to obtain a confidence level per sequence.

Thank you!
Thomas

1 Like

Hi @Thomas! Of course you are allowed to add your question!

At time of writing (v3.4.0) Nextclade has no inbuilt way of reporting confidence. One reason is that confidence has many dimensions and Nextclade only knows about a few, so any confidence we report might be misleading.

Nextclade could output placement uncertainty, saying how many equally parsimonious placements there are, but that on its own might not be very informative. Adding what the clades of the various equally optimal placements are would be possible, but quickly get hard to interpret.

Given that RSV has many substitutions distinguishing lineages, there should not be much uncertainty even if you have only a half length sequence or even a quarter.

As confidence is something I’ve been thinking about, it’d be great to hear a bit more about your use case for it. Is your question more of a general nature regarding how short a sequence you can get away with to save resources or do you have a concrete application of individual confidence values in mind?

1 Like

Hi Cornelius,

Thank you for your answer! I am interested if I can get accurate genotyping results if I don’t have Full Genome sequences but have only certain genes available.

I did a small test and fed nextclade with some Full Genome sequences and then extracted L and G and only provided the sequences containing either L or G.

Using L, I got identical genotypes for all sequences that I tested.

Using G, I had some differences in the genotype assignments. I noticed that the sequence coverage of sequences that are correctly assigned is about 0.06 and the alignment start, and alignment end positions seem correct. For the sequences where I obtain a different genotype between full genome and G, the alignment start pos also seems correct, but the alignment end position is around position 12,000. I am happy to provide you with a sequence where I encounter this.

Thank you again for your help!
Thomas

Hi everyone,

I wanted to follow-up on the differences of nextclade RSV clade assignments (web tool) based on full genome sequences vs G only. An example of a sequence where this happens is the ncbi accession KJ723462. Next-clade returns A.2 based on the Full Genome and A.D.2.2.1 based on G only. The output shows 7014 gaps for G, so I think the difference could come from a bad alignment of the G sequence.

Should I better open a new topic for this?

Thank you!
Thomas

Hi Thomas,

I agree that the difference most likely comes from a bad alignment.
I have just tried the accession you provided and I am not seeing the misassignment you have observed. Even reducing G to only about 600 bases still results in the same A.2 assignment:

could you share more details on what you did?

best,
richard

Hi Richard,

Thank you for looking into this and my apologies for the delayed response.

For the accession i mentioned I go to NCBI and then select the accession i mentioned (Human respiratory syncytial virus strain RSVA/Homo sapiens/USA/92I-068 - Nucleotide - NCBI), then I select the CDS for G: (Human respiratory syncytial virus strain RSVA/Homo sapiens/USA/92I-068 - Nucleotide - NCBI). I create a fasta file and then load it into nextclade (https://clades.nextstrain.org/). I repeated this steps right now and still obtain lineage A.D.2.2.1 if I use the full G sequence. I also truncated G after 425 nucleotides and indeed I obtain lineage A.2.

I am blocked within my company to upload a screen shot, but I could email it to you if that would help you.

The exact sequence I am using is the following:

KJ723462.1:4612-5508 Human respiratory syncytial virus strain RSVA/Homo sapiens/USA/92I-068A-01/1992, complete genome
ATGTCCAAAACCAAGGACCAACGCACCGCCAAGACACTAGAAAAGACCTGGGACACTCTCAATCATCTAT
TATTCATATCATCGTGCTTATACAAGTTAAATCTTAAATCTATAGCACAAATCACATTATCCATTCTGGC
AATGATAATCTCAACTTCACTTATAATTGCAGCCATCATATTCATAGCCTCAGCAAACAACAAAGTCACA
CTAACAACTGCAATCATACAAGATGCAACAAGCCAGATCAAGAACACAACCCCAACATACCTGACCCAGA
ATCCCCAGCTTGGAATCAGCTTCTTTAATCTGTCTGGAACTACATCACAAACCACCGCCATACTAGCTTT
AACAACACCAAGTGTCGAGTCAATCCTGCAATCTACAACAGTCAAGACCAAAAACACAACAACAACCCAA
ATACAACCCAGCAAGCCCACCACAAAACAACGCCAAAACAAACCACCAAACAAACCCAATAATGATTTTC
ACTTTGAAGTGTTCAACTTTGTACCCTGCAGCATATGCAGCAACAATCCAACTTGCTGGGCCATCTGCAA
AAGAATACCAAGCAAAAAACCTGGAAAGAAAACCACCACCAAGCCCACAAAAAAACCAACCATCAAGACA
ACCAAAAAAGATCTCAAACCTCAAACCACAAAACCAAAGGAAGCACCTACCACCAAGCCCACAGAAAAGC
CAACCATCAACATCACCAAACCAAACATCAGAACTACACTGCTCACCAACAGTACCACAGGAAATCTAGA
ACACACAAGTCAAGAGGAAACTCTCCATTCAACCTCCTCCGAAGGCAATACAAGCCCTTCACAAGTCTAT
ACAACATCCGAGTACCTATCACAACCTCCATCTCCATCCAACATAACAAACCAGTAG

Best,
Thomas

Hi Thomas,

I could reproduce this issue. This is a but of an unlucky case where the fairly default low gap penalty is not sufficient to prevent a spurious alignment where the last part of the sequence get aligned in the wrong place to avoid mismatches with the reference. the reference happens to be clade A.D.2.2.1 and this pulls your sequence towards the reference. it doesn’t happen for longer or shorter sequences, because then the spurious alignment isn’t favorable.

I’ll increase the gap penalties in future version of the dataset to avoid these miscalls.

thanks for flagging this, this was instructive!
richard

1 Like