RSV-A lineage reference sequences have discordant clade assignments

I am building a high-quality reference tree with sequences from all of the RSV-A lineages defined using the 2024 Consortium criteria. My goal is to use this tree as a reference to infer the clade of novel sequences.

However, I have run into a discrepancy: two of the sequences listed as Clade A.2 on the Consortium GitHub cluster with Clade A in Nextclade (and are therefore assigned Clade A by Nextclade). The two sequences in question are MG642058 and MG642028.

I am not sure whether this discrepancy represents an error in one of the datasets, or simply an ambiguous result. I could use some help understanding why this is happening and how I should treat these sequences.

A few more details:
To start, I downloaded the aligned reference sequences from the Consortium GitHub. I then loaded them into Nextclade Web and got the following result: two of the A.2 sequences (MG642058 and MG642028) cluster with the A sequences, apart from the other A.2 sequences.

Since no Clade A sequences were included in the aligned FASTA, I added Clade A reference sequences from the Consortium GitHub (MG642074, LC741417, and OK649668). I re-aligned all the sequences using MAFFT. Then, I built a tree using augur.

In this tree, the discordant sequences are more closely related to other Clade A.2 sequences than to the Clade A sequences. (Apologies for the similarity of the clade colors).

I believe this result indicates that these two sequences are most appropriately called Clade A.2, and that I could include them as examples of Clade A.2 in my reference tree. Is this correct? If anyone has insight into this, could you please confirm this conclusion or help me correct my understanding?

Hi @davidbacsik (good to see you here!),

Looking at the line-by-line history of the lineage A clade definitions, it looks like one of the problematic representative sequences was added earlier (Nov 14) than the defining mutations/substitutions for A.2 (Nov 25).

The defining mutations in that YAML file get exported to this TSV file that then gets used to assign clade labels to nodes in the Nextclade dataset. Based on the timing of events, I would think the defining mutations are correct and that the A.2 annotation for the two sequences you mentioned above might reflect an incorrect manual assignment to that clade (e.g., MG642058 shouldn’t be listed as a representative of A.2).
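To make that logic concrete, here is a minimal sketch of checking a sequence against a table of defining mutations. It assumes the table has the columns `clade`, `gene`, `site`, and `alt`, and that the translated proteins are already in reference coordinates. The clade name, mutations, and sequence in the example are toy values, not the real A.2 definition, and the function names are hypothetical. Note that the real pipeline labels nodes on a tree, so this per-sequence check is only an approximation of what it does.

```python
import csv
import io


def load_defining_mutations(tsv_text):
    """Parse a clade table (assumed columns: clade, gene, site, alt)."""
    table = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        mutation = (row["gene"], int(row["site"]), row["alt"])
        table.setdefault(row["clade"], []).append(mutation)
    return table


def carries_clade_mutations(proteins, mutations):
    """Return True if every defining mutation is present.

    proteins maps gene name -> amino acid sequence in reference
    coordinates; sites are 1-based.
    """
    for gene, site, alt in mutations:
        seq = proteins.get(gene, "")
        if len(seq) < site or seq[site - 1] != alt:
            return False
    return True


# Toy table: clade "X.1" and its mutations are made up for illustration.
TOY_TSV = "clade\tgene\tsite\talt\nX.1\tF\t3\tK\nX.1\tF\t5\tR\n"
toy_table = load_defining_mutations(TOY_TSV)

# Toy protein: K at site 3 and R at site 5, so both mutations are present.
print(carries_clade_mutations({"F": "MEKLR"}, toy_table["X.1"]))
```

A sequence that is listed as a representative of a clade but fails this kind of check is a candidate for an incorrect manual assignment.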

The following screenshot shows one of the two sequences you mentioned above and another sequence from the Nextclade dataset coloring the tree by genotype at sites F 540 and 547, showing that the MG642058 sequence lacks the defining F alleles for clade A.2.

@rneher Does that interpretation of events seem correct?

thanks for raising this! it looks like there is some inconsistency between the clades.tsv and the reference alignment. I think the A/A.2 distinction makes more sense if aligned with the two mutations in F, but the reference alignment wants to place the switch a bit further up the tree. I’ll try to sort this out with the next update.

Hi @jlhudd and @rneher,

Thanks a lot for taking the time to think through this. I appreciate it!

Sounds like the F mutations are a good way to distinguish the clades. I’ll remove the MG642028 and MG642058 sequences from my list of representative A.2 sequences, since they both have S/L genotypes.
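In case it is useful to anyone else, this is roughly how I plan to filter my list. It is only a sketch: it assumes I already have ungapped F protein sequences in reference coordinates, and the function names, sequence names, and toy sequences are all made up for illustration. The non-S/L alleles in the toy data are arbitrary placeholders, not the real A.2 alleles.

```python
def f_genotype(f_protein, sites=(540, 547)):
    """Return the amino acids at the given 1-based F sites, e.g. 'S/L'."""
    return "/".join(f_protein[site - 1] for site in sites)


def drop_by_genotype(f_proteins, excluded="S/L"):
    """Keep only sequences whose F 540/547 genotype differs from `excluded`."""
    return {
        name: seq
        for name, seq in f_proteins.items()
        if f_genotype(seq) != excluded
    }


def toy_f_protein(aa540, aa547):
    """Build a 574-residue toy protein with chosen residues at 540 and 547."""
    return "A" * 539 + aa540 + "A" * 6 + aa547 + "A" * 27


# Hypothetical candidates; "T" and "F" are placeholder alleles only.
candidates = {
    "toy_seq_1": toy_f_protein("S", "L"),
    "toy_seq_2": toy_f_protein("T", "F"),
}

kept = drop_by_genotype(candidates)
print(sorted(kept))
```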