I am building a high-quality reference tree with sequences from all of the RSV-A lineages defined using the 2024 Consortium criteria. My goal is to use this tree as a reference to infer the clade of novel sequences.

However, I have run into a discrepancy: two of the sequences listed as Clade A.2 on the Consortium GitHub cluster with Clade A in Nextclade (and are therefore assigned to Clade A by Nextclade). The two sequences in question are MG642058 and MG642028.

I am not sure whether this discrepancy represents an error in one of the datasets, or simply an ambiguous result. I could use some help understanding why this is happening and how I should treat these sequences.

A few more details:

To start, I downloaded the aligned reference sequences from the Consortium GitHub. I then loaded them into Nextclade Web and got the following result. Two of the A.2 sequences (MG642058 and MG642028) cluster with the A sequences and apart from the other A.2 sequences.

Since no Clade A sequences were included in the aligned FASTA, I added Clade A reference sequences from the Consortium GitHub (MG642074, LC741417, and OK649668). I re-aligned all the sequences using MAFFT. Then, I built a tree using augur.
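For anyone who wants to reproduce this step, it can be sketched roughly as below. This is a minimal sketch under assumptions, not the exact commands used: the file names (`merged.fasta`, `aligned.fasta`, `tree_raw.nwk`) and the helper functions are hypothetical, and the MAFFT `--auto` option is an assumed choice. It also assumes `mafft` and `augur` are installed and on the PATH.

```python
import subprocess
from pathlib import Path


def merge_fasta(inputs, output):
    """Concatenate FASTA files, making sure each one ends with a newline."""
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            out.write(text if text.endswith("\n") else text + "\n")


def mafft_command(merged):
    """MAFFT writes the alignment to stdout; --auto is an assumed setting."""
    return ["mafft", "--auto", str(merged)]


def augur_tree_command(alignment, tree):
    """Build a tree from the alignment with `augur tree`."""
    return ["augur", "tree", "--alignment", str(alignment), "--output", str(tree)]


def run_pipeline(inputs, workdir="."):
    """Merge, re-align, and build a tree. All file names are hypothetical."""
    workdir = Path(workdir)
    merged = workdir / "merged.fasta"
    aligned = workdir / "aligned.fasta"
    tree = workdir / "tree_raw.nwk"
    merge_fasta(inputs, merged)
    # Redirect MAFFT's stdout into the aligned FASTA file.
    with open(aligned, "w") as handle:
        subprocess.run(mafft_command(merged), stdout=handle, check=True)
    subprocess.run(augur_tree_command(aligned, tree), check=True)
    return tree
```

Calling `run_pipeline(["consortium_aligned.fasta", "clade_A_refs.fasta"])` (again, hypothetical file names) would write the raw Newick tree next to the inputs.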

In this tree, the discordant sequences are more closely related to the other Clade A.2 sequences than to the Clade A sequences. (Apologies for the similarity of the clade colors.)
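One crude way to cross-check this reading of the tree, independent of the tree itself, is to compare raw pairwise distances in the alignment. The sketch below is only an illustration: it uses the uncorrected p-distance (proportion of differing sites) rather than a phylogenetic method, the function names are hypothetical, and the sequences in the usage example are made-up toy strings, not real RSV data.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap or N are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else float("nan")


def mean_distance(query, group):
    """Mean p-distance from one query sequence to a group of sequences."""
    return sum(p_distance(query, seq) for seq in group) / len(group)


def closer_group(query, groups):
    """Name of the group with the smallest mean p-distance to the query."""
    return min(groups, key=lambda name: mean_distance(query, groups[name]))


# Toy aligned sequences (made up for illustration only).
toy_groups = {
    "A.2": ["ACGTACGA"],
    "A": ["TTTTACGT"],
}
print(closer_group("ACGTACGT", toy_groups))
```

In practice the query would be the aligned MG642058 or MG642028 sequence and the groups would be the aligned A.2 and A reference sequences. Agreement between this check and the tree would support the A.2 call, but it does not replace a proper phylogenetic placement.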

I believe this result indicates that these two sequences are most appropriately called Clade A.2, and that I could include them as examples of Clade A.2 in my reference tree. Is this correct? If anyone has insight into this, could you please confirm this conclusion or help me correct my understanding?