I am trying to understand the clade assignment for RSV. In the downloaded metafile, most sequences have been assigned a clade. This is true even if there is no coverage for G or F (e.g. JQ822101). Given that many of the defining mutations are not present in that case, I wonder how the clade could be assigned. Were these sequences clustered together with sequences for which all defining variants were available?
Related, how were the defining mutations selected? I can see that in RSV-A e.g. L 1661N can be found in most sequences of clade A.D.1, A.D.1.1 and A.D.1.2, but it is not a defining mutation.
Great question! I think by “downloaded metafile” you mean the files from data.nextstrain.org, i.e. the ones linked from here: GitHub - nextstrain/rsv: Workflow for RSV analyses on Nextstrain.org. Is that correct?
Clade assignment of partial sequences
Clade assignment is done by Nextclade, which assigns clades based on reference tree placement: the sequence is attached to its nearest neighbor on the reference tree. You can read more about the algorithm here: 5. Phylogenetic placement — Nextclade documentation
As the reference tree is based on the whole genome, sequences can be placed on the reference tree, and hence assigned a clade, even if there is no coverage of G and F.
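To make the idea concrete, here is a toy sketch of nearest-neighbor placement that ignores uncovered sites. It is not Nextclade's actual implementation (see the linked documentation for that); the sequences, node names, clade labels and the `distance`/`place` helpers are all made up for illustration.

```python
# Toy illustration only: Nextclade's real placement is more involved.
MISSING = {"N", "-"}

def distance(query, node_seq):
    """Count mismatches, skipping positions where the query has no coverage."""
    return sum(
        1
        for q, r in zip(query, node_seq)
        if q not in MISSING and q != r
    )

def place(query, reference_nodes):
    """Return (node name, clade) of the reference node nearest to the query."""
    name, (seq, clade) = min(
        reference_nodes.items(),
        key=lambda item: distance(query, item[1][0]),
    )
    return name, clade

# Hypothetical 12-site "genomes"; pretend sites 4-7 are the G gene.
reference_nodes = {
    "node_A": ("ACGTAAAACGTA", "A.1"),
    "node_B": ("ACGTCCCCCGTT", "A.2"),
    "node_C": ("TCGTGGGGCGAA", "A.3"),
}

# Query without coverage of sites 4-7, but informative elsewhere.
query = "ACGTNNNNCGTT"
placed_on, clade = place(query, reference_nodes)
print(placed_on, clade)
```

In this toy case the query matches `node_B` at every covered site, so it inherits that node's clade even though the "G" sites are all `N`. The remaining covered sites carry enough signal to pick a neighbor.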
Selection of defining mutations
Regarding your second question: the defining mutations listed in the yaml files are (I think) an arbitrary set of mutations that uniquely define a clade compared to the parent lineage. I don’t think they are meant to be an exhaustive list of mutations.
I think that the defining mutations are mostly used to annotate clades when using augur clades. I think @rneher has a script somewhere that turns the various yml files (e.g. A.1.yml) into a lineage.tsv that can be fed into
The yml files define clades in two complementary ways:
- a) through representative sequences
- b) through defining mutations
The clade starts at the common ancestor of all selected representative sequences, which will (hopefully) be identical with the first branch that has all the defining mutations.
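As a rough sketch of how the two definitions can be compared, the snippet below finds the common ancestor of some representative tips in a toy tree and, separately, the first node on the way down from the root that carries all defining mutations. The tree, node names, mutation labels and helper functions are hypothetical and are not taken from the RSV lineage files or from augur.

```python
# Toy tree as child -> parent; None marks the root. All names are made up.
parent = {
    "root": None,
    "n1": "root",
    "n2": "n1",
    "s1": "n2",
    "s2": "n2",
    "s3": "n1",
    "s4": "root",
}

# Mutations on the branch leading to each node (hypothetical labels).
branch_mutations = {
    "n1": {"G:101A"},
    "n2": {"F:202T"},
}

def path_to_root(node):
    """List of nodes from `node` up to and including the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def common_ancestor(tips):
    """Most recent common ancestor of the given tips."""
    paths = [path_to_root(t) for t in tips]
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    # Walking up from the first tip, the first shared node is the MRCA.
    return next(n for n in paths[0] if n in shared)

def mutations_at(node):
    """All mutations accumulated between the root and `node`."""
    muts = set()
    for n in path_to_root(node):
        muts |= branch_mutations.get(n, set())
    return muts

def first_node_with(defining, tip):
    """Walking from the root towards `tip`, first node with all defining mutations."""
    for n in reversed(path_to_root(tip)):
        if defining <= mutations_at(n):
            return n
    return None

defining = {"G:101A", "F:202T"}
print(common_ancestor(["s1", "s2"]), first_node_with(defining, "s1"))
print(common_ancestor(["s1", "s3"]), first_node_with(defining, "s1"))
```

With representatives `s1` and `s2`, both approaches point at `n2`. Choosing `s1` and `s3` instead moves the common ancestor up to `n1`, which lacks one of the defining mutations, so the two definitions no longer agree.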
I’m not sure if the consortium has defined what to do in case the two ways disagree. Maybe @rneher can jump in here.
Does that answer your questions?
@Thomas (thanks @corneliusroemer for the clear answers): the defining mutations are selected to uniquely differentiate a lineage from its parent. This list doesn’t necessarily include all differences. The lineages are defined via an annotated reference alignment and will be discussed in more detail in an upcoming publication. That publication will also contain a more comprehensive list of defining mutations.
Thanks a lot, Cornelius, for your highly informative and clear answer! I just learned a lot.
Thank you, Richard!
I am looking forward to the upcoming publication. And thank you again for this great resource!