Understanding RSV clade assignment

Thomas · January 9, 2024, 6:05pm

Hi all,

I am trying to understand the clade assignment for RSV. In the downloaded metafile, most sequences have an assigned clade. This is true even if there is no coverage for G or F (e.g. JQ822101). Given that many of the defining mutations are not present in that case, I wonder how the clade could be assigned. Were these sequences clustered together with sequences for which all defining variants were available?

Relatedly, how were the defining mutations selected? I can see that in RSV-A, for example, L 1661N can be found in most sequences of clades A.D.1, A.D.1.1 and A.D.1.2, but it is not a defining mutation.

Thank you!
Thomas

corneliusroemer · January 10, 2024, 5:05pm

Hi Thomas,

Great question! By “downloaded metafile”, I think you mean the files from data.nextstrain.org, i.e. the ones linked from GitHub - nextstrain/rsv: Workflow for RSV analyses on Nextstrain.org. Is that correct?

Clade assignment of partial sequences

Clade assignment is done by Nextclade, which assigns clades based on placement on a reference tree: the sequence is attached to its nearest neighbor on the reference tree and takes that neighbor's clade. You can read more about the algorithm here: “5. Phylogenetic placement” in the Nextclade documentation.

As the reference tree is based on the whole genome, sequences can be placed on the reference tree, and hence assigned a clade, even if there is no coverage of G and F.
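To make the idea concrete, here is a minimal toy sketch of nearest-neighbor placement that ignores uncovered positions. It is only an illustration of the principle, not Nextclade's actual implementation: the node names, sequences, and clade labels are all made up, and the real algorithm works on mutations relative to a reference rather than on raw strings.

```python
# Toy illustration only: NOT Nextclade's implementation.
# Node names, sequences, and clade labels below are invented for this example.
REFERENCE_NODES = {
    "node_A1": {"seq": "ACGTACGTAC", "clade": "A.1"},
    "node_A2": {"seq": "ACGTTCGTAC", "clade": "A.2"},
    "node_B1": {"seq": "TCGTTCGAAC", "clade": "B.1"},
}


def distance(query, ref):
    """Count mismatches, skipping positions where the query has no coverage ('N')."""
    return sum(1 for q, r in zip(query, ref) if q != "N" and q != r)


def assign_clade(query):
    """Return (node name, clade) of the reference node closest to the query."""
    best = min(
        REFERENCE_NODES,
        key=lambda name: distance(query, REFERENCE_NODES[name]["seq"]),
    )
    return best, REFERENCE_NODES[best]["clade"]


# Positions 4-7 are uncovered, but the covered sites still pick out one node.
print(assign_clade("TCGTNNNNAC"))  # ('node_B1', 'B.1')
```

In this toy example the uncovered stretch hides one of the sites that separates B.1 from the A clades, yet the remaining covered site is enough to place the query. The same logic explains why a sequence without G or F coverage can still get a clade, as long as other parts of the genome carry enough signal.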

Selection of defining mutations

Regarding your second question: the defining mutations listed in the yaml files are (I think) an arbitrary set of mutations that uniquely define a clade compared to its parent lineage. I don’t think they are meant to be an exhaustive list of mutations.

I think that the defining mutations are mostly used to annotate clades when using augur clades. I think @rneher has a script somewhere that turns the various yml files (e.g. A.1.yml) into a lineage.tsv that can be fed into augur clades.

The yml files define clades in two complementary ways:

a) through representative sequences
b) through defining mutations

The clade starts at the common ancestor of all selected representative sequences, which will (hopefully) be identical with the first branch that has all the defining mutations.
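As a rough sketch of how the two definitions can be compared, the toy example below finds the common ancestor of the representative sequences and, separately, the first node on the way down from the root at which all defining mutations have accumulated. Everything here is hypothetical: the tree, the node names, and the mutation names are invented, and this is not the consortium's or augur's code.

```python
# Toy tree as child -> parent; all names and mutations are invented.
PARENT = {
    "n1": "root", "n2": "n1",
    "tipA": "n2", "tipB": "n2", "tipC": "n1", "tipD": "root",
}
# Mutations on the branch leading to each node (hypothetical names).
BRANCH_MUTATIONS = {"n1": {"F:K10R"}, "n2": {"G:P5L", "L:S20N"}}


def path_to_root(node):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def common_ancestor(tips):
    """Most recent common ancestor of the given tips."""
    paths = [path_to_root(t) for t in tips]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first tip, the first shared node is the MRCA.
    return next(n for n in paths[0] if n in shared)


def first_node_with_all(tip, defining):
    """First node, going from root to tip, where all defining mutations are present."""
    seen = set()
    for node in reversed(path_to_root(tip)):
        seen |= BRANCH_MUTATIONS.get(node, set())
        if defining <= seen:
            return node
    return None


by_representatives = common_ancestor(["tipA", "tipB"])
by_mutations = first_node_with_all("tipA", {"G:P5L"})
print(by_representatives, by_mutations)  # n2 n2
```

Here both definitions point at the same node. Note that the branch leading to it also carries a second mutation that is not in the defining set, which matches the point that defining mutations need not be exhaustive.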

I’m not sure if the consortium has defined what to do in case the two ways disagree. Maybe @rneher can jump in here.

Does that answer your questions?

Best,

Cornelius

rneher · January 12, 2024, 8:59am

@Thomas (thanks @corneliusroemer for the clear answers): the defining mutations are selected to uniquely differentiate a lineage from its parent. This list doesn’t necessarily include all differences. The lineages are defined via an annotated reference alignment and will be discussed in more detail in an upcoming publication. That publication will also contain a more comprehensive list of defining mutations.

Thomas · January 12, 2024, 11:47pm

Thanks a lot, Cornelius, for your highly informative and clear answer! I just learned a lot.

Best,
Thomas

Thomas · January 12, 2024, 11:51pm

Thank you, Richard!

I am looking forward to the upcoming publication. And thank you again for this great resource!
