Simple question: for determining the mutations in full genome measles sequences, which strain is used as the reference (root)? As far as I can tell so far it is one that is subtype A, but I’d like to know the precise accession number, so that I can do some analysis of my own.
Hi @roberthohan
That’s NC_001498.1 currently
For the 2 Nextstrain builds this is:
- complete genome: /phylogenetic/defaults/measles_reference.gb
- 450bp region (1233…1682) of the N gene: /phylogenetic/defaults/measles_reference_N450.gb
(see description and the rest of this GitHub repo for details)
For the Nextclade dataset it’s the second one, in FASTA format: /phylogenetic/defaults/measles_reference_N450.fasta. The resulting dataset is in github.com/nextstrain/nextclade_data/data/nextstrain/measles
Is NC_001498.1 recommended to be used as the root of the tree? It doesn’t appear to be specified with the --root parameter in the current workflow (measles/phylogenetic/rules/construct_phylogeny.smk at main · nextstrain/measles · GitHub), and it is also included in the list of dropped strains (measles/phylogenetic/defaults/dropped_strains.txt at main · nextstrain/measles · GitHub).
Is there a specific sequence or set of sequences recommended for use as the tree root to ensure it is consistent across analyses?
Thanks in advance!
NC_001498.1 is the reference sequence for the measles tree and is not used for rooting. The tree is rooted using TreeTime, which determines which rooting is most consistent with the sampling dates of the sequences, as described here:
Hopefully this documentation will help clarify the difference between the root and reference sequences:
NC_001498.1 is removed from the measles tree because augur align currently produces an error if the reference is present in the input --sequences and also given as the --reference-sequence:
Let me know if this doesn’t clear things up.
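In case it helps, here is a minimal sketch of what dropping the reference from the input sequences amounts to. It is not the workflow's code (the workflow lists the accession in dropped_strains.txt, as noted above); the helper names `parse_fasta` and `drop_reference` and the toy sequences are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

REFERENCE_ID = "NC_001498.1"  # accession named in this thread


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The record ID is the first whitespace-delimited token of the header.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_reference(
    records: Iterable[Tuple[str, str]], reference_id: str = REFERENCE_ID
) -> List[Tuple[str, str]]:
    """Keep every record except the one whose ID matches the reference."""
    return [(name, seq) for name, seq in records if name != reference_id]


# Toy input: the sequences are placeholders, not real measles data.
example = """>NC_001498.1 reference
ACGT
>sample_1
ACGA
>sample_2
ACGG
""".splitlines()

kept = drop_reference(parse_fasta(example))
print([name for name, _ in kept])
```

With the reference filtered out of the input, it is supplied to the alignment step only once, as the reference sequence.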
This makes sense, thank you for your response!
Is the root left unspecified for measles, unlike for other pathogens such as mpox, because measles does not have as clear or distinct a clade structure?
ex: mpox/phylogenetic/defaults/hmpxv1/config.yaml at master · nextstrain/mpox · GitHub
Ideally we let TreeTime root the tree, but if the clock signal in the data is weak, TreeTime may choose the wrong placement for the root. In this case, we often use prior knowledge to root the tree. For example, for the mpox clade IIb tree that you mentioned, we can look at the mpox all-clades tree to see where clade IIb fits into the larger tree, and then use that information to manually define the root node, which for the mpox clade IIb tree is currently the common ancestor of MK783032 and MK783030.
A weak clock signal can be caused by various factors, e.g. a small sampling time frame, recombination, inconsistent clock rate across clades, etc.
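To make "clock signal" a bit more concrete, here is a rough sketch of the idea behind date-consistent rooting: regress root-to-tip distance on sampling date for each candidate root and prefer the root with the best fit. This is a simplified illustration, not TreeTime's actual implementation, which is more sophisticated. The function names, candidate root labels, and numbers below are all invented for the example.

```python
from statistics import mean
from typing import Dict, List, Tuple


def clock_fit(dates: List[float], distances: List[float]) -> Tuple[float, float]:
    """Least-squares fit of root-to-tip distance on sampling date.

    Returns (slope, r_squared). The slope plays the role of a clock rate and
    r_squared is a crude measure of how strong the clock signal is.
    """
    mx, my = mean(dates), mean(distances)
    sxx = sum((x - mx) ** 2 for x in dates)
    syy = sum((y - my) ** 2 for y in distances)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dates, distances))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # no spread in dates or distances: no usable signal
    return sxy / sxx, (sxy ** 2) / (sxx * syy)


def best_rooting(
    dates: List[float], candidates: Dict[str, List[float]]
) -> Tuple[str, Dict[str, float]]:
    """Score each candidate root by r_squared, ignoring non-positive slopes."""
    scored = {}
    for name, distances in candidates.items():
        slope, r2 = clock_fit(dates, distances)
        scored[name] = r2 if slope > 0 else 0.0
    return max(scored, key=scored.get), scored


# Toy data: sampling years and root-to-tip distances under two hypothetical roots.
dates = [2000, 2005, 2010, 2015, 2020]
candidates = {
    "root_a": [0.010, 0.015, 0.020, 0.025, 0.030],  # distance grows with time
    "root_b": [0.020, 0.012, 0.025, 0.011, 0.022],  # little relation to time
}

best, scored = best_rooting(dates, candidates)
print(best, scored)
```

When every candidate root gives a poor fit (for example because the samples span only a short time frame), the data cannot distinguish between rootings this way, which is the situation where prior knowledge is used instead.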