I hope to use the Nextstrain website for my introductory biology class this Fall. However, I have one vexing question about the interpretation of the default phylogenetic tree (ncov) on the website:
It seems that every branch represents one unique genotype, identified on the date shown on the x-axis (in the default view). What happens if another virus of the same genotype is identified? Is that then shown as the end of the branch, and the previous end ‘erased’? (I have yet to find a sample placed internally on a branch.) In other words, does the branch end indicate the last virus of that genotype to be identified, or the first? I find it hard to believe that every genotype is identified only once.
Hi @plyons! The default tree view for SARS-CoV-2 on Nextstrain shows a “time tree” where each tip is plotted on the x-axis by its collection date and the internal nodes indicate the inferred time that ancestral strains circulated based on the observed collection dates and phylogenetic structure. In this layout, you will rarely see tips plotted on top of ancestral nodes. For example, here is the zoomed-in view of the Alpha clade in our global build:
You can change this default view to show a “divergence tree” by selecting the “divergence” button in the left navigation bar under the “Branch Length” section. In this divergence tree view of the same clade, we see tips and internal nodes plotted on the x-axis by the number of mutations they have relative to the root sequence of the tree.
Here we see that several tips fall on top of internal nodes, indicating that these tips and the ancestral strain represented by the internal node have exactly the same genotype. If we added new strains to this tree with sequences identical to those of existing strains, the new strains would be plotted at the same position on the x-axis as their identical counterparts (with a slightly different y-axis position due to the “rectangular” phylogenetic tree layout).
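To make the "same x-axis position" idea concrete, here is a minimal Python sketch. It is only an illustration: the root sequence, the strain names, and the `mutations` helper are all made up for this example, and the real tree gets its divergence values from mutations inferred along the branches of the phylogeny rather than from a direct comparison to the root like this.

```python
from collections import defaultdict


def mutations(root: str, seq: str) -> tuple:
    """Return substitutions relative to the root as (ref, 1-based position, alt) tuples."""
    if len(root) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    return tuple(
        (r, i + 1, s) for i, (r, s) in enumerate(zip(root, seq)) if r != s
    )


# Hypothetical toy alignment (not real SARS-CoV-2 data).
root = "ACGTACGTAC"
strains = {
    "strain_A": "ACGTACGTAC",  # identical to the root
    "strain_B": "ACGAACGTAC",  # one mutation: T4A
    "strain_C": "ACGAACGTAC",  # same genotype as strain_B
    "strain_D": "ACGAACGTTC",  # two mutations: T4A and A9T
}

# Divergence here is simply the number of mutations relative to the root,
# which is what sets the x-axis position in this simplified picture.
divergence = {name: len(mutations(root, seq)) for name, seq in strains.items()}

# Strains with the same set of mutations share a genotype, so they share
# an x-axis position and differ only in their y-axis position.
by_genotype = defaultdict(list)
for name, seq in strains.items():
    by_genotype[mutations(root, seq)].append(name)

for genotype, names in by_genotype.items():
    label = ", ".join(f"{ref}{pos}{alt}" for ref, pos, alt in genotype) or "(root genotype)"
    print(f"x = {len(genotype)} mutations: {names} [{label}]")
```

In this toy case, `strain_B` and `strain_C` land at the same x-axis position because they carry the same single mutation, while `strain_A` sits on top of the root.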
Identical strains in this view can also appear as polytomies, where the observed strains differ in genotype from their inferred ancestor but are identical to each other. I’ve highlighted two such polytomies in the following tree for New Zealand strains, where two pairs of strains share a nucleotide mutation at 17009 (yellow tips) and a sibling pair of strains shares a mutation at 29296 (blue tips at the same x-axis position):
Thank you for your quick reply! Yes, this helps to understand the presentation of data, and I can see that the divergence view is better for understanding genetic relationships.
But I guess I’m still curious about the meaning of the “time” x-axis. It appears from these data that a viral strain is very rarely (almost never, although you do show a number of identical strains) found again, or reported to this database again. I’m curious to know whether this reflects sampling (if the virus evolves rapidly enough, and sampling is distributed/slow enough, then we wouldn’t expect many identical strains) or reporting (an identical strain is not entered into the database because it is not a new strain). For example, most strains within Clade 20G appear to have been reported Feb-April 2021, but not since; most strains within Clade 21A (Delta) appear to have been reported April-June 2021; meanwhile, strains within Clade 20I (Alpha, V1) appear to have been reported across a wide time span, Feb-June 2021. My guess, based on this, is that the pattern reflects sampling rather than reporting, and that it really is rare to sample the identical virus twice.