What is the algorithm to assign dates to a phylogenetic tree of SARS-CoV-2?


I have a few questions about the SARS-CoV-2 phylogenetic tree created at Nextstrain (https://nextstrain.org/ncov/global).

1. I found that only about 5000 SARS-CoV-2 genomes from GISAID were used to create the tree. How were these genomes selected?
2. How is the tree rooted?
3. There seem to be multiple SARS-CoV-2 samples on a single branch. What is the relationship between these samples? Are they identical in genome or protein sequence?
4. The x-axis of the tree is date (from Dec 2019 to Sep 2020). How was the sampling date assigned in the tree? Are only the samples whose sampling dates fall in this range shown?

I am quite new to building and interpreting phylogenetic trees, and I am really confused by the Nextstrain tree. I hope someone can help me with these questions. I would also like to know whether there are papers or references describing the algorithm Nextstrain uses to create a phylogenetic tree with dates on the x-axis. Thanks!

Best wishes,
Xiaoqiang Huang

Hi Xiaoqiang, please see Sagulenko et al. for the approach we use to date the internal nodes of the tree. All of the steps we use to generate our SARS-CoV-2 datasets on nextstrain.org are detailed at github.com/nextstrain/ncov, which includes a tutorial where you can generate your own datasets.
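
To give some intuition for how dates can end up on a tree, here is a minimal sketch of root-to-tip regression, a much simpler stand-in for the maximum-likelihood dating described in Sagulenko et al. (it is not the method used in the ncov workflow). It assumes a strict molecular clock, so each sample's divergence from the root grows linearly with its sampling date; the fitted line then gives a substitution rate and an estimated root date, and any divergence value can be mapped back to a date. The function names and the toy numbers are made up for illustration.

```python
def fit_clock(dates, divergences):
    """Least-squares fit of divergence = rate * (date - root_date).

    dates: sampling dates as decimal years (e.g. 2020.25)
    divergences: root-to-tip distances in substitutions per site
    Returns (rate, root_date).
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    var = sum((t - mean_t) ** 2 for t in dates)
    rate = cov / var                    # substitutions per site per year
    root_date = mean_t - mean_d / rate  # date at which divergence is zero
    return rate, root_date


def date_for_divergence(divergence, rate, root_date):
    """Map a node's divergence from the root to a date under the fitted clock."""
    return root_date + divergence / rate


# Toy data (hypothetical values, not real SARS-CoV-2 measurements).
sample_dates = [2020.0, 2020.25, 2020.5, 2020.75]
sample_divergences = [0.00008, 0.00028, 0.00048, 0.00068]

rate, root_date = fit_clock(sample_dates, sample_divergences)
internal_node_date = date_for_divergence(0.0002, rate, root_date)
print(f"rate={rate:.2e} subs/site/year, root date={root_date:.2f}")
print(f"internal node dated to {internal_node_date:.2f}")
```

In a real analysis the tip dates are the sampling dates from the sequence metadata, and the internal node dates are inferred jointly over the whole tree rather than read off a single regression line.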

