Downloading SARS-CoV-2 data from Nextstrain

For posterity, I’m documenting this question and the answer I provided by email:

I am working on a scientific project about COV19.

I would like to get the sub-lineage information for the COV19 sequences, since Nextstrain has already labelled all COV19 strains with lineages.

I am wondering if I can download the corresponding data from somewhere? Or can you give me some suggestions for that?


There are two main lineage/clade systems in use:

  • Nextstrain clades (like 21I)

  • Pango lineages (like B.1.617.2)

Nextstrain clades are coarse-grained: there are only around 30 of them.

Pango lineages are fine-grained: there are almost 2000 of them.

Nextstrain uses two data sources:

  • GISAID data

  • Open data (from GenBank)

Depending on your research goal, the open data is easier to start with, but it is less complete.
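As a rough illustration of working with such data once downloaded, here is a minimal Python sketch that counts clades and lineages in a tab-separated metadata table. It rests on assumptions that are not part of the email above: that the open metadata is published as a gzipped TSV (at the time of writing I believe it lives at https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz, but please verify), and that it has columns named `Nextstrain_clade` and `pango_lineage`. The function name `tally_column` and the sample rows are made up for the example.

```python
import csv
import io
from collections import Counter


def tally_column(tsv_stream, column):
    """Count how often each non-empty value appears in one column of a TSV stream."""
    reader = csv.DictReader(tsv_stream, delimiter="\t")
    return Counter(row[column] for row in reader if row.get(column))


# Illustrative rows only, not real records. The column names are an
# assumption about the metadata layout; check the header of the real file.
SAMPLE = (
    "strain\tNextstrain_clade\tpango_lineage\n"
    "example/1\t21A (Delta)\tB.1.617.2\n"
    "example/2\t21A (Delta)\tB.1.617.2\n"
    "example/3\t20I (Alpha, V1)\tB.1.1.7\n"
)

clade_counts = tally_column(io.StringIO(SAMPLE), "Nextstrain_clade")
lineage_counts = tally_column(io.StringIO(SAMPLE), "pango_lineage")

print(clade_counts.most_common())
print(lineage_counts.most_common())
```

For a real download, the same function should work on a file opened with `gzip.open(path, "rt", newline="")`. Reading row by row keeps memory use low, which matters because the full metadata table is large.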

I hope this helps.

Which data source would you recommend for a beginner like me who is learning about them for the very first time?