For posterity, I’m documenting this question and the answer I provided by email:
I am working on a scientific project about COV19.
I would like to get the sub-lineage information for the COV19 sequences, since Nextstrain has already labelled all COV19 strains with lineages.
I am wondering if I can download the corresponding data from somewhere? Or can you give me some suggestions for that?
There are two main lineage/clade systems in use:
Nextstrain clades (like 21I)
Pango lineages (like B.1.617.2)
Nextstrain clades are coarser-grained: there are only around 30 of them.
Pango lineages are fine-grained: there are almost 2000 of them.
Nextstrain uses two data sources:
Open data from GenBank: this is only a subset of all available sequences, mostly from the US, the UK, and Germany. You can download sequences and metadata curated by Nextstrain here:
Overview of remote nCoV files (intermediate build assets) — SARS-CoV-2 Workflow documentation
GISAID data: this is more complete, but we are not allowed to share it. You have to download the data yourself from GISAID, where you can request an account.
Depending on your research goal, the open data is the easier place to start, but it is less complete.
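As a rough illustration of how the lineage labels could be pulled out of a downloaded metadata file, here is a minimal Python sketch. It assumes the file is a tab-separated table (optionally gzip-compressed) with one row per sequence. The column names `strain`, `Nextstrain_clade`, and `pango_lineage`, the file name `metadata.tsv.gz`, and both helper functions are assumptions for illustration, so check them against the header of the file you actually download.

```python
import csv
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def lineages_by_strain(path, strain_col="strain",
                       clade_col="Nextstrain_clade",
                       pango_col="pango_lineage"):
    """Map each strain name to its (Nextstrain clade, Pango lineage) pair.

    The default column names are assumptions; pass your own if the
    header of your metadata file differs.
    """
    with open_text(path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[strain_col]: (row[clade_col], row[pango_col])
                for row in reader}


def count_pango_lineages(labels):
    """Count how many sequences carry each Pango lineage."""
    return Counter(pango for _clade, pango in labels.values())


# Example usage (hypothetical local file name):
# labels = lineages_by_strain("metadata.tsv.gz")
# print(count_pango_lineages(labels).most_common(10))
```

This reads the whole table into memory, which may be slow for the full metadata file. If you only need counts, you could update the counter row by row inside the loop instead of building the dictionary first.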
I hope this helps.