Downloading SARS-CoV-2 data from Nextstrain

corneliusroemer · March 11, 2022, 2:39pm

For posterity, I'm documenting this question and the answer I provided by email:

I am working on a scientific project about COV19.

I would like to get the sub-lineage information for the COV19 sequences, since Nextstrain has already labelled all COV19 strains with lineages.

I am wondering if I can download the corresponding data from somewhere? Or can you give me some suggestions for that?

Thanks.

There are two main lineage/clade systems in use:

Nextstrain clades (like 21I)
Pango lineages (like B.1.617.2)

Nextstrain clades are more coarse-grained: there are only around 30 of them.

Pango lineages are fine-grained: there are almost 2000 of them.

Nextstrain uses two data sources:

Open data from GenBank. This is only a subset of all available sequences, mostly from the US, UK, and Germany. You can download sequences and metadata curated by Nextstrain here:
Remote inputs — SARS-CoV-2 Workflow documentation
https://nextstrain.org/blog/2021-07-08-ncov-open-announcement
GISAID data. This is more complete, but we are not allowed to share it. You have to download the data yourself from GISAID; you can request an account there.

Depending on your research goal, the open data is the easier place to start, but it is less complete.
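If you go the open-data route, here is a minimal sketch of how you might pull the clade and lineage labels out of the curated metadata with plain Python. The URL and the column names (`strain`, `Nextstrain_clade`, `pango_lineage`) are my assumptions about the current file layout, so please check them against the remote inputs documentation linked above before relying on this.

```python
import csv
import gzip
import io
import urllib.request
from collections import Counter

# Assumed location of the Nextstrain-curated open (GenBank) metadata.
# Verify against the "Remote inputs" documentation; it may change.
OPEN_METADATA_URL = "https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz"

# Assumed column names in the metadata header.
DEFAULT_COLUMNS = ("strain", "Nextstrain_clade", "pango_lineage")


def read_lineages(text_stream, columns=DEFAULT_COLUMNS):
    """Yield one tuple per row of a tab-separated metadata file.

    A column that is missing from the file yields an empty string
    instead of raising an error.
    """
    reader = csv.DictReader(text_stream, delimiter="\t")
    for row in reader:
        yield tuple(row.get(col) or "" for col in columns)


def fetch_lineages(url=OPEN_METADATA_URL, columns=DEFAULT_COLUMNS):
    """Stream the gzipped metadata from `url` and yield lineage tuples."""
    with urllib.request.urlopen(url) as response:
        # Decompress on the fly so the whole file is never held in memory.
        raw = gzip.GzipFile(fileobj=response, mode="rb")
        with io.TextIOWrapper(raw, encoding="utf-8", newline="") as text:
            yield from read_lineages(text, columns)


def count_by_pango(rows):
    """Count sequences per Pango lineage (third element of each tuple)."""
    return Counter(pango for _strain, _clade, pango in rows)
```

Something like `count_by_pango(fetch_lineages()).most_common(10)` would then list the ten most frequent Pango lineages. The metadata file is large, so expect the download to take a while. The same `read_lineages` function should also work on a metadata file you downloaded yourself, as long as it is tab-separated and you pass the matching column names.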

I hope this helps.

bagginstyrone · May 14, 2022, 10:27am

What data source would you recommend for a beginner like me who is learning about this for the very first time?
