GISAID database

Mahan.iz · June 30, 2022, 3:37am

GISAID contains a large number of COVID genomes. Why do you select only a few of them (2730 genomes)?

corneliusroemer · June 30, 2022, 12:05pm

I think you’re talking about the global GISAID ncov build?

We subsample because:

- It improves balance across countries that submit different amounts of data; otherwise the tree would be dominated by UK/US/Germany.
- It’s computationally infeasible to run IQ-TREE on large trees (more than ~10k samples).
- The viewing experience gets worse if you have large trees with thousands of samples.

Mer.Explore · July 6, 2022, 7:49am

Hi, there:

I’m wondering if there are more details about the subsampling scheme used for the global GISAID ncov build? Is the data selected balanced and representative enough? Really looking for your reply!

Thank you very much

Best regards,
Y.

trvrb · July 7, 2022, 10:00pm

Hi @Mer.Explore. You can see our subsampling criteria for the global ncov build here: ncov/builds.yaml at master · nextstrain/ncov · GitHub. This aims for 4000 total genomes, split between older samples (collected between Jan 2020 and 6 months ago) and more recent samples (collected within the last 6 months). Each continental region is sampled equally, except for Oceania, which gets one third as many samples as the other regions due to its smaller population size.

Within a region, we aim to sample equitably across space and time. The overall effect is to get roughly equal sample counts across countries and across months.
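
To make the regional arithmetic concrete, here is a minimal sketch of how a 4000-genome target could be divided when every region gets equal weight except Oceania at one third. This is not the actual workflow code: the region names, the `REGION_WEIGHTS` table, and the `allocate_by_region` helper are assumptions made for illustration, and the split between older and recent samples is left out.

```python
from fractions import Fraction

# Assumed set of six continental regions; Oceania is weighted at one third
# of the others, as described in the post above.
REGION_WEIGHTS = {
    "Africa": Fraction(1),
    "Asia": Fraction(1),
    "Europe": Fraction(1),
    "North America": Fraction(1),
    "South America": Fraction(1),
    "Oceania": Fraction(1, 3),
}


def allocate_by_region(total, weights):
    """Divide `total` genomes across regions in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        region: round(total * weight / total_weight)
        for region, weight in weights.items()
    }


targets = allocate_by_region(4000, REGION_WEIGHTS)
for region, count in targets.items():
    print(f"{region}: {count}")
```

Under these assumed weights, the five full-weight regions each receive 750 genomes and Oceania receives 250, which adds up to 4000.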

Mer.Explore · July 12, 2022, 8:26am

Hi @trvrb! Thanks a lot for your reply!! It clarifies a lot! Is this global subsampling data the one that is available on GISAID (the nextregions data in the download module)? If it is, does data in nextregions of different continents follow the same rule when subsampling?

Really looking forward to your reply! And, again, thank you very much!

Best regards,
Y,

trvrb · July 12, 2022, 5:04pm

The samples available via GISAID / Downloads / nextregions are exactly what’s available at nextstrain.org with nextregions / Global corresponding to nextstrain.org/ncov/gisaid/global/6m, nextregions / Africa corresponding to nextstrain.org/ncov/gisaid/africa/6m, etc…

The regional subsampling works a bit differently than the global subsampling. The exact details are in the same GitHub file and can be seen for Africa as region here: github.com/nextstrain/ncov/blob/master/nextstrain_profiles/nextstrain-gisaid/builds.yaml#L251. Basically, this adds a further split: samples from within the focal region are sampled at a greater frequency than samples from outside the region. We have a 4:1 ratio of recent to early samples and a 4:1 ratio of focal region to global context.
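
As a rough illustration of what the two 4:1 ratios imply, the sketch below splits a regional build's budget first by time period and then by focal region versus global context. The total of 4000 is a hypothetical figure (the regional total is not stated here), the nesting order of the two splits is an assumption, and `split_by_ratio` and `regional_budget` are made-up helper names rather than anything from the ncov workflow.

```python
def split_by_ratio(total, first, second):
    """Split `total` into two integer parts in the ratio first:second."""
    first_share = total * first // (first + second)
    return first_share, total - first_share


def regional_budget(total, recent_to_early=(4, 1), focal_to_context=(4, 1)):
    """Split a budget by period, then by focal region vs. global context.

    Assumes the focal/context ratio is applied within each period.
    """
    recent, early = split_by_ratio(total, *recent_to_early)
    budget = {}
    for period, count in (("recent", recent), ("early", early)):
        focal, context = split_by_ratio(count, *focal_to_context)
        budget[(period, "focal")] = focal
        budget[(period, "context")] = context
    return budget


# Hypothetical total; the real regional target may differ.
budget = regional_budget(4000)
for (period, source), count in budget.items():
    print(f"{period:>6} {source:>7}: {count}")
```

With these assumptions, a 4000-genome regional build would hold 2560 recent focal, 640 recent context, 640 early focal, and 160 early context genomes.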

Mer.Explore · July 29, 2022, 8:58am

Thanks a lot!!! It is very helpful!
