GISAID contains a large number of COVID genomes. Why do you select only a few of them (2730 genomes)?
I think you’re talking about the global GISAID ncov build?
We subsample because
- This improves balance across countries that submit different amounts of data, otherwise it would all be dominated by UK/US/Germany
- It’s computationally infeasible to run IQtree on large trees >~10k samples
- The viewing experience gets worse if you have large trees with thousands of samples
Hi, there:
I’m wondering if there are more details about the subsampling scheme used for the global GISAID ncov build? Is the selected data balanced and representative enough? Really looking forward to your reply!
Thank you very much
Best regards,
Y.
Hi @Mer.Explore. You can see our subsampling criteria for the global ncov build here: ncov/builds.yaml at master · nextstrain/ncov · GitHub. This aims for 4000 total genomes split between older samples (between Jan 2020 and 6 months ago) and more recent samples (from 6 months ago to today). Each continental region is sampled equally, except for Oceania, which gets one third as many samples as the other regions due to its smaller population size.
Within a region we aim to sample equitably across space and time. The overall effect is to get roughly equal sample counts across countries and across months.
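To make the arithmetic concrete, here is a rough Python sketch of the allocation described above. It is not the actual ncov workflow code: the region list, the function names `region_targets` and `sample_evenly`, and the record fields `country` and `date` are assumptions made for illustration, and the real criteria live in builds.yaml.

```python
import random
from collections import defaultdict
from fractions import Fraction

# Assumed region list. Every region has equal weight except Oceania,
# which gets one third of the weight of each other region.
REGION_WEIGHTS = {
    "Africa": Fraction(1),
    "Asia": Fraction(1),
    "Europe": Fraction(1),
    "North America": Fraction(1),
    "South America": Fraction(1),
    "Oceania": Fraction(1, 3),
}


def region_targets(total, weights=REGION_WEIGHTS):
    """Split a total genome count across regions in proportion to their weights."""
    weight_sum = sum(weights.values())
    return {region: int(total * w / weight_sum) for region, w in weights.items()}


def sample_evenly(records, per_group, seed=0):
    """Pick up to `per_group` records from every (country, month) group.

    Each record is assumed to be a dict with a "country" key and an
    ISO-formatted "date" key such as "2021-03-14".
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["country"], rec["date"][:7])].append(rec)
    rng = random.Random(seed)
    picked = []
    for key in sorted(groups):
        members = groups[key]
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked


targets = region_targets(4000)
```

With a 4000-genome target, this gives 750 genomes for each of the five larger regions and 250 for Oceania. Capping each (country, month) group at the same size is what stops heavily sequenced countries from dominating a region.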
Hi @trvrb! Thanks a lot for your reply!! It clarifies a lot! Is this global subsampling data the one that is available on GISAID (the nextregions data in the download module)? If it is, does data in nextregions of different continents follow the same rule when subsampling?
Really looking forward to your reply! And, again, thank you very much!
Best regards,
Y,
The samples available via GISAID / Downloads / nextregions are exactly what’s available at nextstrain.org with nextregions / Global corresponding to nextstrain.org/ncov/gisaid/global/6m, nextregions / Africa corresponding to nextstrain.org/ncov/gisaid/africa/6m, etc…
The regional subsampling works a bit differently than the global subsampling. The exact details are in the same GitHub file and can be seen for Africa as the focal region here: github.com/nextstrain/ncov/blob/master/nextstrain_profiles/nextstrain-gisaid/builds.yaml#L251. Basically this adds a further split, with samples from within the region sampled at greater frequency than samples from outside the region. We have a 4:1 ratio of recent to early and a 4:1 ratio of focal region to global context.
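As a quick illustration of how the two 4:1 ratios could combine, here is a small Python sketch. The function name `regional_targets` and the total of 1000 genomes are hypothetical, and it assumes the focal/context ratio is applied separately within the early and recent buckets; check builds.yaml for the actual numbers.

```python
def regional_targets(total, recent_to_early=4, focal_to_context=4):
    """Split a total into early/recent buckets, then focal/context within each.

    Assumes both ratios are applied independently, which is an
    illustrative simplification rather than the workflow's exact logic.
    """
    recent = total * recent_to_early // (recent_to_early + 1)
    early = total - recent

    def split(n):
        focal = n * focal_to_context // (focal_to_context + 1)
        return {"focal": focal, "context": n - focal}

    return {"early": split(early), "recent": split(recent)}


# Hypothetical total; the real per-build targets are set in builds.yaml.
example = regional_targets(1000)
```

For a hypothetical total of 1000, this yields 640 recent focal, 160 recent context, 160 early focal, and 40 early context genomes.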
Thanks a lot!!! It is very helpful!