Setting up context sequence set

l.hughes · May 10, 2021, 1:04pm

Perhaps a bit of a strange question, but I note that Nextstrain maintains a set of context sequences for use via GISAID. I was wondering whether it would be possible for you to describe exactly how this context sequence set is generated from GISAID data? Is it just a random global subsampling, or is it more structured than this?

Thank you for all your help!

james · May 10, 2021, 9:15pm

Hi @l.hughes – in order to produce our global and 6 regional builds (available here) we subsample the GISAID dataset according to these rules. Briefly, for the global dataset we partition the data into groups based on geography and sampling date, and then sample from those groups randomly. For the regional builds, we take a similar approach for samples from that region and then prioritise out-of-region samples based on genetic proximity to the in-region samples. We also split sequences based on an (arbitrary) date of 4 months ago, and tend to take more sequences from the more recent group here. Finally, we have a small force-include list, largely for tree rooting purposes. Note that the inherent stochasticity of this approach means that each time we generate a dataset it will have different samples in it.
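As a rough illustration of the global scheme described above, here is a minimal Python sketch. It is not the actual Nextstrain workflow code: the function name `subsample`, the record fields (`strain`, `region`, `date`), the use of calendar month as the date grouping, and the per-group quotas are all assumptions made for the example. It partitions records by recency, region and month, draws randomly from each group with a larger quota for the recent groups, and then adds a force-include list. The genetic-proximity prioritisation used for regional builds is not shown.

```python
import random
from collections import defaultdict
from datetime import date


def subsample(records, per_group_recent, per_group_early, cutoff,
              force_include=(), seed=None):
    """Grouped random subsampling (illustrative only).

    records: list of dicts with hypothetical keys 'strain', 'region'
             and 'date' (a datetime.date).
    cutoff:  date separating the 'early' and 'recent' groups.
    Returns a sorted list of selected strain names.
    """
    rng = random.Random(seed)

    # Partition by recency, geography and sampling month.
    groups = defaultdict(list)
    for rec in records:
        period = "recent" if rec["date"] >= cutoff else "early"
        key = (period, rec["region"], rec["date"].year, rec["date"].month)
        groups[key].append(rec)

    # Sample randomly within each group; recent groups get their own quota.
    chosen = set()
    for key in sorted(groups):
        members = groups[key]
        quota = per_group_recent if key[0] == "recent" else per_group_early
        for rec in rng.sample(members, min(quota, len(members))):
            chosen.add(rec["strain"])

    # Always keep force-included strains that exist in the input.
    available = {rec["strain"] for rec in records}
    chosen.update(name for name in force_include if name in available)

    return sorted(chosen)
```

Calling it with `seed=None` gives a different selection on each run, which mirrors the stochasticity mentioned above; passing a fixed seed makes a run repeatable. A cutoff of roughly four months before the run date would match the split described in the post.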
