Setting up context sequence set

Perhaps a bit of a strange question, but I note that Nextstrain maintains a set of context sequences for use via GISAID. I was wondering whether it would be possible for you to describe exactly how this context sequence set is generated from GISAID data? Is it just a random global subsampling, or is it more structured than this?

Thank you for all your help!

Hi @l.hughes, to produce our global and six regional builds (available here) we subsample the GISAID dataset according to these rules. Briefly, for the global dataset we partition the data into groups based on geography and sampling date, and then sample randomly from each group. For the regional builds, we take a similar approach for samples from that region and then prioritise out-of-region samples based on genetic proximity to the in-region samples. We also split sequences at an (arbitrary) cutoff date of 4 months ago, and tend to take more sequences from the more recent group. Finally, we have a small force-include list, largely for tree rooting purposes. Note that the inherent stochasticity of this approach means that each time we generate a dataset it will contain different samples.
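
To make the grouping idea concrete, here is a rough Python sketch of that kind of stratified subsampling. It is only an illustration and not the code the builds actually run: the function name `subsample`, the record fields (`strain`, `region`, `date`), the per-group quotas, and the choice to add force-included strains on top of the quotas are all assumptions made for the example. It also leaves out the genetic-proximity prioritisation used for the regional builds.

```python
import random
from collections import defaultdict
from datetime import date


def subsample(records, per_group_recent, per_group_older, cutoff,
              force_include=(), seed=None):
    """Pick strain names by random sampling within (region, year, month) groups.

    records: list of dicts with hypothetical keys 'strain', 'region' and
    'date' (a datetime.date). Groups dated on or after `cutoff` get the
    larger `per_group_recent` quota; older groups get `per_group_older`.
    """
    rng = random.Random(seed)

    # Partition by geography and sampling month, split at the cutoff date.
    groups = defaultdict(list)
    for rec in records:
        is_recent = rec["date"] >= cutoff
        key = (rec["region"], rec["date"].year, rec["date"].month, is_recent)
        groups[key].append(rec)

    # Force-included strains are always kept (assumption: on top of quotas).
    forced = set(force_include)
    chosen = {rec["strain"] for rec in records if rec["strain"] in forced}

    # Sample randomly within each group, up to that group's quota.
    for key in sorted(groups):
        is_recent = key[3]
        quota = per_group_recent if is_recent else per_group_older
        candidates = [r for r in groups[key] if r["strain"] not in chosen]
        k = min(quota, len(candidates))
        chosen.update(r["strain"] for r in rng.sample(candidates, k))

    return chosen
```

Leaving `seed` unset gives a different selection on each run, which matches the stochastic behaviour described above; passing a fixed seed makes a particular draw reproducible.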
