Subsampling and Data Download

Hello,

I am working on setting up a local build for my state PHL and would greatly appreciate any insight from the Nextstrain team and/or individuals who have set up local builds for their region/state.

My first question is what exactly is happening during the subsampling step for the global builds and the regional/state-specific builds available on the Nextstrain website. For example, the Iowa-focused subsampling build maintained by CDC/AMD has 177 Iowa isolates, whereas GISAID has roughly 250 for Iowa for the same time frame, even after filtering for complete and high-coverage sequences. What additional filtering might be going on here?

Secondly, is there a way to easily download a multi-FASTA file from Nextstrain for a given build? I see the option to download the metadata, but was wondering if there is at least a way to then grab the sequences from GISAID if their user agreement prevents downloading FASTA files from Nextstrain directly. My reason for doing this is that I would like a current representative sample of national/international sequences to provide context for the local sequences I would include in my local build.

Finally, for those who may have already solved the above issue: how often are you downloading a new set of national/international sequences for your local build?

Thanks,
Wes

These are great questions, @whottel! I’ll try to answer the first two and leave the last for other PHL folks.

My first question is what exactly is happening during the subsampling step for the global builds and the regional/state-specific builds available on the Nextstrain website.

The subsampling rules for the CDC/AMD builds are the same across all states. These builds try to use 1,000 samples per focal state, 800 samples from the rest of the USA, and 800 samples from the rest of the world. Looking at GISAID metadata and sequences I have locally (about a week old now), I see 238 samples with division='Iowa' in metadata and 211 that pass the current filter step. I’m not sure why there are only 163 Iowan samples in the latest CDC/AMD build though. One possibility is that some samples are dropped by the refine rule’s clock filter for having more mutations than expected by the clock rate.
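To make the three caps concrete, here is a minimal sketch of capped subsampling in Python. It is not the actual CDC/AMD workflow: the function names, the "country"/"division" record fields, and the uniform random choice within each bucket are all assumptions for illustration, and the real builds may group or prioritize samples differently.

```python
import random

# Caps taken from the description above: focal state, rest of USA, rest of world.
CAPS = {"focal": 1000, "usa": 800, "world": 800}


def bucket(record, focal_division):
    """Assign a metadata record to one of the three buckets (hypothetical helper)."""
    if record["country"] == "USA":
        return "focal" if record["division"] == focal_division else "usa"
    return "world"


def subsample(records, focal_division, caps=CAPS, seed=0):
    """Pick at most caps[name] records per bucket, uniformly at random."""
    rng = random.Random(seed)
    groups = {name: [] for name in caps}
    for record in records:
        groups[bucket(record, focal_division)].append(record)
    picked = {}
    for name, members in groups.items():
        # A bucket smaller than its cap is kept whole.
        picked[name] = rng.sample(members, min(caps[name], len(members)))
    return picked


# Made-up records whose counts mirror the numbers discussed above.
records = (
    [{"country": "USA", "division": "Iowa"} for _ in range(238)]
    + [{"country": "USA", "division": "Other"} for _ in range(2000)]
    + [{"country": "Elsewhere", "division": "Unknown"} for _ in range(5000)]
)
picked = subsample(records, "Iowa")
print({name: len(members) for name, members in picked.items()})
```

Under these assumptions, a focal state with 238 records is well below the 1,000-sample cap, so the cap itself would not remove any Iowa samples. Any shortfall would have to come from filtering steps before or after subsampling.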

The subsampling rules for Nextstrain’s global and regional builds are much more complicated. These rules subsample by time and geography, taking fewer samples per region from more than 4 months ago and more samples per region from the last 4 months. This approach tries to balance the need for a reasonably sized tree against the need for historical context and as many recent samples as possible.
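The idea of sampling by time and geography can be sketched roughly as follows. This is only an illustration, not the real Nextstrain rules: the function name, the "region" and "date" fields, the 120-day stand-in for "4 months", and the cap values in the example are all assumptions.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def subsample_by_time_and_region(records, today, early_cap, recent_cap, seed=0):
    """Keep up to early_cap older and recent_cap newer records per region."""
    cutoff = today - timedelta(days=120)  # rough stand-in for "4 months"
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        period = "recent" if record["date"] >= cutoff else "early"
        groups[(record["region"], period)].append(record)
    picked = []
    for (region, period), members in sorted(groups.items()):
        cap = recent_cap if period == "recent" else early_cap
        picked.extend(rng.sample(members, min(cap, len(members))))
    return picked


# Made-up records: 10 older and 10 newer samples in each of two regions.
today = date(2020, 12, 1)
records = []
for region in ("Europe", "Asia"):
    records += [{"region": region, "date": date(2020, 3, 1)} for _ in range(10)]
    records += [{"region": region, "date": date(2020, 11, 1)} for _ in range(10)]

picked = subsample_by_time_and_region(records, today, early_cap=2, recent_cap=5)
print(len(picked))
```

With a smaller cap for the older period than for the recent one, each region contributes a little historical context and proportionally more recent samples.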

Secondly, is there a way to easily download a multi-FASTA file from Nextstrain for a given build?

We can’t provide direct links to sequences used for builds, but you can download the metadata file that you mentioned and use the “gisaid_epi_isl” column to select the records you want from the GISAID search interface.
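As a rough sketch of that step, the snippet below pulls the accession IDs out of a downloaded metadata file so they can be used to look up records in GISAID. It assumes the file is tab-separated with a header row; the function name, the treatment of "?" as a missing value, and the example rows (including the accession IDs) are made up for illustration.

```python
import csv
import io


def read_accessions(handle, column="gisaid_epi_isl"):
    """Return unique, non-empty accession IDs in file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = set()
    accessions = []
    for row in reader:
        value = (row.get(column) or "").strip()
        # Skip blanks and "?" (assumed here to mark an unknown accession).
        if value and value != "?" and value not in seen:
            seen.add(value)
            accessions.append(value)
    return accessions


# Made-up example standing in for a downloaded metadata file.
example = io.StringIO(
    "strain\tgisaid_epi_isl\tdivision\n"
    "sample-A\tEPI_ISL_000001\tIowa\n"
    "sample-B\tEPI_ISL_000002\tIowa\n"
    "sample-C\t?\tIowa\n"
    "sample-D\tEPI_ISL_000001\tIowa\n"
)
ids = read_accessions(example)
print("\n".join(ids))
```

For a real file, the same function can be called on an opened file handle, for example `with open(path, newline="") as handle:`, where `path` points to the metadata file downloaded from the build page.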

We are also discussing ways we could provide a collection of global context sequences for users to download. We haven’t quite figured out how that will work though.

@jlhudd Thanks for the explanation!

-Wes