Is there a maximum advised tree/sample size?

l.hughes · March 18, 2021, 7:56am

Hello,
I am just starting work on setting up a new local build. I am likely to have access to a high number of sequences for different divisions of the country in the near future.

I understand that there are subsampling protocols in the workflow, but I wonder if there are some recommendations about the maximum number of samples/sequences that can be included overall? Also, what might be the best strategy for determining an appropriate subsampling protocol for a country?

Thank you for your help! Liane

jlhudd · March 19, 2021, 5:53pm

Welcome @l.hughes! Technically, there isn’t a hard limit on how many samples you include in your tree, but we’ve found that with more than 10,000 samples our builds take much longer to run and Auspice (the visualization tool that powers nextstrain.org’s pathogen views) slows down to the point where it is nearly unusable. The number of samples you include depends on the questions you’re trying to answer with your tree(s), so even though 10,000 might be an upper bound, you may only need a fraction of that many samples to answer your questions. For the Nextstrain SARS-CoV-2 builds, we target ~4,000 samples per tree.

Choosing a subsampling protocol is a major open question in the field. One common approach for region-specific analyses (e.g., country- or state-level analyses) is to maximize the number of samples from the region of interest and then add as many contextual samples from the rest of the world as you can until you reach your preferred maximum number of samples. For example, see the CDC/SPHERES state- and territory-level subsampling scheme where each build requests 1,000 samples per focal state/territory, 800 contextual samples from the rest of the USA, and 800 contextual samples from the rest of the world. The Nextstrain builds use a more complicated temporal and geographic subsampling that selects fewer sequences from earlier in the pandemic and more from recent months. In both of these cases, the specific subsampling protocol reflects a process of trial and error based on expert review.
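To make the focal-plus-context idea concrete, here is a minimal Python sketch of a three-tier random draw using the 1,000 / 800 / 800 counts mentioned above. It is only an illustration, not how the CDC/SPHERES or Nextstrain workflows are implemented: the function name `subsample_by_tier` and the record keys `strain`, `country`, and `division` are assumptions made for this example, and the draw is uniform within each tier, with none of the temporal or geographic weighting the real builds apply.

```python
import random


def subsample_by_tier(records, focal_division, focal_country="USA",
                      n_focal=1000, n_country=800, n_global=800, seed=0):
    """Split records into focal / same-country / rest-of-world tiers and
    draw a uniform random sample from each tier.

    `records` is a list of dicts with the (assumed) keys
    'strain', 'country', and 'division'.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    focal, country, world = [], [], []

    for rec in records:
        if rec["country"] == focal_country and rec["division"] == focal_division:
            focal.append(rec)
        elif rec["country"] == focal_country:
            country.append(rec)
        else:
            world.append(rec)

    def take(pool, n):
        # If a tier has fewer records than requested, keep all of them.
        return rng.sample(pool, min(n, len(pool)))

    return {
        "focal": take(focal, n_focal),
        "country_context": take(country, n_country),
        "global_context": take(world, n_global),
    }
```

Calling `subsample_by_tier(records, "Washington")` would return up to 2,600 records in total, which stays well under the ~10,000-sample range where builds and Auspice start to slow down. In practice you would likely adjust the three counts, and possibly group by month within each tier, based on the questions you want the tree to answer.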

The subsampling question is one that @alliblk and @wcassias have both been thinking about a lot, too, and they may have additional suggestions.
