Welcome @l.hughes! Technically, there isn’t a hard limit on how many samples you include in your tree, but we’ve found that with more than 10,000 samples our builds take much longer to run and Auspice (the visualization tool that powers nextstrain.org’s pathogen views) slows down to the point where it is nearly unusable. The right number of samples depends on the questions you’re trying to answer with your tree(s), so even though 10,000 is a practical upper bound, you may only need a fraction of that. For the Nextstrain SARS-CoV-2 builds, we target ~4,000 samples per tree.
Choosing a subsampling protocol is a major open question in the field. One common approach for region-specific analyses (e.g., country- or state-level analyses) is to maximize the number of samples from the region of interest and then add as many contextual samples from the rest of the world as you can until you reach your maximum preferred number of samples. For example, see the CDC/SPHERES state- and territory-level subsampling scheme, where each build requests 1,000 samples per focal state/territory, 800 contextual samples from the rest of the USA, and 800 contextual samples from the rest of the world. The Nextstrain builds use a more complicated temporal and geographic subsampling scheme that selects fewer sequences from earlier in the pandemic and more from recent months. In both of these cases, the specific subsampling protocol reflects a process of trial and error based on expert review.
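To make the focal-plus-context idea concrete, here is a minimal Python sketch of a three-tier scheme using the 1,000 / 800 / 800 counts mentioned above. This is only an illustration, not the actual CDC/SPHERES or Nextstrain implementation: the `subsample` function, the `country` and `division` metadata field names, and the choice of uniform random sampling within each tier are all assumptions on my part.

```python
import random


def subsample(records, focal_division, country="USA",
              n_focal=1000, n_country=800, n_global=800, seed=0):
    """Pick focal, national-context, and global-context samples.

    `records` is a list of dicts with (assumed) keys "country" and
    "division". Sampling within each tier is uniform at random, which
    is an illustrative simplification.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Tier 1: samples from the focal state/territory.
    focal = [r for r in records
             if r["country"] == country and r["division"] == focal_division]
    # Tier 2: context from the rest of the same country.
    national = [r for r in records
                if r["country"] == country and r["division"] != focal_division]
    # Tier 3: context from the rest of the world.
    world = [r for r in records if r["country"] != country]

    def pick(pool, n):
        # Take at most n samples; if the pool is smaller, take all of it.
        return rng.sample(pool, min(n, len(pool)))

    return pick(focal, n_focal) + pick(national, n_country) + pick(world, n_global)
```

With the default counts this caps a build at 2,600 samples, comfortably below the ~10,000 range where things start to slow down. In practice you would probably want to sample within each tier by time and location rather than uniformly, which is where the trial and error comes in.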
The subsampling question is one that @alliblk and @wcassias have both been thinking about a lot, too, and they may have additional suggestions.