Global subsampling question

Hi Nextstrain Community,
Thanks for all the continuous improvements on your site and tools. For your global ncov dataset on your website, do you use reproducible or random subsampling?
Second question: if it’s reproducible, should you get the same tree each time you run the pipeline? Is my understanding correct that even if there are new sequences in GISAID, you’ll get the same tree?
Third question, if you use random subsampling, won’t you get a slightly different tree each time?
Thanks,
Saje

Our subsampling procedure has an element of randomness. You can fix the seed of the random number generator via the flag --subsample-seed, but the result will only be reproducible on the exact same input data.
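
To illustrate why a fixed seed only helps on identical input, here is a minimal Python sketch. It is not the actual augur code: the `subsample` function, the strain names, and the sizes are all made up for the example, and it uses plain uniform sampling rather than the grouped subsampling a real build would use.

```python
import random


def subsample(strains, k, seed=None):
    """Pick k strain names at random (hypothetical helper, not augur's API).

    With a fixed seed the pick is repeatable, but only for the same input.
    """
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order of the input.
    return sorted(rng.sample(sorted(strains), k))


# Made-up inputs: the second day's data contains everything from the first
# day plus 20 new sequences.
monday = [f"strain_{i:03d}" for i in range(100)]
tuesday = monday + [f"strain_{i:03d}" for i in range(100, 120)]

same_a = subsample(monday, 10, seed=42)
same_b = subsample(monday, 10, seed=42)
grown = subsample(tuesday, 10, seed=42)

print(same_a == same_b)  # True: same seed, same input
print(len(set(same_a) & set(grown)), "of 10 strains shared after the input grew")
```

The first comparison always holds. The second number is not guaranteed to be 10: once the input changes, the same seed can select a different set of strains.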

Even with the same sample, the result might not be exactly the same, since tree reconstruction involves stochastic search heuristics.

We do get slightly different trees each time.

hope this helps,
richard

Thanks for the response, Richard. That was helpful.

For your ncov global data set on your website, do you use --subsample-seed?

Even if you are using the seed, I assume many samples that were present in the previous tree may be absent from the next build, since the input dataset grows each time?

No, we don’t use the seed, so trees will differ from day to day. And even if we did, some sequences would drop out as more data become available.
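
If you want to see how much two builds differ, one option is to compare the sets of strain names in each tree. The sketch below assumes you have already extracted the strain names from each build yourself; `compare_builds` and the strain names are hypothetical and not part of any Nextstrain tool.

```python
def compare_builds(previous, current):
    """Report which strains were kept, dropped, or added between two builds.

    Hypothetical helper: both arguments are iterables of strain names.
    """
    previous, current = set(previous), set(current)
    return {
        "kept": sorted(previous & current),
        "dropped": sorted(previous - current),
        "added": sorted(current - previous),
    }


# Made-up strain lists standing in for two consecutive daily builds.
yesterday = ["strain_A", "strain_B", "strain_C", "strain_D"]
today = ["strain_B", "strain_D", "strain_E", "strain_F"]

diff = compare_builds(yesterday, today)
print(diff["dropped"])  # ['strain_A', 'strain_C']
print(diff["added"])    # ['strain_E', 'strain_F']
```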