Data displayed on nextstrain.org vs downloadable data

Hello,

Thank you for all the amazing work you do!

I recently downloaded the dengue data from Nextstrain (https://data.nextstrain.org/files/workflows/dengue/sequences_all.fasta.zst). The number of sequences there is around 52000, while on the website (Auspice) only around 4000 sequences seem to be used to construct the tree. Additionally, on the website, there are more sequences labeled as denv1 and denv2 than there are sequences when selecting all serotypes. Is this just a display error, or is the data used for the tree different from the data available for download?

Hello,

Thanks for posting! Yes, the data used for the tree is different from the data available for download. The downloadable dengue data includes all curated records. However, large trees slow the website down, so to keep it responsive we subsample those records to approximately 4000 sequences for the tree visualization.

The subsampling is configured in defaults/config_dengue.yaml with the following criteria:

  • Subsample approximately 4000 sequences
  • Keep only sequences longer than a minimum length (5000 nt for genome, 1000 nt for E gene)
  • Sample across time and geolocation (grouped by year and region)
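To make the idea concrete, here is a rough Python sketch of length filtering followed by grouped subsampling. It is only an illustration and not the workflow's actual implementation: the `subsample` function, the record fields (`length`, `year`, `region`), and the even per-group quota are all assumptions made for this example.

```python
import random
from collections import defaultdict


def subsample(records, min_length, target, seed=0):
    """Drop short sequences, then draw roughly evenly from (year, region) groups.

    `records` is a list of dicts with hypothetical keys
    "length", "year", and "region".
    """
    rng = random.Random(seed)

    # Step 1: keep only records that meet the minimum length.
    groups = defaultdict(list)
    for rec in records:
        if rec["length"] >= min_length:
            groups[(rec["year"], rec["region"])].append(rec)
    if not groups:
        return []

    # Step 2: give every (year, region) group the same quota.
    per_group = max(1, target // len(groups))

    # Step 3: draw at random within each group, capped by group size.
    picked = []
    for key in sorted(groups):
        members = groups[key]
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked
```

Because small groups cannot fill their quota, the result can come in below the target, which is one reason the final count is only approximate.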

The subsampling is probabilistic, so the exact number of sequences may vary between runs. For denv4, however, only 1540 sequences meet the 5000 nt length requirement (based on the metadata), so fewer sequences are included than for denv1 or denv2.
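If you want to check how many records could pass the length filter for each serotype, something like the sketch below should work on a tab-separated metadata table. The column names `serotype` and `length` are assumptions for this example, so adjust them to whatever your metadata file actually uses.

```python
import csv
import io
from collections import Counter


def count_passing(tsv_text, min_length, length_col="length", group_col="serotype"):
    """Count rows per group whose length is at least `min_length`.

    Column names are hypothetical defaults; rows with a missing or
    non-numeric length are skipped.
    """
    counts = Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        value = (row.get(length_col) or "").strip()
        if value.isdigit() and int(value) >= min_length:
            counts[row[group_col]] += 1
    return counts
```

The per-serotype count is an upper bound on what subsampling can return, so a serotype with few long sequences will always end up with a smaller tree.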

Hope this helps! Let me know if you have any other questions or would like more clarification.

That helps, thank you!