Data displayed on nextstrain.org vs downloadable data

samf · June 4, 2025, 7:55am

Hello,

Thank you for all the amazing work you do!

I recently downloaded the dengue data from Nextstrain (https://data.nextstrain.org/files/workflows/dengue/sequences_all.fasta.zst). That file contains around 52,000 sequences, while on the website (Auspice) only around 4,000 sequences seem to be used to construct the tree. Additionally, on the website, more sequences are labeled as denv1 and denv2 than are shown when selecting all serotypes. Is this just a display error, or is the data used for the tree different from the data available for download?

quietjen · June 4, 2025, 4:23pm

Hello,

Thanks for posting! Yes, the data used for the tree is different from the data available for download. The downloadable dengue data includes all curated records. However, because the website slows down on large trees, we subsample those records to approximately 4000 sequences for the tree visualization to keep it responsive.

The subsampling is configured in defaults/config_dengue.yaml with the following criteria:

Subsample approximately 4000 sequences such that:
- sequences are longer than a minimum length (5000 nt for genome, 1000 nt for E gene), and
- they are sampled across time and geolocation (by year and region).

The subsampling is probabilistic, so the exact number of sequences may vary between runs. In the case of denv4, however, only 1540 sequences meet the 5000 nt length requirement (based on the metadata), so fewer sequences are included than for denv1 or denv2.
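
To make the idea concrete, here is a minimal Python sketch of length-filtered, grouped subsampling. It is not the code the dengue workflow runs: the function name `subsample` and the record fields (`strain`, `length`, `year`, `region`) are assumptions made for illustration, and it uses a simple round-robin draw across (year, region) groups, so unlike the probabilistic subsampling described above it returns a fixed number of records for a given input.

```python
import random
from collections import defaultdict


def subsample(records, max_sequences, min_length, seed=None):
    """Pick up to max_sequences records, spread across (year, region) groups.

    records is a list of dicts with the hypothetical keys
    'strain', 'length', 'year', and 'region'.
    """
    rng = random.Random(seed)

    # Step 1: drop records shorter than the minimum length,
    # and bucket the rest by (year, region).
    groups = defaultdict(list)
    for rec in records:
        if rec["length"] >= min_length:
            groups[(rec["year"], rec["region"])].append(rec)

    # Step 2: shuffle each bucket so the draw within a group is random.
    for members in groups.values():
        rng.shuffle(members)

    # Step 3: take one record per group in turn until the target is
    # reached or every group is exhausted.
    selected = []
    keys = sorted(groups)
    while keys and len(selected) < max_sequences:
        still_available = []
        for key in keys:
            if len(selected) >= max_sequences:
                break
            selected.append(groups[key].pop())
            if groups[key]:
                still_available.append(key)
        keys = still_available
    return selected
```

Calling `subsample(records, max_sequences=4000, min_length=5000)` on a set where only 1540 records pass the length filter returns 1540 records, not 4000, which is the same effect seen for denv4.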

Hope this helps! Let me know if you have any other questions or would like more clarification.

samf · June 5, 2025, 7:22am

That helps, thank you!
