I recently downloaded the dengue data from Nextstrain (https://data.nextstrain.org/files/workflows/dengue/sequences_all.fasta.zst). That file contains around 52000 sequences, while on the website (auspice) only around 4000 sequences seem to be used to construct the tree. Additionally, on the website, the denv1 and denv2 views each show more sequences than the view that combines all serotypes. Is this just a display error, or is the data used for the tree different from the data available for download?
Thanks for posting! Yes, the data used for the tree is different from the data available for download. The downloadable dengue data includes all curated records. However, to keep the website responsive (it slows down on large trees), we subsample those records to approximately 4000 sequences for the tree visualization. The subsampling is set up so that:
- sequences are longer than a minimum length (5000nt for genome, 1000nt for E gene), and
- they are sampled across time and geolocation (by year and region).
The subsampling is probabilistic, so the exact number of sequences may vary between runs. The count per serotype also depends on how many sequences pass the length filter. For denv4, for example, only 1540 sequences meet the 5000nt length requirement (based on the metadata), so fewer samples are included than for denv1 or denv2.
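To make the idea concrete, here is a minimal Python sketch of this kind of subsampling. It is only an illustration, not the code the dengue workflow actually runs: the function name `subsample`, the record keys `length`, `year`, and `region`, and the probabilistic rounding rule are all assumptions made for this example.

```python
import random
from collections import defaultdict


def subsample(records, min_length, max_sequences, seed=None):
    """Illustrative only: filter by length, then sample evenly across (year, region).

    `records` is a list of dicts with hypothetical keys 'length', 'year', 'region'.
    """
    rng = random.Random(seed)

    # Step 1: drop sequences shorter than the minimum length.
    eligible = [r for r in records if r["length"] >= min_length]

    # Step 2: group the remaining sequences by year and region.
    groups = defaultdict(list)
    for r in eligible:
        groups[(r["year"], r["region"])].append(r)
    if not groups:
        return []

    # Step 3: spread the target evenly over groups. The per-group target can be
    # fractional, so round it up or down at random (assumed rule for this sketch).
    per_group = max_sequences / len(groups)
    selected = []
    for members in groups.values():
        n = int(per_group)
        if rng.random() < per_group - n:
            n += 1
        # A group can never contribute more sequences than it contains.
        n = min(n, len(members))
        selected.extend(rng.sample(members, n))
    return selected


# Hypothetical records: two well-populated groups plus two short sequences.
records = (
    [{"length": 10700, "year": 2019, "region": "Asia"} for _ in range(3)]
    + [{"length": 10600, "year": 2020, "region": "Africa"} for _ in range(3)]
    + [{"length": 1500, "year": 2020, "region": "Asia"} for _ in range(2)]
)
picked = subsample(records, min_length=5000, max_sequences=4, seed=1)
print(len(picked))
```

Two things follow from this sketch. When the per-group target is fractional, the random rounding makes the total differ from run to run. And when few sequences pass the length filter, groups run out of members, so the total falls short of the target.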
Hope this helps! Let me know if you have any other questions or would like more clarification.