Inconclusive data by country

I am trying to see all SARS-CoV-2 strains sampled in Belarus, and it puzzles me that the global dataset contains fewer points than the more local ones.

Europe Belarus, 9 points, https://nextstrain.org/ncov/gisaid/europe?f_country=Belarus
Global Belarus, 3 points, https://nextstrain.org/ncov/gisaid/global?f_country=Belarus
neherlab Belarus, 39 data points, https://nextstrain.org/groups/neherlab/ncov/belarus?f_country=Belarus

Why so much difference?
Where to get the full picture?

(sorry for the low quality markup and the image - I can only post one screen and two links as a new user)

There is a maximum size for the trees to keep them viewable and interpretable (around 3000 samples), so most of the trees you see have been subsampled. The larger the region a tree covers, the heavier the subsampling has to be, so the global tree has gone through heavier subsampling than the Europe tree. I don’t know about subsampling in the Neher Lab tree, since it has only 1920 samples in total.
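To see why a broader tree leaves fewer points per country, here is a toy sketch. It is not Nextstrain's actual subsampling code: the function name `subsample_by_group`, the even per-country budget, and the sample counts are all made up for illustration.

```python
import random
from collections import defaultdict


def subsample_by_group(samples, key, max_total, seed=0):
    """Toy subsampling: split a fixed budget evenly across groups.

    Assumes max_total is at least the number of groups.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    if not groups:
        return []
    per_group = max_total // len(groups)
    kept = []
    for name in sorted(groups):
        members = groups[name]
        # Small groups are kept whole; large groups are cut down to the budget.
        kept.extend(rng.sample(members, min(per_group, len(members))))
    return kept


def make_samples(n_other_countries, per_country=500, belarus=39):
    """Build made-up records: many large countries plus a small Belarus set."""
    samples = [
        {"country": f"country_{i}", "strain": f"c{i}_s{j}"}
        for i in range(n_other_countries)
        for j in range(per_country)
    ]
    samples += [{"country": "Belarus", "strain": f"by_s{j}"} for j in range(belarus)]
    return samples


def belarus_count(samples):
    """Count how many of the given records come from Belarus."""
    return sum(1 for s in samples if s["country"] == "Belarus")


# Same budget, different scope: more countries means fewer points per country.
regional = subsample_by_group(make_samples(10), "country", max_total=300)
world = subsample_by_group(make_samples(100), "country", max_total=300)
print(belarus_count(regional), belarus_count(world))
```

With these made-up numbers, Belarus keeps 27 points in the regional run and only 2 in the world run, even though the underlying data is the same.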

If you go to GISAID and filter by country, you can see how many publicly available samples there are. GISAID also reports lineages.

In terms of an existing tree build that contains all Belarus data, I will leave this question to others to answer.

Thanks for the detailed info.

It would help to know how many points are missing from the displayed data. My goal was to see how many variants of Delta were captured in Belarus, and the global subsampling failed to cover the Delta strain here at all.

It is also not clear: if the limit is 3000, why is the subsampling not extended when a more specific filter is selected?

I feel a tree is not the best approach for your purpose. I would go to GISAID, search by location, download the metadata for all sequences from Belarus, and count how many are Delta.
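Once you have a metadata file, the counting step could look roughly like this. The column names `country` and `pango_lineage` are assumptions, so check them against the header of the file you actually download. Delta is taken here as Pango lineage B.1.617.2 plus its AY.* sublineages.

```python
import csv
from collections import Counter


def is_delta(lineage):
    """Return True for B.1.617.2 and its AY.* sublineages."""
    return lineage == "B.1.617.2" or lineage.startswith("AY.")


def count_delta_lineages(tsv_file, country,
                         country_col="country", lineage_col="pango_lineage"):
    """Count Delta lineages for one country in a tab-separated metadata file.

    tsv_file is an open text file (or any iterable of lines) with a header row.
    The default column names are assumptions, not a documented format.
    """
    counts = Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row[country_col] == country and is_delta(row[lineage_col]):
            counts[row[lineage_col]] += 1
    return counts


# Usage, with a hypothetical file name:
# with open("metadata.tsv", newline="") as f:
#     counts = count_delta_lineages(f, "Belarus")
#     print(sum(counts.values()), dict(counts))
```

The result maps each Delta sublineage to its number of sequences, so the number of distinct keys tells you how many Delta variants were captured.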

I filled in the registration form on GISAID and confirmed my registration on 16 Mar 2020, but my account is still not activated. I guess their open access is only for privileged institutions, so I am using Nextstrain as my source of data about what’s going on. I thought maybe I could get something more interesting than https://city.opendata.by/ for our open data community.

Try NCBI: NCBI Virus
Or the European Nucleotide Archive (ENA): ENA Browser
(NCBI imports from ENA, so don’t add their data together.)

(This is getting off topic for Nextstrain. If you need more help, prob try on NCBI or ENA sites. Good luck!!)

@abitrolly, if you registered this long ago, your registration may just be lost - there was a lot going on in March 2020. You might want to try registering again, or using the contact form to get in touch and explain the situation. (Of course this is assuming you haven’t tried this already - I’m sorry if you have and it hasn’t worked!)

Because GISAID does not provide open access to its database, it will be hard for people to validate the dataset and for me to keep it updated. A data export from Nextstrain with IDs and dates would be enough for my purpose, if the data were complete.

I am not sure there is an exact term for complete data. For me, such data should come with metrics whenever subsampling is involved or multiple databases with duplicate entries are combined. The metrics I am interested in are how many data points the Nextstrain dataset contains in total, and how many are selected or excluded by filters and subsampling constraints. If there are multiple databases like GISAID, NCBI, and ENA, I would also like to know how data points from them are reconciled (merged, with duplicates removed).
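A rough sketch of the kind of report I mean, with made-up record fields. The function `reconcile` and the `strain` key are hypothetical, and matching on strain name alone is an assumption: real databases may name the same sequence differently.

```python
def reconcile(sources, key="strain"):
    """Merge records from several sources and report what happened.

    sources maps a source name to a list of dict records. The first record
    seen for a given key wins; later ones are counted as duplicates.
    """
    merged = {}
    stats = {"per_source": {}, "duplicates": 0}
    for name, records in sources.items():
        stats["per_source"][name] = len(records)
        for record in records:
            if record[key] in merged:
                stats["duplicates"] += 1
            else:
                merged[record[key]] = record
    stats["raw_total"] = sum(stats["per_source"].values())
    stats["unique_total"] = len(merged)
    return list(merged.values()), stats


# Made-up example: one sequence appears in both sources.
records, stats = reconcile({
    "source_a": [{"strain": "s1"}, {"strain": "s2"}],
    "source_b": [{"strain": "s2"}, {"strain": "s3"}],
})
print(stats)
```

The stats dictionary is the part that matters to me: raw totals per source, the number of duplicates dropped, and the number of unique records left.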

This would give me a dataset of acceptable quality for visualization. I assumed that Nextstrain already does this processing but doesn’t expose the stats.

Hi @abitrolly, the main build you are probably accessing at Nextstrain.org (auspice) uses data entirely from GISAID. You can see this in the URL and in the flag at the top. We have another link which shows data from NCBI: auspice. We don’t pull data from ENA (this would require assembling the genomes). So, we keep each data source separate and don’t combine this data.

You can find details of the filtering and subsampling in the code. For subsampling, we sample based on geography and time, with a focus on later sequences. For regional builds, these are focused on the region in question. You can find the code for the subsampling here.
The filtering steps are more numerous I’m afraid. You can find a few of them here and here. All of this is done each time the dataset is run, so the results will differ each time.

I hope it helps!

That helps a lot! Thank you.

Some more thoughts.

image

The header could make the Nextstrain data model quicker to understand if the word “sampled” were an active link that shows an infobox on hover, listing the following:

  • explain Nextstrain limitation (Nextstrain widget is limited to 3000 datapoints, so we’ve tried to select the most relevant)
  • which dataset(s) were used (using NCBI dataset, other datasets GISAID, ENA)
    • NCBI here may be an active link containing hash and date of the dataset
  • how many data points each dataset contains originally, and how many combined (containing 10000 raw data points)
  • how many data points are left after refining (9999 data points after refining)
    • removed duplicates
    • other quality checks
  • how many points are selected for the current division (8888 data points in Europe subdivision)
    • subdivision could be a link to list all processed subdivisions
  • subsampling params (3000 points selected by “algorithm/tool”, random seed if needed for reproducibility)
    • “algorithm/tool” is the actual name of the algorithm or tool used, and also an active link to the description of subsampling algorithm and how to reproduce results
  • credits link (list all authors in subdivision, and in sampled data)
  • an invitation to “tweet” in this forum if the infobox content was useful or not
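As a sketch of the data such an infobox would need, the fields above could be collected in one small structure. The class `SamplingSummary` and all of its field names are hypothetical, and the numbers are the placeholder values from the list, not real counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingSummary:
    """Hypothetical container for the stats shown in the infobox."""
    source: str            # dataset used, e.g. "NCBI"
    raw_points: int        # data points before any processing
    refined_points: int    # after duplicate removal and quality checks
    division: str          # current subdivision, e.g. "Europe"
    division_points: int   # refined points that fall in the subdivision
    selected_points: int   # points left after subsampling
    method: str            # name of the subsampling algorithm or tool
    seed: Optional[int] = None  # random seed, if needed for reproducibility

    def lines(self):
        """Render the summary as short text lines for the infobox."""
        out = [
            f"using {self.source} dataset",
            f"containing {self.raw_points} raw data points",
            f"{self.refined_points} data points after refining",
            f"{self.division_points} data points in {self.division} subdivision",
            f"{self.selected_points} points selected by {self.method}",
        ]
        if self.seed is not None:
            out.append(f"random seed {self.seed}")
        return out


summary = SamplingSummary(
    source="NCBI", raw_points=10000, refined_points=9999,
    division="Europe", division_points=8888,
    selected_points=3000, method="algorithm/tool",
)
print("\n".join(summary.lines()))
```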