I’m new here and I want to ask you a very beginner question.
I need to get some information about SARS-CoV-2 in South American countries, including the number of samples deposited in GISAID in the last 3 months and the lineages present in each country. So under Dataset I selected 6 months and global, set the date range from June 1 to August 31, and colored by Pango lineage. At the bottom of the page I filtered by country, starting with Brazil.
About the samples deposited, it says: “Showing 26 of 3898 genomes sampled between Jun 2024 and Aug 2024”.
However, when I change the Dataset from global to south-america it shows: “Showing 395 of 2676 genomes sampled between Jun 2024 and Aug 2024”.
Why is the number of samples so different? It was 26 using the “global” Dataset but 395 using “south-america”… which one should I consider when analysing these data?
The datasets under nextstrain.org/ncov are subsampled (downsampled) to reduce sampling bias and to limit the number of sequences displayed, for web browser performance reasons. The south-america dataset includes more samples from the focal region than the global dataset, which aims to sample proportionally to population size, hence the difference you observed.
To address your motivation:
I need to get some information about SARS-CoV-2 in South American countries, including the number of samples deposited in GISAID in the last 3 months and the lineages present in each country.
I would not use the numbers from nextstrain.org/ncov due to the subsampling mentioned above. Assuming you have a GISAID account, the information you are looking for can be obtained directly through GISAID’s website with filters such as:
The bottom of the page shows the total number of sequences that match the filters. It looks like each sequence in GISAID has a Pango lineage associated with it, so you could download the metadata (via Download > Input for the Augur pipeline) and inspect the pangolin_lineage column.
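In case it helps, here is a minimal Python sketch of how the downloaded metadata could be summarized per country. It is only an illustration under some assumptions: that the download is a tab-separated file, and that besides pangolin_lineage it has columns named date, region and country (please check the header of your own file). The function name summarize_metadata and the example rows are made up for the demonstration and are not real GISAID data.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import date


def summarize_metadata(tsv_file, start, end, region="South America"):
    """Count sequences and Pango lineages per country within [start, end].

    Assumes tab-separated metadata with the columns `date`, `region`,
    `country` and `pangolin_lineage` (only the last one is confirmed above).
    """
    totals = Counter()
    lineages = defaultdict(Counter)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row.get("region") != region:
            continue
        try:
            sampled = date.fromisoformat(row["date"])
        except (KeyError, TypeError, ValueError):
            # Skip rows with missing or incomplete dates such as "2024-06".
            continue
        if start <= sampled <= end:
            country = row.get("country") or "unknown"
            lineage = row.get("pangolin_lineage") or "unassigned"
            totals[country] += 1
            lineages[country][lineage] += 1
    return totals, lineages


# Made-up placeholder rows, only to show the expected shape of the input.
example = io.StringIO(
    "strain\tdate\tregion\tcountry\tpangolin_lineage\n"
    "s1\t2024-06-15\tSouth America\tBrazil\tJN.1\n"
    "s2\t2024-07-02\tSouth America\tBrazil\tKP.3\n"
    "s3\t2024-08-20\tSouth America\tChile\tJN.1\n"
    "s4\t2024-05-30\tSouth America\tBrazil\tJN.1\n"  # before the date range
    "s5\t2024-07-10\tEurope\tSpain\tKP.3\n"          # outside the region
)

totals, lineages = summarize_metadata(example, date(2024, 6, 1), date(2024, 8, 31))
for country, n in totals.most_common():
    print(country, n, dict(lineages[country]))
```

For a real file you would replace the example with something like `open("metadata.tsv", newline="")`, where the file name is whatever you saved the download as. Note that this filters on the date column, which I am assuming is the collection date; if you need the deposit date instead, you would have to use the corresponding submission date column, if your file has one.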