Difference between sequence samples based on Dataset

neto · September 15, 2024, 6:44pm

Dear,

I’m new here and I want to ask you a very beginner question.

I need some information about SARS-CoV-2 in South American countries, including the number of samples deposited in GISAID in the last 3 months and the lineages present in each country. So under Dataset I selected 6 months and global, set the date range to June 1 to August 31, and colored by Pango lineage. At the bottom of the page I filtered by country, starting with Brazil.

For the samples deposited, it says: “Showing 26 of 3898 genomes sampled between Jun 2024 and Aug 2024”.

However, when I change the Dataset from global to south-america it shows: “Showing 395 of 2676 genomes sampled between Jun 2024 and Aug 2024”.

Why is the number of samples so different? It was 26 using the “global” Dataset but 395 using “south-america”… which one should I consider when analysing these data?

Thank you for your help and time

victorlin · September 16, 2024, 5:42pm

Hi @neto,

The datasets under nextstrain.org/ncov are subsampled (downsampled) to reduce sampling bias and to limit the number of sequences displayed, for web browser performance reasons. The south-america dataset includes more samples from the focal region than the global dataset, which aims to sample in proportion to population size, hence the difference you observed.

To address your motivation:

“I need to get some information about SARS-CoV-2 in the South America countries, including the number of samples deposited in the GISAID in the last 3 months and lineages present for each country.”

I would not use the numbers from nextstrain.org/ncov due to the subsampling mentioned above. Assuming you have a GISAID account, the information you are looking for can be obtained directly through GISAID’s website with filters such as:

[Screenshot: GISAID filters]

The bottom of the page shows the total number of sequences that match the filters. It looks like each sequence in GISAID has a Pango lineage associated with it, so you could download the metadata (via Download > Input for the Augur pipeline) and inspect the pangolin_lineage column.
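
If it helps, here is a rough sketch of how that downloaded metadata could be tallied per country. It is only an illustration under a few assumptions: that the download is a tab-separated file, and that it has `date`, `region`, and `country` columns alongside `pangolin_lineage` (only the last of these is confirmed above, so please check the header of your own file). The function name `summarize_by_country` is made up for this example. Note that it filters on the value in `date`, which I am assuming is the collection date rather than the submission date.

```python
import csv
from collections import Counter, defaultdict
from datetime import date


def summarize_by_country(tsv_file, region="South America",
                         start=date(2024, 6, 1), end=date(2024, 8, 31)):
    """Count sequences and lineages per country within a date window.

    tsv_file is an open text file (or any iterable of lines) holding
    tab-separated metadata. The column names `date`, `region` and
    `country` are assumptions; `pangolin_lineage` is the column
    mentioned in this thread.
    """
    totals = Counter()               # country -> number of sequences
    lineages = defaultdict(Counter)  # country -> lineage -> count

    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row.get("region") != region:
            continue
        try:
            collected = date.fromisoformat(row["date"])
        except (KeyError, TypeError, ValueError):
            # Skip rows with a missing or incomplete date, e.g. "2024-06".
            continue
        if not (start <= collected <= end):
            continue
        country = row.get("country") or "unknown"
        lineage = row.get("pangolin_lineage") or "unassigned"
        totals[country] += 1
        lineages[country][lineage] += 1

    return totals, lineages
```

You would call it with an open file handle, for example `open("metadata.tsv", newline="", encoding="utf-8")` (the file name is a placeholder), then look at `totals["Brazil"]` for the sequence count and `lineages["Brazil"].most_common()` for the lineages seen there.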

I hope this helps!

– Victor
