Select SARS-COV-2 sequence with alpha, beta, gamma, delta mutations

jiehuang001 · July 11, 2021, 6:18am

Hi, after downloading the full set of genomes from GISAID, can I run a query to extract all genomes with alpha, beta, gamma, or delta mutations?

Right now, it seems that I can only use “==” inside a query, but not something like “like”.

Please advise.

Thank you & best regards,
Jie

rneher · July 12, 2021, 12:05pm

Hi @jiehuang001, if you are using the open metadata (see https://nextstrain.org/blog/2021-07-08-ncov-open-announcement), you can do things like

--query "Nextstrain_clade=='21A (Delta)'"

This would select all sequences assigned to this variant.

Depending on what data you download from GISAID, you can do similar queries on pango_lineages. Nextstrain_clade is only present in Nextstrain supplied files.
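
For picking out several variants in one pass outside of augur, a minimal Python sketch using only the standard library might look like the following. It assumes a tab-separated metadata file with `strain` and `Nextstrain_clade` columns; the helper name `select_by_prefix` and the sample rows are made up for illustration. Prefix matching stands in for the "like"-style match asked about above.

```python
import csv
import io


def select_by_prefix(handle, column, prefixes):
    """Return rows whose value in `column` starts with any of `prefixes`."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    if column not in header:
        raise KeyError(f"column {column!r} not found; available: {header}")
    prefixes = tuple(prefixes)
    # str.startswith accepts a tuple, so one call checks every prefix.
    return [row for row in reader if (row[column] or "").startswith(prefixes)]


# Illustrative rows only; real metadata has many more columns.
sample = io.StringIO(
    "strain\tNextstrain_clade\n"
    "sampleA\t21A (Delta)\n"
    "sampleB\t20I (Alpha, V1)\n"
    "sampleC\t19A\n"
)
delta_or_alpha = select_by_prefix(sample, "Nextstrain_clade", ["21A", "20I"])
print([row["strain"] for row in delta_or_alpha])
```

For a real file, the `io.StringIO` object would be replaced by an open file handle. If the download uses a different column name (for example a Pango lineage column), pass that name instead.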

jiehuang001 · July 24, 2021, 6:59am

Thanks, Meher!

I opened the link that you provided. It seems that the metadata listed there has fewer than 1 million rows, while the latest GISAID download has more than 2 million SARS-COV-2 genomes.

Based on the WHO’s website (Tracking SARS-CoV-2 variants), the Delta variant has a GISAID clade of “G/478K.V1” and a Nextstrain clade of “21A”. But is this strictly a one-to-one relationship? That is, do all viruses belonging to Nextstrain clade “21A” have the Delta mutations, and vice versa? I wish there were a formula for classifying/predicting the alpha, beta, gamma, and delta mutations.

Best regards,
Jie

rneher · July 26, 2021, 8:30pm

The covariants.org website provides a list of mutations that define various VoCs:
CoVariants

antoine · January 9, 2022, 6:31pm

Hello
I am trying to extract BA.1 (omicron…) sequences but I get zero results.
I used the following
augur filter --metadata …/_msa_20220109/data/metadata_gisaid.tsv.gz --query "('Pango lineage' == 'BA.1')" --exclude-ambiguous-dates-by any --output-strains ./myouput

any suggestion or alternative?
thank you!

corneliusroemer · January 18, 2022, 6:57pm

antoine:

augur filter --metadata …/_msa_20220109/data/metadata_gisaid.tsv.gz \
--query "('Pango lineage' == 'BA.1')" \
--exclude-ambiguous-dates-by any --output-strains ./myouput

You need to make sure the column headings match exactly. Have a look at the metadata file itself and see what the Pango lineage column is named. I suspect pango_lineage.

Also, to verify that there are some BA.1 sequences in there, you can use tools like tsv-filter from tsv-utils (available on Anaconda.org).
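
If tsv-filter is not at hand, the same two checks (what the lineage column is called, and whether any BA.1 rows exist) can be sketched in plain Python. This is only a rough sketch: it assumes a gzip-compressed, tab-separated file whose lineage column has "lineage" somewhere in its header, and the function name `lineage_counts` and the in-memory sample are made up for illustration.

```python
import csv
import gzip
import io
from collections import Counter


def lineage_counts(handle):
    """Find the first header containing 'lineage' and count that column's values."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    matches = [name for name in header if "lineage" in name.lower()]
    if not matches:
        raise KeyError(f"no lineage-like column in header: {header}")
    column = matches[0]
    counts = Counter(row[column] for row in reader)
    return column, counts


# Tiny in-memory gzip file standing in for the real metadata file.
raw = "strain\tpango_lineage\nsampleA\tBA.1\nsampleB\tBA.1\nsampleC\tAY.4\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))
with gzip.open(buffer, mode="rt", encoding="utf-8", newline="") as handle:
    column, counts = lineage_counts(handle)

# Shows the exact column name to use in --query and how many BA.1 rows exist.
print(column, counts["BA.1"])
```

With a real download, `gzip.open` would be given the path to the metadata file instead of the in-memory buffer.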

antoine · January 18, 2022, 9:00pm

Thank you! This was indeed a column ID issue. I fixed it.
Is there a way to query from a list of countries, e.g. c("countryA", "countryB"), instead of running iterative queries?
I am also trying various subsetting approaches. For now, it seems we can only subset dates by year and month, but not by week. Is that correct? Any suggestions?

thank you again
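
One possible workaround for both questions is to build the strain list outside of augur. As far as I know, `--query` expressions are handed to pandas, so a list test such as `country in ['countryA', 'countryB']` may also work directly, but I have not verified that against this file. The sketch below uses only the Python standard library and assumes columns named `strain`, `country`, and `date` (YYYY-MM-DD); the function name `select_rows` and the sample rows are hypothetical.

```python
import csv
import io
from datetime import datetime


def select_rows(handle, countries, iso_year, iso_week):
    """Return strain names from the given countries collected in one ISO week."""
    wanted = set(countries)
    selected = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["country"] not in wanted:
            continue
        try:
            collected = datetime.strptime(row["date"], "%Y-%m-%d").date()
        except ValueError:
            continue  # skip ambiguous dates such as 2022-01-XX
        year, week, _ = collected.isocalendar()
        if (year, week) == (iso_year, iso_week):
            selected.append(row["strain"])
    return selected


# Illustrative rows only.
sample = io.StringIO(
    "strain\tcountry\tdate\n"
    "sampleA\tcountryA\t2022-01-03\n"
    "sampleB\tcountryB\t2022-01-09\n"
    "sampleC\tcountryC\t2022-01-05\n"
    "sampleD\tcountryA\t2022-01-10\n"
    "sampleE\tcountryB\t2022-01-XX\n"
)
strains = select_rows(sample, ["countryA", "countryB"], 2022, 1)
print(strains)
```

The resulting names could be written one per line to a text file and used to restrict a later filtering step to those strains.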
