Select SARS-CoV-2 sequences with alpha, beta, gamma, delta mutations

Hi, after downloading the full set of genomes from GISAID, can I run a query to extract all genomes with alpha, beta, gamma, or delta mutations?

Right now, it seems that I can only use “==” inside a query, but not something like “like”.

Please advise.

Thank you & best regards,
Jie

Hi @jiehuang001, if you are using the open metadata (see here https://nextstrain.org/blog/2021-07-08-ncov-open-announcement) you can do things like

--query "Nextstrain_clade=='21A (Delta)'"

This would select all sequences assigned to this variant.

Depending on what data you download from GISAID, you can run similar queries on the Pango lineage column. Nextstrain_clade is only present in Nextstrain-supplied files.
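
As a rough sketch of the Pango lineage route: `--query` is evaluated with pandas query syntax, which also accepts an `in [...]` membership test, so one expression can cover several variants. The snippet below only builds the query string. The column name `pango_lineage` and the helper `build_lineage_query` are assumptions for illustration; check the header of your own metadata file. The mapping lists the parent lineage for each WHO label only, so sublineages (for example the AY lineages under Delta) would not be matched.

```python
# WHO label -> parent Pango lineage. Sublineages (for example the AY.*
# lineages under Delta) are NOT covered by this exact-match list.
WHO_TO_PANGO = {
    "Alpha": "B.1.1.7",
    "Beta": "B.1.351",
    "Gamma": "P.1",
    "Delta": "B.1.617.2",
}


def build_lineage_query(labels, column="pango_lineage"):
    """Build a pandas-style membership expression for the given WHO labels.

    `column` is an assumption; use the name found in your metadata header.
    """
    lineages = [WHO_TO_PANGO[label] for label in labels]
    quoted = ", ".join(f"'{lineage}'" for lineage in lineages)
    return f"{column} in [{quoted}]"


query = build_lineage_query(["Alpha", "Beta", "Gamma", "Delta"])
print(query)
```

The printed string can then be passed to augur filter inside double quotes, as in the `--query` example above.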

Thanks, Meher!

I opened the link that you provided. It seems that the metadata listed there has fewer than 1 million rows, while the latest GISAID download has more than 2 million SARS-CoV-2 genomes.

Based on the WHO’s website (Tracking SARS-CoV-2 variants), the Delta mutation has a GISAID clade of “G/478K.V1” and a Nextstrain clade of “21A”. But is this strictly a one-to-one relationship? That is, do all viruses belonging to Nextstrain clade “21A” have the Delta mutation, and vice versa? I wish there were a formula for classifying/predicting the alpha, beta, gamma, and delta mutations.

Best regards,
Jie

The covariants.org website (CoVariants) provides a list of mutations that define various VoCs.

Hello
I am trying to extract BA.1 (omicron…) sequences but I get zero results.
I used the following
augur filter --metadata …/_msa_20220109/data/metadata_gisaid.tsv.gz --query "('Pango lineage' == 'BA.1')" --exclude-ambiguous-dates-by any --output-strains ./myouput

Any suggestions or alternatives?
Thank you!

You need to make sure the column headings match exactly. Have a look at the metadata file itself and see what the Pango lineage column is named. I suspect it is pango_lineage.

Also, to verify that there are some BA.1 sequences in there, you can use a tool like tsv-filter from tsv-utils (available on Anaconda.org).
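
If tsv-utils is not at hand, the same two checks (what the columns are called, and whether any BA.1 rows exist) can be sketched with the Python standard library alone. This is a minimal illustration, not part of augur: the helper names and the tiny sample table are made up, and `open_metadata` simply assumes a tab-separated file that may be gzip-compressed.

```python
import csv
import gzip
import io


def open_metadata(path):
    """Open a plain or gzip-compressed TSV file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def column_names(handle):
    """Return the header fields of a tab-separated metadata file."""
    return next(csv.reader(handle, delimiter="\t"))


def count_value(handle, column, value):
    """Count rows whose `column` equals `value` exactly."""
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"no column named {column!r}; found {reader.fieldnames}")
    return sum(1 for row in reader if row[column] == value)


# Made-up sample standing in for the real metadata file.
SAMPLE = (
    "strain\tpango_lineage\tcountry\n"
    "sampleA\tBA.1\tcountryA\n"
    "sampleB\tB.1.617.2\tcountryB\n"
    "sampleC\tBA.1\tcountryB\n"
)

print(column_names(io.StringIO(SAMPLE)))
print(count_value(io.StringIO(SAMPLE), "pango_lineage", "BA.1"))
```

For a real file, replace `io.StringIO(SAMPLE)` with `open_metadata(path)` inside a `with` block. Asking for a column that is not in the header raises a `KeyError` listing the names that were found, which makes a mismatched heading easy to spot.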

Thank you! This was indeed a column ID issue. I fixed it.
Is there a way to query from a list of countries, e.g. c(“countryA”, “countryB”), instead of running iterative queries?
I am also trying various subsetting approaches. For now, it seems that we can only subset dates by year and month, but not by week. Is that correct? Any suggestions?

thank you again
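
One possible approach to both questions, offered as a sketch rather than a confirmed recipe: since `--query` uses pandas query syntax, a list of countries can be written as a single `in [...]` expression, and a single week can be expressed as a date range for augur filter's `--min-date` and `--max-date` options. The snippet below only assembles the arguments. The column name `country`, the helper names, and the choice of ISO weeks (Monday to Sunday) are assumptions for illustration.

```python
from datetime import date, timedelta


def iso_week_bounds(year, week):
    """Return (Monday, Sunday) of the given ISO week as date objects."""
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)


def build_filter_args(countries, year, week, column="country"):
    """Assemble augur filter arguments for a country list and one ISO week.

    `column` is an assumption; use the name found in your metadata header.
    """
    quoted = ", ".join(f"'{name}'" for name in countries)
    start, end = iso_week_bounds(year, week)
    return [
        "--query", f"{column} in [{quoted}]",
        "--min-date", start.isoformat(),
        "--max-date", end.isoformat(),
    ]


args = build_filter_args(["countryA", "countryB"], 2022, 1)
print(args)
```

The resulting list can be appended to the rest of an augur filter command, for example through `subprocess.run`, or the three values can be copied into a shell command by hand.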