This would select all sequences assigned to this variant.
Depending on what data you download from GISAID, you can run similar queries on pango_lineages. The Nextstrain_clade column is only present in Nextstrain-supplied files.
I opened the link that you provided. It seems that the metadata listed there has fewer than 1 million rows, while the latest GISAID download has more than 2 million SARS-CoV-2 genomes.
Based on the WHO’s website (Tracking SARS-CoV-2 variants), the Delta variant has a GISAID clade of “G/478K.V1” and a Nextstrain clade of “21A”. But is this strictly a one-to-one relationship? That is, do all viruses belonging to Nextstrain clade “21A” belong to the Delta variant, and vice versa? Is there a formula for classifying or predicting the Alpha, Beta, Gamma, and Delta variants?
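One way to approach the one-to-one question empirically is to tabulate the label pairs that actually occur in a metadata file and look for labels that map to more than one partner. The following is only a minimal sketch: the function name `check_one_to_one` is hypothetical, and the label pairs are made-up placeholders, not real clade or variant assignments.

```python
from collections import defaultdict


def check_one_to_one(pairs):
    """Check whether (left, right) label pairs form a one-to-one mapping.

    Returns (is_one_to_one, ambiguous_left, ambiguous_right), where the
    ambiguous dicts list every label seen with more than one partner.
    """
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)

    ambiguous_left = {k: sorted(v) for k, v in left_to_right.items() if len(v) > 1}
    ambiguous_right = {k: sorted(v) for k, v in right_to_left.items() if len(v) > 1}
    return (not ambiguous_left and not ambiguous_right, ambiguous_left, ambiguous_right)


# Placeholder labels for illustration only (NOT real assignments).
toy_pairs = [
    ("cladeX", "variant1"),
    ("cladeY", "variant1"),
    ("cladeZ", "variant2"),
]

ok, by_clade, by_variant = check_one_to_one(toy_pairs)
print(ok)          # False: variant1 is paired with two clades
print(by_variant)  # {'variant1': ['cladeX', 'cladeY']}
```

In practice the pairs would come from two columns of the metadata table (for example a clade column and a lineage column), so the answer depends on the file you downloaded rather than on any fixed formula.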
Hello
I am trying to extract BA.1 (Omicron…) sequences, but I get zero results.
I used the following
augur filter --metadata …/_msa_20220109/data/metadata_gisaid.tsv.gz --query "('Pango lineage' == 'BA.1')" --exclude-ambiguous-dates-by any --output-strains ./myouput
You need to make sure the column headings match exactly. Have a look at the metadata file itself and see what the Pango lineage column is actually named. I suspect it is pango_lineage.
Also, to verify that there are some BA.1 sequences in there, you can use a tool like tsv-filter from tsv-utils (available on Anaconda.org).
Thank you! This was indeed a column ID issue. I fixed it.
Is there a way to query from a list of countries, e.g. c(“countryA”, “countryB”), instead of running iterative queries?
I am also trying various subsetting approaches. For now, it seems like we can only subset dates by year and month, but not by week. Is that correct? Any suggestions?
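If tsv-filter is not at hand, the same two checks (what the columns are called, and whether any BA.1 rows exist) can be sketched with the Python standard library. This is a rough illustration, not part of augur: the helper names `open_tsv`, `column_names`, and `count_values` are hypothetical, and the toy table assumes a column named `pango_lineage`, which may differ in your file.

```python
import csv
import gzip
import io
from collections import Counter


def open_tsv(path):
    """Open a plain or gzip-compressed TSV file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def column_names(handle):
    """Return the header row of a tab-separated file handle."""
    return next(csv.reader(handle, delimiter="\t"))


def count_values(handle, column):
    """Count how often each value occurs in `column`."""
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"no column named {column!r}; found {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Toy table standing in for a real metadata file (assumed column names).
toy = "strain\tpango_lineage\ns1\tBA.1\ns2\tB.1.617.2\ns3\tBA.1\n"

print(column_names(io.StringIO(toy)))
print(count_values(io.StringIO(toy), "pango_lineage")["BA.1"])
```

For a real download, you would pass `open_tsv(...)` with your own metadata path instead of the `io.StringIO` toy table. Asking for a column that does not exist raises a `KeyError` listing the actual headings, which is the kind of mismatch that gives zero results in a query.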
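As far as I understand, `--query` is evaluated as a pandas query expression, so something like `country in ['countryA', 'countryB']` may work for the country list, though I have not tested that here. For week-based subsetting, one possible workaround is to pre-select strain names outside augur. Below is a minimal sketch using only the Python standard library; the function names `iso_week` and `select_strains` are hypothetical, and the column names `strain`, `date`, and `country` are assumptions about the metadata file.

```python
import csv
import io
from datetime import datetime


def iso_week(date_text):
    """Return (ISO year, ISO week) for a YYYY-MM-DD date, or None if it cannot be parsed."""
    try:
        parsed = datetime.strptime(date_text, "%Y-%m-%d").date()
    except ValueError:
        return None  # ambiguous or incomplete dates such as 2022-01-XX
    year, week, _weekday = parsed.isocalendar()
    return (year, week)


def select_strains(handle, countries, week=None):
    """Return strain names from the given countries, optionally limited to one ISO week."""
    wanted = set(countries)
    selected = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["country"] not in wanted:
            continue
        if week is not None and iso_week(row["date"]) != week:
            continue
        selected.append(row["strain"])
    return selected


# Toy table for illustration only (assumed column names, made-up rows).
toy = (
    "strain\tdate\tcountry\n"
    "s1\t2022-01-03\tcountryA\n"
    "s2\t2022-01-09\tcountryB\n"
    "s3\t2022-01-10\tcountryB\n"
    "s4\t2022-01-05\tcountryC\n"
    "s5\t2022-01-XX\tcountryA\n"
)

print(select_strains(io.StringIO(toy), ["countryA", "countryB"]))
print(select_strains(io.StringIO(toy), ["countryA", "countryB"], week=(2022, 1)))
```

Rows whose date cannot be parsed are dropped whenever a week is requested, since there is no way to assign them to a week. The resulting strain names could then be written to a text file, one per line, for further filtering.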