Curate data from the full GISAID database

Hi, there:

For the tutorial on “Explore SARS-CoV-2 evolution”, under the section of “Preparing your data” and then “Curate data from the full GISAID database”, it says: Find the “Download packages” section and select the “FASTA” button.

My question is: why don’t we download the "MSA full XXXX" file under the “alignment and proteins” section? I think the FASTA file under the “Download packages” section is not aligned.

BTW, I found there are ambiguous sequences in the FASTA file, with letters such as “k”, “y”. Is there a way to filter out sequences containing those letters?

Thank you very much & Best regards,
Jie

Hi @jiehuang001! You could potentially start with the full nucleotide MSA from GISAID instead of the unaligned sequences. We haven’t tested this approach, since we rely on features of nextalign’s alignments that are not guaranteed to be present in the GISAID alignments. If you wanted to use the pre-aligned sequences as input for your workflow, you could change the recommended builds.yaml section to look like this to skip the alignment step of the workflow:

# Define inputs for the workflow.
inputs:
  - name: subsampled-gisaid
    metadata: data/subsampled_metadata_gisaid.tsv.gz
    aligned: data/subsampled_sequences_gisaid.fasta.gz

You can filter strains whose sequences have invalid nucleotide characters with the augur filter flag --non-nucleotide. However, we do allow valid IUPAC characters (like “K” and “Y”) in our analyses. These will get inferred as the most likely nucleotide character by TreeTime during the refine step of the workflow.
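
For reference, here is a small sketch of what the IUPAC ambiguity codes stand for. The table is the standard IUPAC nucleotide code. The helper name `describe` is only for illustration and is not part of Augur or TreeTime.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one can stand for.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

UNAMBIGUOUS = {"A", "C", "G", "T"}


def describe(char):
    """Return the set of bases a character can represent (hypothetical helper).

    Characters that are not nucleotide codes, such as the gap "-", give an empty set.
    """
    char = char.upper()
    if char in UNAMBIGUOUS:
        return {char}
    return set(IUPAC_AMBIGUITY.get(char, ""))


if __name__ == "__main__":
    for c in "kyn-":
        print(c, sorted(describe(c)))
```

So a “K” in a sequence means "G or T" and a “Y” means "C or T", which is why these are treated as valid rather than as errors.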

Thanks!

Exactly how do I use “augur filter --non-nucleotide”? I did not find any example code. I want to REMOVE those non-nucleotide sequences, not to EXTRACT them. What if I also want to exclude those sequences with missing genotype (denoted by “-” or “n”)?

I still could not figure out how to extract sequences with alpha, beta, gamma, and delta mutations. I assume that everybody downloaded the sequences and the metadata from GISAID. However, there is not a label for such mutations in the metadata.

Your help would be greatly appreciated.

Thank you very much & Best regards,
Jie

You can filter (i.e., exclude or remove) sequences with invalid nucleotide characters like so:

augur filter --sequences input.fasta --non-nucleotide --output filtered.fasta

This will not filter out sequences with valid IUPAC characters, gaps ("-"), or “N” characters. We allow these characters in our analyses because they can be inferred later on. If you want to exclude these, you’ll need to use a different tool or write a script to filter in this way.
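
If you do want the stricter filter, a script along these lines is one way to do it. This is a minimal sketch using only the Python standard library, not something from the Nextstrain workflow. The function names (`read_fasta`, `is_strict`, `filter_fasta`) and the `ALLOWED` set are made up for this example, and it assumes a plain or gzipped FASTA file.

```python
import gzip

# Characters kept by this sketch: only A, C, G, T (case-insensitive).
# This is stricter than Augur's rule; add "N" or "-" to ALLOWED to relax it.
ALLOWED = set("ACGT")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_strict(sequence, allowed=ALLOWED):
    """True if the sequence is non-empty and uses only allowed characters."""
    return len(sequence) > 0 and set(sequence.upper()) <= allowed


def open_text(path, mode):
    """Open a plain or gzipped file in text mode, based on the file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def filter_fasta(in_path, out_path, allowed=ALLOWED):
    """Write only strict sequences to out_path; return (kept, dropped) counts."""
    kept = dropped = 0
    with open_text(in_path, "r") as src, open_text(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            if is_strict(seq, allowed):
                dst.write(f">{header}\n{seq}\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped


if __name__ == "__main__":
    import sys
    print(filter_fasta(sys.argv[1], sys.argv[2]))
```

Keep in mind that a rule this strict may discard a large share of real genomes, since “N” and gap characters are common. Passing something like `allowed=set("ACGTN")` keeps sequences with missing bases while still dropping the other ambiguity codes.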

Regarding the different WHO lineages, you can search for these in the GISAID interface by selecting from the “Variants” dropdown in the search interface, which provides a list of the available variants.

You can also query the GISAID metadata with Augur for specific Pangolin lineages. Note that a single WHO variant can correspond to multiple Pangolin lineages. The following example shows how to query for all Lambda records from the metadata and output the corresponding sequences.

augur filter \
    --sequences sequences.fasta \
    --metadata metadata.tsv \
    --query "pangolin_lineage == 'C.37'" \
    --output-metadata lambda_metadata.tsv \
    --output-sequences lambda_sequences.fasta
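
Since one WHO variant can span several Pangolin lineages, another option is to group strains by variant with a short script. The following is only a sketch under some assumptions: the metadata is tab-separated, the lineage column is `pangolin_lineage` as in the query above, and the strain column is called `strain`. The lineage lists in `WHO_TO_PANGO` are illustrative and not exhaustive, so please check them against the current Pango designations. The function names are made up for this example.

```python
import csv

# Illustrative, NOT exhaustive: each WHO label maps to exact lineage names
# and to prefixes that cover sublineages (for example "AY." for Delta).
WHO_TO_PANGO = {
    "Alpha": {"exact": {"B.1.1.7"}, "prefixes": ("Q.",)},
    "Beta": {"exact": {"B.1.351"}, "prefixes": ("B.1.351.",)},
    "Gamma": {"exact": {"P.1"}, "prefixes": ("P.1.",)},
    "Delta": {"exact": {"B.1.617.2"}, "prefixes": ("AY.",)},
}


def who_label(lineage):
    """Return the WHO label for a Pangolin lineage, or None if it is not listed."""
    for label, rule in WHO_TO_PANGO.items():
        if lineage in rule["exact"] or lineage.startswith(rule["prefixes"]):
            return label
    return None


def strains_by_variant(metadata_lines, strain_col="strain",
                       lineage_col="pangolin_lineage"):
    """Group strain names by WHO label from tab-separated metadata lines."""
    reader = csv.DictReader(metadata_lines, delimiter="\t")
    result = {label: [] for label in WHO_TO_PANGO}
    for row in reader:
        label = who_label(row.get(lineage_col) or "")
        if label is not None:
            result[label].append(row[strain_col])
    return result


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], newline="") as handle:
        for label, strains in strains_by_variant(handle).items():
            print(label, len(strains))
```

You could then write the strain names for one variant to a text file, one per line, and use that list to pull out the matching sequences. In recent Augur versions, `augur filter` with `--exclude-all` and `--include` should do this.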