Guide to filtering GISAID data for division-specific SARS-CoV-2 builds

Background

One common use case for SARS-CoV-2 analyses is to create a phylogeny focused on a specific place and time with contextual sequences that are genetically related to sequences in the focal region. Although it’s technically possible to perform this analysis with the full GISAID data as input and a proximity-based subsampling scheme, the full GISAID database is so large that this analysis isn’t practical to run regularly.

An alternative approach is to pre-filter the GISAID database to the sequences you want to use as contextual data, apply proximity-based subsampling to these data, and then pull in a pre-built global context from GISAID’s “nextregions” files (available under “Downloads” → “Genomic Epidemiology” → “nextregions”).

The instructions below show a faster way to pre-filter GISAID data than the current data preparation guide in the SARS-CoV-2 documentation. These instructions will eventually replace what is shown in that guide. We’ll also include a full build configuration file that uses these pre-filtered data and a global “nextregions” dataset to perform a state-specific build for the US state of Idaho.

The approach we use below involves the following steps:

  1. Filter GISAID metadata to the records we want based on geography.
  2. Extract strain names for records that pass all filters.
  3. Extract sequences that correspond to selected strain names.

Curating data from GISAID

To start, we assume you have downloaded the full metadata and sequences tarballs from GISAID’s downloads, as described in the data preparation guide. If you don’t see buttons for these files, use GISAID’s contact form to request access to them. The commands below assume the metadata are saved in the ncov workflow directory as data/metadata_tsv.tar.xz and the sequences as data/sequences_fasta.tar.xz.

We also assume you have already installed Nextstrain tools into a Conda environment.
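If you haven’t set that environment up yet, the sketch below shows one way to create it with Conda. This is only an example; check the Nextstrain installation docs for the currently recommended method. The environment name “nextstrain” is an assumption reused by the commands that follow.

# A minimal sketch of one way to create the environment.
# The Nextstrain installation docs may recommend a different
# method; the environment name "nextstrain" is assumed below.
conda create -n nextstrain -c conda-forge -c bioconda nextstrain-cli snakemake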

First, install the additional tools we will need to filter metadata and extract sequences.

# Install tsv-utils and the UCSC faSomeRecords tool for filtering
# metadata and extracting sequences. You only need to do this once.
conda activate nextstrain
conda install -c conda-forge -c bioconda tsv-utils ucsc-fasomerecords

Select all metadata from the USA by filtering on the GISAID Location field.

# Select all metadata from the USA.
tar xJOf data/metadata_tsv.tar.xz metadata.tsv \
  | tsv-filter -H --str-in-fld Location:USA \
  | xz -c -2 > data/metadata_usa.tsv.xz

Count the number of records to see how many strains we’ve selected. (The count includes the header line.)

xz -c -d data/metadata_usa.tsv.xz | wc -l

At the time of writing, there are ~2.5 million records from the USA in GISAID. We should also check the number of records from the state of interest, Idaho.

xz -c -d data/metadata_usa.tsv.xz \
  | tsv-filter -H --str-in-fld Location:Idaho \
  | wc -l
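
If you aren’t sure which division names appear in the metadata, a quick group-by count can help. The sketch below uses tsv-summarize from the tsv-utils package installed earlier; note that GISAID’s Location values are hierarchical strings (e.g., “North America / USA / Idaho”), so counts are per full location string.

# Count records per full Location string and show the largest
# groups first. The header row sorts to the bottom.
xz -c -d data/metadata_usa.tsv.xz \
  | tsv-summarize -H --group-by Location --count \
  | sort -t$'\t' -k2,2nr \
  | head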

There are 18,626 records from Idaho, too many to include in a single phylogeny that will run in a reasonable amount of time. We are faced with two problems:

  1. We need to align ~2.5 million genomes, so we can calculate genetic proximity of contextual strains in the US to the focal set in Idaho.
  2. We need to subset our data in Idaho to a reasonable amount for a Nextstrain phylogeny.

For the moment, we will assume that we don’t mind waiting several hours to align 2.5 million genomes on our laptop and that we will define our build configuration to limit the number of Idaho strains to a reasonable number. After this initial work, we’ll see how we can use open GenBank-sourced data curated by Nextstrain to avoid this extra work.

Note: For this example build, we know we want to limit records to those collected up to June 1, 2021. This detail will appear in the build configurations below.

Next, we extract the strain names from the metadata, along with the collection and submission dates for each record. We do this because GISAID’s FASTA definition lines join the virus name, collection date, and submission date with pipe characters, as in hCoV-19/USA/CO-CDPHE-2102044933/2021|2021-10-15|2021-11-24. Most tools that extract sequences from a FASTA file by name match against the full definition line, so we build our strain list to match this pattern.

# Get strain names for genomes.
# GISAID uses virus name, collection date, and submission date
# delimited by a pipe character.
xz -c -d data/metadata_usa.tsv.xz \
  | tsv-select -H -f 'Virus\ name','Collection\ date','Submission\ date' \
  | sed 1d \
  | sed 's/\t/|/g' > data/strains_usa.txt
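
Before scanning the large sequence tarball, it’s worth sanity-checking that the strain list matches the pipe-delimited pattern described above:

# Each line should look like
# "virus name|collection date|submission date".
head -n 3 data/strains_usa.txt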

Extract the sequences that correspond to those strain names into a separate compressed FASTA file. At the time of writing, this command took ~23 minutes to scan the entire GISAID database.

# Get sequences for strain names from tarball.
tar xJOf data/sequences_fasta.tar.xz sequences.fasta \
  | faSomeRecords /dev/stdin data/strains_usa.txt /dev/stdout \
  | xz -c -2 > data/sequences_usa.fasta.xz
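
Since faSomeRecords silently skips names it can’t find, a quick count comparison against the strain list can catch mismatched definition lines:

# Number of strain names requested.
wc -l < data/strains_usa.txt
# Number of sequences actually extracted.
xz -c -d data/sequences_usa.fasta.xz | grep -c '^>'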

We now have metadata and sequences for the entire USA. We also want contextual data that represents what has been circulating globally, to provide structure to our phylogeny. A quick way to get this kind of context is to download a “nextregions” dataset from GISAID that corresponds to one of the seven major regional builds on Nextstrain’s SARS-CoV-2 page. Follow the instructions in the data preparation guide to download these nextregions data. For this example, we will use the “North America” dataset and save the tarball as data/hcov_north-america.tar.gz in our ncov workflow directory.
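
You can optionally list the tarball’s contents to confirm it contains the metadata and sequence files the workflow expects:

# List the contents of the nextregions tarball.
tar tvf data/hcov_north-america.tar.gz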

Next, we need to define a build configuration that uses these data we have curated. For this example, we will save the YAML file below as builds_idaho.yaml. See the inline comments for details.

# Define our input data, assigning a name to each dataset.
# We will reference these names in the subsampling scheme
# below, as a way to select specific strains for the analysis.
inputs:
  # These are the USA-specific data we manually curated.
  - name: usa
    metadata: data/metadata_usa.tsv.xz
    sequences: data/sequences_usa.fasta.xz
  # These are the "nextregions" data we downloaded from GISAID
  # that will provide our phylogenetic context. Note that the
  # workflow knows how to extract the corresponding metadata
  # and sequences from these tarballs, so you don't need to do
  # that step manually.
  - name: nextregions
    metadata: data/hcov_north-america.tar.gz
    sequences: data/hcov_north-america.tar.gz
  # We force include reference sequences used for alignment
  # and rooting of the time tree. Although these should be
  # in the "nextregions" data already, it is safest to always
  # include these explicitly.
  - name: references
    metadata: data/references_metadata.tsv
    sequences: data/references_sequences.fasta

# Define a single build for the state of interest, Idaho.
# The build name will be "idaho" and it will use the custom
# subsampling scheme defined below.
builds:
  idaho:
    subsampling_scheme: idaho_scheme

# Define a single subsampling scheme for the state of Idaho.
# This analysis is for a specific date range, so we specify
# the same maximum collection date for strains in all sections
# of the subsampling scheme below.
subsampling:
  idaho_scheme:
    # Include all data from Idaho that are found in the "usa"
    # input. When the workflow merges metadata from multiple
    # inputs, it creates a boolean column for each input to
    # indicate which input each record came from. A record
    # from the "usa" input will have a value of "yes" in a
    # column named "usa". The same record will have a column
    # for the "nextregions" input with a value of "no".
    idaho:
      query: --query "(usa == 'yes') & (division == 'Idaho')"
      max_date: --max-date 2021-06-01
      # Limit the number of Idaho records included in the
      # analysis to a reasonable but large number. Tune this
      # number along with the other "max_sequences" values in the
      # sections below to keep your final build to <10,000
      # records.
      max_sequences: 5000
    # To understand transmission patterns within the US that
    # led to introductions to Idaho, we select a subset of USA
    # data from states other than Idaho with priority given to
    # strains that are genetically similar to the strains in
    # the "idaho" subsampling set defined above.
    usa:
      query: --query "(usa == 'yes') & (division != 'Idaho')"
      max_date: --max-date 2021-06-01
      # This value sets a hard upper limit on how many strains
      # make it into the analysis. Tune this value, based on
      # your needs for the resulting tree.
      max_sequences: 1000
      # These group-by columns attempt to evenly sample across
      # US states by year and month. Sequences in each group
      # of state, year, and month are prioritized by genetic
      # proximity.
      group_by: division year month
      priorities:
        type: proximity
        focus: idaho
    # Select a subset of data from the "nextregions" for
    # context. This example prioritizes strains that are
    # genetically related to the "idaho" subsampling set, but
    # you can remove the "priorities" block to get a random
    # global context instead.
    global:
      query: --query "(nextregions == 'yes')"
      max_date: --max-date 2021-06-01
      # As with the contextual data from the US above, tune 
      # this value to get a reasonable number of strains in
      # your build.
      max_sequences: 1000
      priorities:
        type: proximity
        focus: idaho
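
Before kicking off the full analysis, a Snakemake dry run from the ncov workflow directory can confirm that the configuration parses and preview the jobs that would execute:

# Dry run: show the planned jobs without executing them.
snakemake -n --configfile builds_idaho.yaml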

Run the workflow either with the Nextstrain CLI and Docker:

nextstrain build --cpus 4 . --configfile builds_idaho.yaml

Or, with Snakemake and the ncov workflow’s Conda environment:

snakemake --use-conda -j 4 --configfile builds_idaho.yaml

The resulting Auspice tree will live in auspice/ncov_idaho.json. You can drag-and-drop this tree file onto auspice.us or view it locally with the Nextstrain CLI:

nextstrain view auspice/

However, we noted earlier that 2.5 million genomes are a lot to align. This number will continue to increase over time, so the approach described above will not scale well in the future.

Yet another approach: using “open” data from GenBank

An alternative approach, which works for locales that deposit data into INSDC databases like GenBank in addition to GISAID, is to work from Nextstrain’s curated versions of these open data. Because these data are fully open, the Nextstrain team can share sanitized metadata and pre-aligned genomes through data.nextstrain.org. Working from these files, you can skip both the manual filtering of the metadata and the lengthy alignment step of the workflow.
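
For example, you can stream the first records of the open metadata to confirm the columns (like “division”) that the subsampling queries below rely on. This is a sketch using curl and the tsv-utils installed earlier; the column names reflect the files at the time of writing.

# Peek at selected columns of the open metadata without saving
# the whole file to disk first.
curl -fsSL https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz \
  | gzip -cd \
  | tsv-select -H -f strain,date,country,division \
  | head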

The following new build configuration, builds_open_idaho.yaml, defines a single input dataset from data.nextstrain.org. The workflow skips sanitizing the metadata (Nextstrain’s curation scripts have already done this) and skips alignment (the sequences are likewise aligned ahead of time). The other big difference with this approach is the subsampling scheme: the “global” context now samples evenly across all countries, years, and months to build the background that we previously got from the “nextregions” dataset.

# Define our input data. In this case, we use the full "open"
# dataset curated by Nextstrain and allow the subsampling
# scheme to perform all the required filtering on this single
# dataset.
inputs:
  - name: open
    metadata: https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
    aligned: https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz
    skip_sanitize_metadata: true

# Use the GenBank reference strain to root the time tree.
refine:
  root: "Wuhan-Hu-1/2019"

# Define a single build for the state of interest, Idaho.
# The build name will be "idaho" and it will use the custom
# subsampling scheme defined below.
builds:
  idaho:
    subsampling_scheme: idaho_scheme

# Define a single subsampling scheme for the state of Idaho.
# This analysis is for a specific date range, so we specify
# the same maximum collection date for strains in all sections
# of the subsampling scheme below.
subsampling:
  idaho_scheme:
    # Include data from Idaho. This is the "focal" set.
    idaho:
      query: --query "(country == 'USA') & (division == 'Idaho')"
      max_date: --max-date 2021-06-01
      # Limit the number of Idaho records included in the
      # analysis to a reasonable but large number. Tune this
      # number along with the other "max_sequences" values in the
      # sections below to keep your final build to <10,000
      # records.
      max_sequences: 5000
    # To understand transmission patterns within the US that
    # led to introductions to Idaho, we select a subset of USA
    # data from states other than Idaho with priority given to
    # strains that are genetically similar to the strains in
    # the "idaho" subsampling set defined above.
    usa:
      query: --query "(country == 'USA') & (division != 'Idaho')"
      max_date: --max-date 2021-06-01
      # This value sets a hard upper limit on how many strains
      # make it into the analysis. Tune this value, based on
      # your needs for the resulting tree.
      max_sequences: 1000
      # These group-by columns attempt to evenly sample across
      # US states by year and month. Sequences in each group
      # of state, year, and month are prioritized by genetic
      # proximity.
      group_by: division year month
      priorities:
        type: proximity
        focus: idaho
    # Select a subset of global data for context. This example
    # prioritizes strains that are genetically related to the
    # "idaho" subsampling set, but you can remove the
    # "priorities" block to get a random global context 
    # instead.
    global:
      query: --query "(country != 'USA')"
      max_date: --max-date 2021-06-01
      # Evenly sample across space and time.
      group_by: country year month
      # As with the contextual data from the US above, tune
      # this value to get a reasonable number of strains in
      # your build.
      max_sequences: 1000
      priorities:
        type: proximity
        focus: idaho

Run the new workflow either with the Nextstrain CLI and Docker:

nextstrain build --cpus 4 . --configfile builds_open_idaho.yaml

Or, with Snakemake and the ncov workflow’s Conda environment:

snakemake --use-conda -j 4 --configfile builds_open_idaho.yaml

As before, the resulting Auspice tree will live in auspice/ncov_idaho.json. You can drag-and-drop this tree file onto auspice.us or view it locally with the Nextstrain CLI:

nextstrain view auspice/

Closing thoughts

Even the open-data-based workflow remains somewhat computationally expensive, since it has to calculate genetic distances between all sequences in the input alignment and the focal “idaho” sequences. You can speed up this workflow substantially by removing the “priorities” blocks from the subsampling schemes in the build config above. The trade-off is that your contextual data will be randomly sampled, so you may miss strains closely related to the focal set that represent potential introductions.
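
For example, the “global” section of builds_open_idaho.yaml could be simplified to random, evenly grouped sampling like the sketch below (the other subsampling sections could keep their priorities or drop them the same way):

# Random global context: without a "priorities" block, the
# expensive genetic distance calculations are skipped.
global:
  query: --query "(country != 'USA')"
  max_date: --max-date 2021-06-01
  group_by: country year month
  max_sequences: 1000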

Thanks @jlhudd! This is very useful. I do the same pre-filtering of the entire GISAID metadata and FASTA files using R and Rsamtools, but it takes a couple of hours to index and filter the files.