Build for multiple counties?

jlhudd · July 10, 2020, 10:10pm

Hi @stacia! There is a way to do this, but it isn’t clear right now. We’re working on improving the docs and interface for this type of filtering, but for now here is how you can workaround the current system.

Since v8.0.0, augur filter supports a --query argument that allows you to make pandas-style queries on the metadata columns. These queries support complex boolean logic like nested ORs and ANDs. As an example, you could run the following command to get all sequences specifically from King and Snohomish counties in Washington:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --query "(country == 'USA') & (division == 'Washington') & (location in ['King County', 'Snohomish County'])" \
  --output counties.fasta

The filter command will run the given query on the metadata data frame and return all records for which the query evaluates as true.

If you are using our Snakemake workflow and defining your builds solely through the builds.yaml file, there is no explicit subsampling rule attribute for query yet (although it’s coming soon). But, the values of existing subsampling attributes like exclude and include get passed directly to the subsampling command, so you can use these to pass whatever extra command line arguments you like to augur filter.

For example, you could define a multi-county subsampling rule that looks like this:

subsampling:
  # Default subsampling logic for multiple adjacent counties.
  multicounty:
    # Focal samples for multiple counties
    bay_area:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(country == '{country}') & (division == '{division}') & (location in [{counties}])"
    # Contextual samples from the rest of the world that are genetically similar to county samples
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "bay_area"

Then you can define your build like so:

builds:
  bay-area:
    # Link this build to our custom multi-county subsampling scheme. 
    subsampling_scheme: multicounty
    region: North America
    country: USA
    division: California

    # Define as many counties as you like with appropriate quoting for the pandas query.
    counties: "'Sonoma', 'Napa', 'Marin', 'Solano', 'Contra Costa', 'Alameda', 'San Francisco', 'San Mateo', 'Santa Clara', 'Santa Cruz'"

Since the subsampling rules pull in any attributes you define in the build, this example defines the usual geographic attributes plus a list of counties that can be passed into the query argument. Note that there is some Inception-style quoting going on here to keep the YAML config, the command line, and pandas happy. This could definitely be implemented more cleanly.

To test your subsampling rule, you can run the Snakemake rule to only produce the multi-county sample.

snakemake \
  --profile my_profiles/your_profile \
  results/bay-area/sample-bay_area.fasta

An alternate implementation that doesn’t require as much mind-warping quoting would be to annotate your metadata file ahead of time with a custom column for the counties of interest. You could do this as a boolean integer field, for example is_bay_area, and then your subsampling scheme would look like this with a much simpler query argument:

subsampling:
  # Default subsampling logic for multiple adjacent counties.
  multicounty:
    # Focal samples for region
    bay_area:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(is_bay_area == 1)"
    # Contextual samples for region from the rest of the world
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: --query "(is_bay_area == 0)"
      priorities:
        type: "proximity"
        focus: "bay_area"

With this scheme, your build YAML would not have to explicitly list all of the counties. That said, I do slightly prefer the first implementation above since it more clearly documents the locations that belong in “the Bay Area” definition. Folks from San Francisco might have a different definition of “Bay Area” than someone living in the Concord, for instance.

Do you think these approaches would work for your analysis? If you do try it out, let us know if you have suggestions for how to improve the interface. Moving forward, we plan to add a query attribute for subsampling rules that will expose this kind of functionality more clearly.

Topic		Replies	Views
Subsampling and Data Download Help and Getting Started	2	564	March 19, 2021
Snakemake Q: passing list of param values @ CLI --> one build per value Help and Getting Started	4	978	November 23, 2020
Augur error while subsampling - updated Help and Getting Started	0	496	November 21, 2020
Subsampling sequences genetically related to a focal sample Help and Getting Started	0	453	January 14, 2022
How to set the multiple divisions data Help and Getting Started	1	348	November 14, 2021

Build for multiple counties?

Related topics