Build for multiple counties?

Hello, I’d like to create a build for all of the Bay Area counties in CA (5 of them). Can one add multiple counties to the “division” line in the builds.yaml file? Right now we’re grepping for them from the metadata file and adding them to the “include” file. I’d like to have all genomes from the bay area and then subsampled from rest of tree. Also wondering if it’s possible to construct a build that has all genomes from a certain region, and then all of their least common ancestors (and that’s all)?
Thanks!

Hi @stacia! There is a way to do this, but it isn’t clear right now. We’re working on improving the docs and interface for this type of filtering, but for now here is how you can workaround the current system.

Since v8.0.0, augur filter supports a --query argument that allows you to make pandas-style queries on the metadata columns. These queries support complex boolean logic like nested ORs and ANDs. As an example, you could run the following command to get all sequences specifically from King and Snohomish counties in Washington:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --query "(country == 'USA') & (division == 'Washington') & (location in ['King County', 'Snohomish County'])" \
  --output counties.fasta

The filter command will run the given query on the metadata data frame and return all records for which the query evaluates as true.

If you are using our Snakemake workflow and defining your builds solely through the builds.yaml file, there is no explicit subsampling rule attribute for query yet (although it’s coming soon). But, the values of existing subsampling attributes like exclude and include get passed directly to the subsampling command, so you can use these to pass whatever extra command line arguments you like to augur filter.

For example, you could define a multi-county subsampling rule that looks like this:

subsampling:
  # Default subsampling logic for multiple adjacent counties.
  multicounty:
    # Focal samples for multiple counties
    bay_area:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(country == '{country}') & (division == '{division}') & (location in [{counties}])"
    # Contextual samples from the rest of the world that are genetically similar to county samples
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "bay_area"

Then you can define your build like so:

builds:
  bay-area:
    # Link this build to our custom multi-county subsampling scheme. 
    subsampling_scheme: multicounty
    region: North America
    country: USA
    division: California

    # Define as many counties as you like with appropriate quoting for the pandas query.
    counties: "'Sonoma', 'Napa', 'Marin', 'Solano', 'Contra Costa', 'Alameda', 'San Francisco', 'San Mateo', 'Santa Clara', 'Santa Cruz'"

Since the subsampling rules pull in any attributes you define in the build, this example defines the usual geographic attributes plus a list of counties that can be passed into the query argument. Note that there is some Inception-style quoting going on here to keep the YAML config, the command line, and pandas happy. This could definitely be implemented more cleanly.

To test your subsampling rule, you can run the Snakemake rule to only produce the multi-county sample.

snakemake \
  --profile my_profiles/your_profile \
  results/bay-area/sample-bay_area.fasta

An alternate implementation that doesn’t require as much mind-warping quoting would be to annotate your metadata file ahead of time with a custom column for the counties of interest. You could do this as a boolean integer field, for example is_bay_area, and then your subsampling scheme would look like this with a much simpler query argument:

subsampling:
  # Default subsampling logic for multiple adjacent counties.
  multicounty:
    # Focal samples for region
    bay_area:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(is_bay_area == 1)"
    # Contextual samples for region from the rest of the world
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: --query "(is_bay_area == 0)"
      priorities:
        type: "proximity"
        focus: "bay_area"

With this scheme, your build YAML would not have to explicitly list all of the counties. That said, I do slightly prefer the first implementation above since it more clearly documents the locations that belong in “the Bay Area” definition. Folks from San Francisco might have a different definition of “Bay Area” than someone living in the Concord, for instance. :wink:

Do you think these approaches would work for your analysis? If you do try it out, let us know if you have suggestions for how to improve the interface. Moving forward, we plan to add a query attribute for subsampling rules that will expose this kind of functionality more clearly.

1 Like

I’m sorry I missed this second question. I can’t think of a way that our current workflow supports the least common ancestors calculation without hacking into the Snakemake workflow itself. Possible solutions could include:

  1. Adding something like an --descending-order flag to the current priorities script such that the numbers it exports could be reversed and assigned to strains in that reversed order. This would require modifying this single script and wouldn’t support other more generic priorities.

  2. Add support for another type of priorities value to the subsampling rules besides “proximity” that allows users to specify their own predefined priorities file (something that lives in the profile, for example). This would be more generic, but it would require manual bioinformatics work outside of the automated workflow to maintain that custom priorities file.

  3. Create your own custom Snakemake rules in the profile’s config.yaml using the custom_rules attribute (e.g., custom_rules: ["my_profiles/example/custom_rules.smk"]). In the custom rules file, you could define your own rule to generate a priorities file. This rule would need to be given higher priority than the existing proximity_score rule and output files with the same name format ("results/{build_name}/proximity_{focus}.tsv"). This is the most generic approach and something you could do now without modifying any other code in the workflow. It is also the most complicated approach and requires a lot of Snakemake experience.

@rneher and @emmahodcroft have thought deeply on this issue and might have better recommendations than these.

If you are interested in pursuing any of these options, let me know and I can help figure out the specific implementation details.

@jlhudd has already covered this very comprehensively!! I can’t think of a better way for the 2nd question that what he has already proposed.

Mostly I just wanted to add that for the second method of the first question (to make a build with multiple counties), without the ‘mind-warp’ of the --query call. There’s an example of doing this in the example_advanced_customization/builds.yaml in the ncov repo: https://github.com/nextstrain/ncov/blob/master/my_profiles/example_advanced_customization/builds.yaml

You can see it isn the lac-leman build. You can see we take multiple ‘cantons’ (like states in the USA - ‘division’ level) which are near Lac Leman and make a build from this. You could easily do the same with counties (which are ‘location’ level). Perhaps it’s helpful to see this example and build on it! (I usually find it easier to modify examples/my existing builds than make new one from scratch!)

Thanks @jlhudd for the comprehensive answer and @emmahodcroft for the pointer to the example (indeed, always easier to build off example).