Build for multiple counties?

stacia · July 10, 2020, 8:28pm

Hello, I’d like to create a build for all of the Bay Area counties in CA (5 of them). Can one add multiple counties to the “division” line in the builds.yaml file? Right now we’re grepping for them from the metadata file and adding them to the “include” file. I’d like to have all genomes from the bay area and then subsampled from rest of tree. Also wondering if it’s possible to construct a build that has all genomes from a certain region, and then all of their least common ancestors (and that’s all)?
Thanks!

jlhudd · July 10, 2020, 10:10pm

Hi @stacia! There is a way to do this, but it isn’t clear right now. We’re working on improving the docs and interface for this type of filtering, but for now here is how you can workaround the current system.

Since v8.0.0, augur filter supports a --query argument that allows you to make pandas-style queries on the metadata columns. These queries support complex boolean logic like nested ORs and ANDs. As an example, you could run the following command to get all sequences specifically from King and Snohomish counties in Washington:

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --query "(country == 'USA') & (division == 'Washington') & (location in ['King County', 'Snohomish County'])" \
  --output counties.fasta

The filter command will run the given query on the metadata data frame and return all records for which the query evaluates as true.

If you are using our Snakemake workflow and defining your builds solely through the builds.yaml file, there is no explicit subsampling rule attribute for query yet (although it’s coming soon). But, the values of existing subsampling attributes like exclude and include get passed directly to the subsampling command, so you can use these to pass whatever extra command line arguments you like to augur filter.

For example, you could define a multi-county subsampling rule that looks like this:

subsampling:
  # Default subsampling logic for multiple adjacent counties.
  multicounty:
    # Focal samples for multiple counties
    bay_area:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(country == '{country}') & (division == '{division}') & (location in [{counties}])"
    # Contextual samples from the rest of the world that are genetically similar to county samples
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "bay_area"

Then you can define your build like so:

builds:
  bay-area:
    # Link this build to our custom multi-county subsampling scheme. 
    subsampling_scheme: multicounty
    region: North America
    country: USA
    division: California

    # Define as many counties as you like with appropriate quoting for the pandas query.
    counties: "'Sonoma', 'Napa', 'Marin', 'Solano', 'Contra Costa', 'Alameda', 'San Francisco', 'San Mateo', 'Santa Clara', 'Santa Cruz'"

Since the subsampling rules pull in any attributes you define in the build, this example defines the usual geographic attributes plus a list of counties that can be passed into the query argument. Note that there is some Inception-style quoting going on here to keep the YAML config, the command line, and pandas happy. This could definitely be implemented more cleanly.

To test your subsampling rule, you can run the Snakemake rule to only produce the multi-county sample.

snakemake \
  --profile my_profiles/your_profile \
  results/bay-area/sample-bay_area.fasta

An alternate implementation that doesn’t require as much mind-warping quoting would be to annotate your metadata file ahead of time with a custom column for the counties of interest. You could do this as a boolean integer field, for example is_bay_area, and then your subsampling scheme would look like this with a much simpler query argument:

subsampling:
  # Default subsampling logic for multiple adjacent counties.
  multicounty:
    # Focal samples for region
    bay_area:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(is_bay_area == 1)"
    # Contextual samples for region from the rest of the world
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: --query "(is_bay_area == 0)"
      priorities:
        type: "proximity"
        focus: "bay_area"

With this scheme, your build YAML would not have to explicitly list all of the counties. That said, I do slightly prefer the first implementation above since it more clearly documents the locations that belong in “the Bay Area” definition. Folks from San Francisco might have a different definition of “Bay Area” than someone living in the Concord, for instance.

Do you think these approaches would work for your analysis? If you do try it out, let us know if you have suggestions for how to improve the interface. Moving forward, we plan to add a query attribute for subsampling rules that will expose this kind of functionality more clearly.

jlhudd · July 10, 2020, 10:35pm

I’m sorry I missed this second question. I can’t think of a way that our current workflow supports the least common ancestors calculation without hacking into the Snakemake workflow itself. Possible solutions could include:

Adding something like an --descending-order flag to the current priorities script such that the numbers it exports could be reversed and assigned to strains in that reversed order. This would require modifying this single script and wouldn’t support other more generic priorities.
Add support for another type of priorities value to the subsampling rules besides “proximity” that allows users to specify their own predefined priorities file (something that lives in the profile, for example). This would be more generic, but it would require manual bioinformatics work outside of the automated workflow to maintain that custom priorities file.
Create your own custom Snakemake rules in the profile’s config.yaml using the custom_rules attribute (e.g., custom_rules: ["my_profiles/example/custom_rules.smk"]). In the custom rules file, you could define your own rule to generate a priorities file. This rule would need to be given higher priority than the existing proximity_score rule and output files with the same name format ("results/{build_name}/proximity_{focus}.tsv"). This is the most generic approach and something you could do now without modifying any other code in the workflow. It is also the most complicated approach and requires a lot of Snakemake experience.

@rneher and @emmahodcroft have thought deeply on this issue and might have better recommendations than these.

If you are interested in pursuing any of these options, let me know and I can help figure out the specific implementation details.

emmahodcroft · July 13, 2020, 9:19am

@jlhudd has already covered this very comprehensively!! I can’t think of a better way for the 2nd question that what he has already proposed.

Mostly I just wanted to add that for the second method of the first question (to make a build with multiple counties), without the ‘mind-warp’ of the --query call. There’s an example of doing this in the example_advanced_customization/builds.yaml in the ncov repo: https://github.com/nextstrain/ncov/blob/master/my_profiles/example_advanced_customization/builds.yaml

You can see it isn the lac-leman build. You can see we take multiple ‘cantons’ (like states in the USA - ‘division’ level) which are near Lac Leman and make a build from this. You could easily do the same with counties (which are ‘location’ level). Perhaps it’s helpful to see this example and build on it! (I usually find it easier to modify examples/my existing builds than make new one from scratch!)

stacia · July 13, 2020, 2:10pm

Thanks @jlhudd for the comprehensive answer and @emmahodcroft for the pointer to the example (indeed, always easier to build off example).

Topic		Replies	Views
Subsampling sequences genetically related to a focal sample Help and Getting Started	0	459	January 14, 2022
Loss of country information in final auspice output Help and Getting Started	2	382	March 22, 2021
Snakemake Q: passing list of param values @ CLI --> one build per value Help and Getting Started	4	985	November 23, 2020
Help with Build Help and Getting Started	2	494	December 24, 2021
Diagnosing error + filtering issues Help and Getting Started	14	1656	November 9, 2020

Build for multiple counties?

Related topics