Snakemake Q: passing list of param values @ CLI --> one build per value

sidneymbell · November 17, 2020, 5:51pm

Hello! Quick snakemake Q:

I have a subsampling schema, county, defined in my builds.yaml file.

I want something along the lines of:
$ snakemake --profile my_profile --counties foo bar baz
-->
something equivalent to

for c in counties: 
      focal_county = c
      <run build with focal county set to 'c'>

I assume I need to use a wildcard in the builds.yaml file wherein counties is a top-level param that can then be set via the command line, but I’m not quite sure where to get started from there?

jlhudd · November 18, 2020, 9:09pm

Welcome, @sidneymbell! Would something like the example code in this post help you achieve what you’re trying to do? This shows how to include multiple counties in a single focal set (bay-area).

Or are you more interested in creating a separate build per county? If so, you could set this up like so:

subsampling:
  # Default subsampling logic for a single county.
  county:
    # Focal samples for multiple counties
    county_focus:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(country == '{country}') & (division == '{division}') & (location == '{county}')"
    # Contextual samples from the rest of the world that are genetically similar to county samples
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "county_focus"

Then your builds can be defined like so:

builds:
  california-sonoma:
    subsampling_scheme: county
    region: North America
    country: USA
    division: California
    county: Sonoma

sidneymbell · November 19, 2020, 12:05am

Thanks so much, @jlhudd!

This is close to what I’m trying to do, but the trick is that which counties we need to run a build for varies day-to-day. So, I’m hoping to be able to provide a --config key value1 value2 arg on the CLI which then fills in the county field, one value per build.

For now, I started by just trying to get it running with one county which can be defined via the CLI.
I ended up with this, which works if I manually replace {county} with a single value. However, I haven’t found a way to define county = "Santa Clara County" / county: "Santa Clara County" as a referenced variable, either by specifying it at the top of my config file or my build file.

Any advice?

# county: "Santa Clara County"  
# county = "Santa Clara County" # these both fail

builds:
  county_only:
    subsampling_scheme: county_only
    region: "North America"
    country: "USA"
    division: "California"
    location: {county} # this fails
    location: "Santa Clara County" # this runs
    title: "COVID Tracker: Santa Clara County, CA"

  county_plus_context:
    subsampling_scheme: county_plus_context
    region: "North America"
    country: "USA"
    division: "California"
    location: {county}
    title: "COVIDTracker: Santa Clara County and related contextual samples"


# Subsampling schemas
subsampling:
  county_only:
    county: # sample over time to a max of 2000 sequences
      group_by: "year month"
      max_sequences: 2000
      query: --query "(location == '{county}')"

  county_plus_context:
    county: # sample over time to a max of 1500 sequences
      group_by: "year month"
      max_sequences: 1500
      query: --query "(location == '{county}')"
  
    state:
      group_by: "location year month" # sample per county over time, up to 5 sequences per county per month
      seq_per_group: 5
      query: --query "(location != '{county}') & (division == '{division}')" # exclude add'l samples from {county}
      priorities:
        type: "proximity"
        focus: "county"
  
    country:
      group_by: "division year month" # sample per state over time, up to 2 sequences per state per month
      seq_per_group: 2
      query: --query "(division != '{division}') & (country == '{country}')" # exclude add'l samples from CA
      priorities:
        type: "proximity"
        focus: "county"
  
    international:
      group_by: "region year month" # sample per region over time, up to 2 sequences per region per month
      seq_per_group: 2
      query: --query "(country != '{country}')" # exclude add'l samples from USA
      priorities:
          type: "proximity"
          focus: "county"

rsultana · November 20, 2020, 1:13am

Hi Sidney!
You could use the anchor and alias feature of YAML files to propagate the same county across your builds.yaml file, e.g.

    # county: "Santa Clara County"  
    county: &COUNTY "Santa Clara County"

    builds:
      county_only:
        subsampling_scheme: county_only
        region: "North America"
        country: "USA"
        division: "California"
        location: *COUNTY
        title: "COVID Tracker: Santa Clara County, CA"

      county_plus_context:
        subsampling_scheme: county_plus_context
        region: "North America"
        country: "USA"
        division: "California"
        location: *COUNTY
        title: "COVIDTracker: Santa Clara County and related contextual samples"

    # and so on ...

(basically creating a "county" config entry with a string value that propagates to multiple locations of your build definition)

The problem is that you can't override this config parameter from the snakemake command line. Or rather, you can, but it doesn't have the desired effect on the config dictionary (e.g. config['builds']['county_only']['location']), because that entry is already initialized from builds.yaml with the default value of the county config entry at the time builds.yaml is read. (I just tried that with the my_profiles/example nextstrain profile.)
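
To illustrate why the override does not propagate, here is a minimal Python sketch. It does not call Snakemake or a YAML parser; the dictionary is a hand-written stand-in for what a parser would return after resolving the alias, and `apply_cli_overrides` is a hypothetical helper that only approximates a command-line config update.

```python
import copy

# Stand-in for the parsed builds.yaml. By the time the file has been read,
# each *COUNTY alias has already been replaced by the anchored string, so
# every "location" entry holds its own value.
config = {
    "county": "Santa Clara County",
    "builds": {
        "county_only": {"location": "Santa Clara County"},
        "county_plus_context": {"location": "Santa Clara County"},
    },
}


def apply_cli_overrides(config, overrides):
    """Hypothetical helper: return a copy of config with top-level keys replaced."""
    updated = copy.deepcopy(config)
    updated.update(overrides)
    return updated


overridden = apply_cli_overrides(config, {"county": "Alameda County"})
print(overridden["county"])                             # Alameda County
print(overridden["builds"]["county_only"]["location"])  # Santa Clara County
```

The top-level entry changes, but the build definitions keep the value they were given when the file was parsed.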

So I think a better approach to your problem (that wouldn’t depend on major modifications in the snakemake rules) would be to wrap your call to snakemake in a script that first generates a builds.yaml config using a template (e.g. with jinja2).

Hope this helps,
Razvan

sidneymbell · November 23, 2020, 10:20pm

Thanks so much, @rsultana!

I learned a new thing (anchor & alias) – super appreciated

Based on input from you and @jlhudd, and my own experimentation, I think there’s a tradeoff – this kind of wildcard customization + cli override is possible, but not if we want to avoid changing the underlying rules (i.e., avoid forking and deviating from the standard build).
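
For anyone who does want to go the route of changing the workflow, the expansion step might look roughly like the sketch below. Everything in it is an assumption for illustration: the `county_template` build name, the comma-separated `counties` string (e.g. passed as a single `--config` value), and the `expand_county_builds` helper are not part of the standard build.

```python
import copy


def expand_county_builds(builds, counties_arg, template="county_template"):
    """Create one build per county from a template build entry (hypothetical)."""
    counties = [c.strip() for c in counties_arg.split(",") if c.strip()]
    base = builds[template]
    # Keep any other builds, but drop the template itself.
    expanded = {name: b for name, b in builds.items() if name != template}
    for county in counties:
        build = copy.deepcopy(base)
        build["location"] = county
        build["title"] = f"COVID Tracker: {county}"
        # Derive a build name without spaces, e.g. "santa-clara-county".
        expanded[county.lower().replace(" ", "-")] = build
    return expanded


builds = {
    "county_template": {
        "subsampling_scheme": "county_only",
        "region": "North America",
        "country": "USA",
        "division": "California",
    }
}
result = expand_county_builds(builds, "Santa Clara County, Sonoma")
print(sorted(result))  # ['santa-clara-county', 'sonoma']
```

The result would then have to replace the builds entry of the config before any rules read it, which is exactly the kind of change to the underlying workflow mentioned above.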

For now, I’m defaulting to a script that generates a builds.yaml file as you suggest. I’ll update here if we find some other solution!
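
A minimal sketch of such a generator script, under some assumptions: it uses `string.Template` from the Python standard library instead of jinja2 (so `$county` placeholders don't clash with the workflow's own `{county}`-style placeholders), it only renders the builds section, and the build names and `render_builds_yaml` helper are made up for illustration.

```python
from string import Template

# One build entry; $name and $county are filled in per county.
BUILD_TEMPLATE = Template("""\
  $name:
    subsampling_scheme: county_only
    region: "North America"
    country: "USA"
    division: "California"
    location: "$county"
    title: "COVID Tracker: $county, CA"
""")


def render_builds_yaml(counties):
    """Return the text of a builds section with one build per county."""
    entries = []
    for county in counties:
        # Hypothetical naming rule: lowercase, spaces replaced by underscores.
        name = county.lower().replace(" ", "_")
        entries.append(BUILD_TEMPLATE.substitute(name=name, county=county))
    return "builds:\n" + "".join(entries)


text = render_builds_yaml(["Santa Clara County", "Sonoma"])
print(text)
```

A wrapper could write this text into the profile's builds.yaml (together with the static subsampling section) and then call snakemake with the usual --profile argument.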
