I have a subsampling scheme, county, defined in my builds.yaml file.
I want something along the lines of:
$ snakemake --profile my_profile --counties foo bar baz
which would be equivalent to:
for c in counties:
    focal_county = c
    <run build with focal county set to 'c'>
I assume I need to use a wildcard in the builds.yaml file wherein counties is a top-level param that can then be set via the command line, but I’m not quite sure where to get started from there?
Welcome, @sidneymbell! Would something like the example code in this post help you achieve what you’re trying to do? This shows how to include multiple counties in a single focal set (bay-area).
Or are you more interested in creating a separate build per county? If so, you could set this up like so:
subsampling:
  # Default subsampling logic for a single county.
  county:
    # Focal samples for multiple counties
    county_focus:
      group_by: "division year month"
      seq_per_group: 48
      include: --query "(country == '{country}') & (division == '{division}') & (location == '{county}')"
    # Contextual samples from the rest of the world that are genetically similar to county samples
    global:
      group_by: "country year month"
      seq_per_group: 1
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "county_focus"
Then your builds can be defined like so:
builds:
  california-sonoma:
    subsampling_scheme: county
    region: North America
    country: USA
    division: California
    county: Sonoma
This is close to what I’m trying to do, but the trick is that which counties we need to run a build for varies day-to-day. So, I’m hoping to be able to provide a --config key value1 value2 arg on the CLI which then fills in the county field, one value per build.
For now, I started by just trying to get it running with one county which can be defined via the CLI.
I ended up with this, which works if I manually replace {county} with a single value. However, I haven’t found a way to define county = "Santa Clara County" / county: "Santa Clara County" as a referenced variable, either by specifying it at the top of my config file or my build file.
Any advice?
# county: "Santa Clara County"
# county = "Santa Clara County" # these both fail
builds:
  county_only:
    subsampling_scheme: county_only
    region: "North America"
    country: "USA"
    division: "California"
    location: {county} # this fails
    location: "Santa Clara County" # this runs
    title: "COVID Tracker: Santa Clara County, CA"
  county_plus_context:
    subsampling_scheme: county_plus_context
    region: "North America"
    country: "USA"
    division: "California"
    location: {county}
    title: "COVIDTracker: Santa Clara County and related contextual samples"
# Subsampling schemas
subsampling:
  county_only:
    county: # sample over time to a max of 2000 sequences
      group_by: "year month"
      max_sequences: 2000
      query: --query "(location == '{county}')"
  county_plus_context:
    county: # sample over time to a max of 1500 sequences
      group_by: "year month"
      max_sequences: 1500
      query: --query "(location == '{county}')"
    state:
      group_by: "location year month" # sample per county over time, up to 5 sequences per county per month
      seq_per_group: 5
      query: --query "(location != '{county}') & (division == '{division}')" # exclude add'l samples from {county}
      priorities:
        type: "proximity"
        focus: "county"
    country:
      group_by: "division year month" # sample per state over time, up to 2 sequences per state per month
      seq_per_group: 2
      query: --query "(division != '{division}') & (country == '{country}')" # exclude add'l samples from CA
      priorities:
        type: "proximity"
        focus: "county"
    international:
      group_by: "region year month" # sample per region over time, up to 2 sequences per region per month
      seq_per_group: 2
      query: --query "(country != '{country}')" # exclude add'l samples from USA
      priorities:
        type: "proximity"
        focus: "county"
Hi Sidney!
You could use the anchor and alias feature of YAML files to propagate the same county across your builds.yaml file, e.g.
# county: "Santa Clara County"
county: &COUNTY "Santa Clara County"
builds:
  county_only:
    subsampling_scheme: county_only
    region: "North America"
    country: "USA"
    division: "California"
    location: *COUNTY
    title: "COVID Tracker: Santa Clara County, CA"
  county_plus_context:
    subsampling_scheme: county_plus_context
    region: "North America"
    country: "USA"
    division: "California"
    location: *COUNTY
    title: "COVIDTracker: Santa Clara County and related contextual samples"
# and so on ...
(basically creating a "county" config entry with a string value that propagates to multiple locations of your build definition)
The problem is that you can’t usefully override this config parameter from the snakemake command line. You can set it, but that doesn’t have the desired effect on the config dictionary (e.g. config['builds']['county_only']['location']), because that entry is already initialized from builds.yaml with the default value of the county config entry at the time builds.yaml is read. (I just tried that with the my_profiles/example nextstrain profile.)
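To make that concrete, here is a small sketch in plain Python. It does not call a YAML parser or snakemake; it assumes the parsed config ends up as an ordinary nested dictionary (as the config[...] notation above suggests) and that a command-line override only replaces the top-level county key. Both are assumptions for illustration.

```python
# What the config looks like once the YAML has been parsed: the alias
# (*COUNTY) has already been expanded, so only plain values remain.
default_county = "Santa Clara County"
config = {
    "county": default_county,
    "builds": {
        "county_only": {"location": default_county},
        "county_plus_context": {"location": default_county},
    },
}

# Simulated command-line override (assumption: it replaces only the
# top-level "county" entry of the already-loaded config).
config["county"] = "Sonoma County"

# The top-level key changed, but the builds still carry the old value.
print(config["county"])
print(config["builds"]["county_only"]["location"])
```

The nested location entries were filled in at parse time, so nothing links them back to the top-level key afterwards.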
So I think a better approach to your problem (that wouldn’t depend on major modifications in the snakemake rules) would be to wrap your call to snakemake in a script that first generates a builds.yaml config using a template (e.g. with jinja2).
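A rough sketch of what such a wrapper could look like, generating one build per county. To keep it dependency-free it uses the standard library's string.Template in place of jinja2; the idea is the same. The names BUILD_TEMPLATE, slugify, render_builds and write_builds are made up for this example, and the build fields are copied from the county_plus_context build earlier in the thread.

```python
from string import Template

# Hypothetical per-build template. string.Template fills in "${...}" only,
# so any "{...}" placeholders used by the workflow would be left untouched.
BUILD_TEMPLATE = Template("""\
  ${slug}:
    subsampling_scheme: county_plus_context
    region: "North America"
    country: "USA"
    division: "California"
    location: "${county}"
    title: "COVID Tracker: ${county}, CA"
""")


def slugify(county):
    """Turn 'Santa Clara County' into 'santa-clara-county' for the build name."""
    return "-".join(county.lower().split())


def render_builds(counties):
    """Return the text of a `builds:` section with one build per county."""
    blocks = [
        BUILD_TEMPLATE.substitute(slug=slugify(c), county=c) for c in counties
    ]
    return "builds:\n" + "".join(blocks)


def write_builds(counties, path="builds.yaml"):
    """Write the rendered builds section to `path` and return the path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_builds(counties))
    return path


# Example: preview the generated section for two counties.
print(render_builds(["Santa Clara County", "Sonoma County"]))
```

After calling write_builds with the day's list of counties, snakemake can be run as usual with the profile pointing at the generated file. This sketch only writes the builds section; the static subsampling section would still need to be included, for example by appending it to the generated text. County names containing double quotes would need escaping before being placed in the template.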
I learned a new thing (anchor & alias) – super appreciated
Based on input from you and @jlhudd, and my own experimentation, I think there’s a tradeoff – this kind of wildcard customization + cli override is possible, but not if we want to avoid changing the underlying rules (i.e., avoid forking and deviating from the standard build).
For now, I’m defaulting to a script that generates a builds.yaml file as you suggest. I’ll update here if we find some other solution!