Hi @stacia! There is a way to do this, but it isn’t clear right now. We’re working on improving the docs and interface for this type of filtering, but for now here is how you can workaround the current system.
Since v8.0.0, augur filter supports a --query
argument that allows you to make pandas-style queries on the metadata columns. These queries support complex boolean logic like nested ORs and ANDs. As an example, you could run the following command to get all sequences specifically from King and Snohomish counties in Washington:
augur filter \
--sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--query "(country == 'USA') & (division == 'Washington') & (location in ['King County', 'Snohomish County'])" \
--output counties.fasta
The filter command will run the given query on the metadata data frame and return all records for which the query evaluates as true.
If you are using our Snakemake workflow and defining your builds solely through the builds.yaml
file, there is no explicit subsampling rule attribute for query
yet (although it’s coming soon). But, the values of existing subsampling attributes like exclude
and include
get passed directly to the subsampling command, so you can use these to pass whatever extra command line arguments you like to augur filter.
For example, you could define a multi-county subsampling rule that looks like this:
subsampling:
# Default subsampling logic for multiple adjacent counties.
multicounty:
# Focal samples for multiple counties
bay_area:
group_by: "division year month"
seq_per_group: 48
include: --query "(country == '{country}') & (division == '{division}') & (location in [{counties}])"
# Contextual samples from the rest of the world that are genetically similar to county samples
global:
group_by: "country year month"
seq_per_group: 1
exclude: "--exclude-where 'region={region}'"
priorities:
type: "proximity"
focus: "bay_area"
Then you can define your build like so:
builds:
bay-area:
# Link this build to our custom multi-county subsampling scheme.
subsampling_scheme: multicounty
region: North America
country: USA
division: California
# Define as many counties as you like with appropriate quoting for the pandas query.
counties: "'Sonoma', 'Napa', 'Marin', 'Solano', 'Contra Costa', 'Alameda', 'San Francisco', 'San Mateo', 'Santa Clara', 'Santa Cruz'"
Since the subsampling rules pull in any attributes you define in the build, this example defines the usual geographic attributes plus a list of counties that can be passed into the query argument. Note that there is some Inception-style quoting going on here to keep the YAML config, the command line, and pandas happy. This could definitely be implemented more cleanly.
To test your subsampling rule, you can run the Snakemake rule to only produce the multi-county sample.
snakemake \
--profile my_profiles/your_profile \
results/bay-area/sample-bay_area.fasta
An alternate implementation that doesn’t require as much mind-warping quoting would be to annotate your metadata file ahead of time with a custom column for the counties of interest. You could do this as a boolean integer field, for example is_bay_area
, and then your subsampling scheme would look like this with a much simpler query argument:
subsampling:
# Default subsampling logic for multiple adjacent counties.
multicounty:
# Focal samples for region
bay_area:
group_by: "division year month"
seq_per_group: 48
include: --query "(is_bay_area == 1)"
# Contextual samples for region from the rest of the world
global:
group_by: "country year month"
seq_per_group: 1
exclude: --query "(is_bay_area == 0)"
priorities:
type: "proximity"
focus: "bay_area"
With this scheme, your build YAML would not have to explicitly list all of the counties. That said, I do slightly prefer the first implementation above since it more clearly documents the locations that belong in “the Bay Area” definition. Folks from San Francisco might have a different definition of “Bay Area” than someone living in the Concord, for instance.
Do you think these approaches would work for your analysis? If you do try it out, let us know if you have suggestions for how to improve the interface. Moving forward, we plan to add a query
attribute for subsampling rules that will expose this kind of functionality more clearly.