I havenāt tested, but this should work in a Snakemake rule that looks something like:
rule filter:
params:
include_where = ' '.join(f'pango_lineage="{lineage}"'
for lineage in config['builds']['ba_2_86']['pango_lineage']),
shell:
"""
augur filter --include-where {params.include_where}
"""
If you want to keep the list in a separate text file, there should be a way in Snakemake to read the list from a text file into a list that can be used the same way as the example above.
Yes, my first example already works. But as you mention, maybe I can keep a separate file with a list of pangos and then make Snakemake access this the same way as it accesses the builds file through [ābuildsā]? I will check it out.
Thanks
Iām trying to figure out how the list of pango lineages actually gets into the augur filter command. Iām not that familiar with Snakemake, but I see that thereās a parameter āquery_argumentā created by ā_get_specific_subsampling_settingā that reads from the subsampling part of the builds.yaml file?
My goal is to avoid hard coding the pango list in the builds.yaml file. Could it be possible to instead refer to another file in builds.yaml that contains the list of pangos?
First Iāll explain how it works currently then propose how to use a separate file.
1: Explanation
Iām trying to figure out how the list of pango lineages actually gets into the augur filter command.
In your example above, the lineages are defined as builds.ba_2_86.pango_lineage which corresponds to builds.<build>.<user-specified key> in the workflow config file reference, which says:
Builds support any named attributes that can be referenced by subsampling schemes.
What this means is that you can use {pango_lineage} as a value for any key under subsampling.<scheme>.<sample>, which you have done for subsampling.ba_2_86-scheme.*.query.
2: Separate file
Could it be possible to instead refer to another file in builds.yaml that contains the list of pangos?
This is not supported by the workflow currently but it should be possible by modifying main_workflow.smk.
Note that you might not want to use --include-where from my previous suggestion since it ignores other filter options. Using --query would continue to allow those pango lineage sequences to be filtered out by other options.
Add a function to read the pango lineages from a file:
def _get_pango_lineages(path):
with open(path) as file:
for line in file:
yield line.rstrip()
Modify the subsample rule to read the pango lineages into a param and use it in augur filter. Unfortunately, augur filter only accepts a single --query so you would have to move your current query from builds.yaml into the Snakemake file, which is complicated by the custom country key:
In the future, Augur could support multiple --query options which would allow you to keep --query "(country != '{country}') in builds.yaml and use a simpler param value:
rule subsample:
params:
custom_query = f'--query "(pango_lineage in {list(_get_pango_lineages(config['builds'][wildcards.build_name]['pango_lineage_file']))})"'
shell:
# params.query_argument is the value from builds.yaml
"""
augur filter \
{params.query_argument} \
{params.custom_query}
"""
I did not test anything so you may need to make some edits.
Thanks @victorlin! I really appreciate this. I also hope this can be useful for others that routinely create SARS-CoV-2 builds.
Can I ask you what would be the best format of the file that holds the pango lineages for the augur filter query? Can it just be like: "['BA.2.86', 'BA.2.86.1', 'JN.1', 'JN.1.1', 'JN.1.1.1']"
The _get_pango_lineages function defined above reads each line in the file as an entry. In your example, the file should look like:
BA.2.86
BA.2.86.1
JN.1
JN.1.1
JN.1.1.1
You can modify the _get_pango_lineages function if you want to use a different format.
How to reference the file?
The Snakemake workflow edit I shared above retrieves the file path from config as config['builds'][wildcards.build_name]['pango_lineage_file']. This means the file should be given in builds.yaml as:
Thanks for all the help! I think it works, but first I got an error of unmatched bracket: "f-string: unmatched '['"
I changed the single quotes around e.g. [ābuildsā] to double quotes ([ābuildsā]) and it seemed to solve it. But now I get the error of wildcards not defined:
NameError in file /nextstrain/build/workflow/snakemake_rules/main_workflow.smk, line 315:
name 'wildcards' is not defined
File "/nextstrain/build/Snakefile", line 167, in <module>
File "/nextstrain/build/workflow/snakemake_rules/main_workflow.smk", line 315, in <module>
But I donāt understand this as the workflow seems to use the same format for accessing the build_name other places without problems?
wildcards is referencing Snakemakeās wildcards which needs to be accessed within a function. I forgot to include that. Youāll need to change the params to inline functions using lambda wildcards: