Methods to automate the list of pangos for augur filter?

Hi,
I frequently SARS-CoV-2 builds, and every time I need to update the list of pangos to filter on. For example like this:

builds:
    ba_2_86:
        pango_lineage: "['BA.2.86', 'BA.2.86.1', 'JN.1', 'JN.1.1', 'JN.1.1.1']"

Is there a clever way to update the list of pangos in the builds file? For example by referring to a text file of entries?

Is that example something that is already working, or is that an example of what you are trying to do? It may be easier to manage in native YAML:

builds:
    ba_2_86:
        pango_lineage:
        - BA.2.86
        - BA.2.86.1
        - JN.1
        - JN.1.1
        - JN.1.1.1

I havenā€™t tested, but this should work in a Snakemake rule that looks something like:

rule filter:
    params:
        include_where = ' '.join(f'pango_lineage="{lineage}"'
                                 for lineage in config['builds']['ba_2_86']['pango_lineage']),
    shell:
        """
        augur filter --include-where {params.include_where}
        """

If you want to keep the list in a separate text file, there should be a way in Snakemake to read the list from a text file into a list that can be used the same way as the example above.

Yes, my first example already works. But as you mention, maybe I can keep a separate file with a list of pangos and then make Snakemake access this the same way as it accesses the builds file through [ā€˜buildsā€™]? I will check it out.
Thanks

1 Like

@victorlin, may I ask you. In my builds.yaml file for the ncov workflow I currently specify a list of pango lineages as i mentioned above like this:

builds:
    ba_2_86:
        region: global
        country: Norway
        pango_lineage: "['BA.2.86', 'BA.2.86.1', 'JN.1', 'JN.1.1', 'JN.1.1.1']"
        subsampling_scheme: ba_2_86-scheme

And then I subsample like this:

subsampling:

  ba_2_86-scheme:
    country:
      min_date: "--min-date 2023-08-01"
      query: --query "(country == '{country}') & (pango_lineage in {pango_lineage})"
      # Skip diagnostics rule for in-house Norwegian sequences only
      skip_diagnostics: True
    related:
      min_date: "--min-date 2023-08-01"
      group_by: "country"
      max_sequences: 500
      sampling_scheme: "--probabilistic-sampling"
      query: --query "(country != '{country}') & (pango_lineage in {pango_lineage})"
      priorities:
        type: "proximity"
        focus: "country"

Iā€™m trying to figure out how the list of pango lineages actually gets into the augur filter command. Iā€™m not that familiar with Snakemake, but I see that thereā€™s a parameter ā€œquery_argumentā€ created by ā€œ_get_specific_subsampling_settingā€ that reads from the subsampling part of the builds.yaml file?

My goal is to avoid hard coding the pango list in the builds.yaml file. Could it be possible to instead refer to another file in builds.yaml that contains the list of pangos?

Hi @jonr,

First Iā€™ll explain how it works currently then propose how to use a separate file.

1: Explanation

Iā€™m trying to figure out how the list of pango lineages actually gets into the augur filter command.


In your example above, the lineages are defined as builds.ba_2_86.pango_lineage which corresponds to builds.<build>.<user-specified key> in the workflow config file reference, which says:

Builds support any named attributes that can be referenced by subsampling schemes.

What this means is that you can use {pango_lineage} as a value for any key under subsampling.<scheme>.<sample>, which you have done for subsampling.ba_2_86-scheme.*.query.

2: Separate file

Could it be possible to instead refer to another file in builds.yaml that contains the list of pangos?


This is not supported by the workflow currently but it should be possible by modifying main_workflow.smk.

Note that you might not want to use --include-where from my previous suggestion since it ignores other filter options. Using --query would continue to allow those pango lineage sequences to be filtered out by other options.

  1. Add a function to read the pango lineages from a file:

    def _get_pango_lineages(path):
        with open(path) as file:
            for line in file:
                yield line.rstrip()
    
  2. Modify the subsample rule to read the pango lineages into a param and use it in augur filter. Unfortunately, augur filter only accepts a single --query so you would have to move your current query from builds.yaml into the Snakemake file, which is complicated by the custom country key:

    rule subsample:
        params:
            custom_query = f'--query "(country != "{config['builds'][wildcards.build_name]['country']}") & (pango_lineage in {list(_get_pango_lineages(config['builds'][wildcards.build_name]['pango_lineage_file']))})"'
        shell:
            """
            augur filter \
                {params.custom_query}
            """
    

    In the future, Augur could support multiple --query options which would allow you to keep --query "(country != '{country}') in builds.yaml and use a simpler param value:

    rule subsample:
        params:
            custom_query = f'--query "(pango_lineage in {list(_get_pango_lineages(config['builds'][wildcards.build_name]['pango_lineage_file']))})"'
        shell:
            # params.query_argument is the value from builds.yaml
            """
            augur filter \
                {params.query_argument} \
                {params.custom_query}
            """
    

I did not test anything so you may need to make some edits.

Thanks @victorlin! I really appreciate this. I also hope this can be useful for others that routinely create SARS-CoV-2 builds.

Can I ask you what would be the best format of the file that holds the pango lineages for the augur filter query? Can it just be like:
"['BA.2.86', 'BA.2.86.1', 'JN.1', 'JN.1.1', 'JN.1.1.1']"

Apologies, I forgot to add some details.

  1. How to format the file?

    The _get_pango_lineages function defined above reads each line in the file as an entry. In your example, the file should look like:

    BA.2.86
    BA.2.86.1
    JN.1
    JN.1.1
    JN.1.1.1
    

    You can modify the _get_pango_lineages function if you want to use a different format.

  2. How to reference the file?

    The Snakemake workflow edit I shared above retrieves the file path from config as config['builds'][wildcards.build_name]['pango_lineage_file']. This means the file should be given in builds.yaml as:

    builds:
        ba_2_86:
            pango_lineage_file: path/to/pango_lineages.txt
    

    Note that the path should be relative to the workflow root directory (the folder containing Snakefile).

I hope this helps!

Thanks for all the help! I think it works, but first I got an error of unmatched bracket:
"f-string: unmatched '['"
I changed the single quotes around e.g. [ā€˜buildsā€™] to double quotes ([ā€œbuildsā€]) and it seemed to solve it. But now I get the error of wildcards not defined:

NameError in file /nextstrain/build/workflow/snakemake_rules/main_workflow.smk, line 315:
name 'wildcards' is not defined
  File "/nextstrain/build/Snakefile", line 167, in <module>
  File "/nextstrain/build/workflow/snakemake_rules/main_workflow.smk", line 315, in <module>

But I donā€™t understand this as the workflow seems to use the same format for accessing the build_name other places without problems?

wildcards is referencing Snakemakeā€™s wildcards which needs to be accessed within a function. I forgot to include that. Youā€™ll need to change the params to inline functions using lambda wildcards:

- custom_query = f'--query "(country != "{config['builds'][wildcards.build_name]['country']}") & (pango_lineage in {list(_get_pango_lineages(config['builds'][wildcards.build_name]['pango_lineage_file']))})"'
+ custom_query = lambda wildcards: f'--query "(country != "{config['builds'][wildcards.build_name]['country']}") & (pango_lineage in {list(_get_pango_lineages(config['builds'][wildcards.build_name]['pango_lineage_file']))})"'