Failure when specifying multiple pango lineages in a build

I tried to make a build for only Pango lineage B.1.160 and all sub-lineages, but I get an error in augur filter which I think is due to the way I specified the list of lineages.

This is from my build file:

builds:
  turbuss:
    region: global
    country: Norway
    subsampling_scheme: turbuss-scheme
    pango_lineage: "'B.1.160', 'AB.1', 'B.1.160.1', 'B.1.160.2', 'B.1.160.3', 'B.1.160.4', 'B.1.160.5', 'B.1.160.6', 'B.1.160.7', 'B.1.160.8', 'B.1.160.9', 'B.1.160.10', 'B.1.160.11', 'B.1.160.12', 'B.1.160.14', 'B.1.160.15', 'B.1.160.16', 'B.1.160.17', 'B.1.160.18', 'B.1.160.19', 'B.1.160.20', 'B.1.160.21', 'B.1.160.22', 'B.1.160.23', 'B.1.160.24', 'B.1.160.25', 'B.1.160.26', 'B.1.160.27', 'B.1.160.28', 'B.1.160.29', 'B.1.160.30', 'B.1.160.31', 'B.1.160.32', 'B.1.160.33'"

subsampling:
  turbuss-scheme:
        country:
          group_by: "country"
          max_sequences: 4000
          # query: --query "(country == '{country}') & (pango_lineage == '{pango_lineage}')"
          query: --query "(country == '{country}') & (pango_lineage in '{pango_lineage}')"
          #query: --query "(country == '{country}') & ('{pango_lineage}' in pango_lineage)"

        related:
          group_by: "country year month"
          max_sequences: 1000
        # exclude: --exclude-where "country!='{country}'"
        # query: --query "(pango_lineage == '{pango_lineage}') & (country != '{country}') "
          query: --query "('{pango_lineage}' in pango_lineage) & (country != '{country}') "
        #  sampling_scheme: --probabilistic-sampling
          priorities:
            type: "proximity"

And this is from the output:

        augur filter             --sequences results/filtered_turbuss.fasta.xz             --metadata results/sanitized_metadata_turbuss.tsv.xz             --sequence-index results/combined_sequence_index.tsv.xz             --include my_profiles/fhi/include.txt             --exclude defaults/exclude.txt                                                                 --query "(country == 'Norway') & (pango_lineage in ''B.1.160', 'AB.1', 'B.1.160.1', 'B.1.160.2', 'B.1.160.3', 'B.1.160.4', 'B.1.160.5', 'B.1.160.6', 'B.1.160.7', 'B.1.160.8', 'B.1.160.9', 'B.1.160.10', 'B.1.160.11', 'B.1.160.12', 'B.1.160.14', 'B.1.160.15', 'B.1.160.16', 'B.1.160.17', 'B.1.160.18', 'B.1.160.19', 'B.1.160.20', 'B.1.160.21', 'B.1.160.22', 'B.1.160.23', 'B.1.160.24', 'B.1.160.25', 'B.1.160.26', 'B.1.160.27', 'B.1.160.28', 'B.1.160.29', 'B.1.160.30', 'B.1.160.31', 'B.1.160.32', 'B.1.160.33'')"                                       --group-by country                          --subsample-max-sequences 4000                          --output results/turbuss/sample-country.fasta             --output-strains results/turbuss/sample-country.txt 2>&1 | tee logs/subsample_turbuss_country.txt
        
Traceback (most recent call last):
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 45, in metadata
    metadata = metadata.query(self.query).copy()
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/frame.py", line 3469, in query
    res = self.eval(expr, **kwargs)
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/frame.py", line 3599, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 342, in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 798, in __init__
    self.terms = self.parse()
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 817, in parse
    return self._visitor.visit(self.expr)
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 397, in visit
    raise e
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 393, in visit
    node = ast.fix_missing_locations(ast.parse(clean))
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    (country =='Norway')and (pango_lineage in ''B .1 .160 ', 'AB .1 ', 'B .1 .160 .1 ', 'B .1 .160 .2 ', 'B .1 .160 .3 ', 'B .1 .160 .4 ', 'B .1 .160 .5 ', 'B .1 .160 .6 ', 'B .1 .160 .7 ', 'B .1 .160 .8 ', 'B .1 .160 .9 ', 'B .1 .160 .10 ', 'B .1 .160 .11 ', 'B .1 .160 .12 ', 'B .1 .160 .14 ', 'B .1 .160 .15 ', 'B .1 .160 .16 ', 'B .1 .160 .17 ', 'B .1 .160 .18 ', 'B .1 .160 .19 ', 'B .1 .160 .20 ', 'B .1 .160 .21 ', 'B .1 .160 .22 ', 'B .1 .160 .23 ', 'B .1 .160 .24 ', 'B .1 .160 .25 ', 'B .1 .160 .26 ', 'B .1 .160 .27 ', 'B .1 .160 .28 ', 'B .1 .160 .29 ', 'B .1 .160 .30 ', 'B .1 .160 .31 ', 'B .1 .160 .32 ', 'B .1 .160 .33 '')
                                                ^
SyntaxError: Python keyword not valid identifier in numexpr query

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jonr/miniconda3/envs/nextstrain/bin/augur", line 10, in <module>
    sys.exit(main())
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
    return args.__command__.run(args)
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/filter.py", line 312, in run
    filtered = set(filter_by_query(list(seq_keep), args.metadata, args.query))
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/filter.py", line 92, in filter_by_query
    filtered_meta_dict, _ = read_metadata(metadata_file, query)
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 74, in read_metadata
    return MetadataFile(fname, query).read()
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 21, in read
    self.check_metadata_duplicates()
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 55, in check_metadata_duplicates
    self.metadata[self.key_type]
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 47, in metadata
    raise ValueError(
ValueError: Error applying pandas query to metadata: `(country == 'Norway') & (pango_lineage in ''B.1.160', 'AB.1', 'B.1.160.1', 'B.1.160.2', 'B.1.160.3', 'B.1.160.4', 'B.1.160.5', 'B.1.160.6', 'B.1.160.7', 'B.1.160.8', 'B.1.160.9', 'B.1.160.10', 'B.1.160.11', 'B.1.160.12', 'B.1.160.14', 'B.1.160.15', 'B.1.160.16', 'B.1.160.17', 'B.1.160.18', 'B.1.160.19', 'B.1.160.20', 'B.1.160.21', 'B.1.160.22', 'B.1.160.23', 'B.1.160.24', 'B.1.160.25', 'B.1.160.26', 'B.1.160.27', 'B.1.160.28', 'B.1.160.29', 'B.1.160.30', 'B.1.160.31', 'B.1.160.32', 'B.1.160.33'')` (Python keyword not valid identifier in numexpr query (<unknown>, line 1))
Waiting at most 5 seconds for missing files.
MissingOutputException in line 368 of /home/jonr/Prosjekter/Nextstrain/ncov/workflow/snakemake_rules/main_workflow.smk:
Job Missing files after 5 seconds:
results/turbuss/sample-country.fasta
results/turbuss/sample-country.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 16 completed successfully, but some output files are missing. 16
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 589, in handle_job_success
  File "/home/jonr/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 252, in handle_job_success
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /home/jonr/Prosjekter/Nextstrain/ncov/.snakemake/log/2021-08-18T103107.376129.snakemake.log

I tried the command below, but it failed:
augur filter --metadata metadata.tsv --query "pangolin_lineage == 'C.37'" --min-date 2021-07-01 --max-date 2021-07-15 --exclude-ambiguous-dates-by any --output-strains teste_filter.txt
Then I noticed that the table's header is Pango lineage, not pangolin_lineage. However, using Pango lineage failed as well:
augur filter --metadata metadata.tsv --query "Pango lineage == 'C.37'" --min-date 2021-07-01 --max-date 2021-07-15 --exclude-ambiguous-dates-by any --output-strains teste_filter.txt

Any suggestions?

Hi @jonr. Your hunch is correct. You’ll need to specify your list of Pango lineages like this instead:

    pango_lineage: "['B.1.160', …, 'B.1.160.33']"

Note the enclosing square brackets to define a list.

Alternatively, I think you might be able to avoid some of this syntax nitpicking by defining the build variable as a YAML array (instead of a YAML string representation of a Python list):

    pango_lineage:
        - 'B.1.160'
        - …
        - 'B.1.160.33'

Regardless, you’ll also need to make sure that your query uses the condition (pango_lineage in {pango_lineage}). Note there are no quotes around the interpolated value and the interpolated value must be on the right-hand side of the in operator, not the left as in one of your examples.
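
To see why the quoting matters, here is a minimal Python sketch using only the standard library. It assumes the workflow fills the build values into the query template roughly the way `str.format` does (the `render` helper is a hypothetical stand-in, not the workflow's actual code) and uses `ast.parse` as a rough proxy for whether pandas could parse the resulting expression. The lineage list is shortened for readability.

```python
import ast


def render(template: str, **values) -> str:
    """Hypothetical stand-in for the workflow's interpolation of build values."""
    return template.format(**values)


def is_valid_expression(expr: str) -> bool:
    """Return True if expr parses as a Python expression."""
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False


# Original attempt: a string of quoted names, wrapped in quotes again by the template.
broken = render(
    "(country == '{country}') & (pango_lineage in '{pango_lineage}')",
    country="Norway",
    pango_lineage="'B.1.160', 'AB.1', 'B.1.160.1'",
)

# Suggested form: a list on the right-hand side of `in`, no quotes around the placeholder.
fixed = render(
    "(country == '{country}') & (pango_lineage in {pango_lineage})",
    country="Norway",
    pango_lineage=["B.1.160", "AB.1", "B.1.160.1"],
)

print(broken, is_valid_expression(broken))
print(fixed, is_valid_expression(fixed))
```

The first expression ends up with `''B.1.160'` after interpolation, which is the same doubled quote visible in the failing augur command above. Parsing as Python is necessary but not sufficient for a pandas query, so treat this as a quick sanity check rather than a full validation.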

I hope that helps!


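@mattoslmp You can quote column names that contain spaces (or other special characters) with backticks in your query. Single-quote the whole query so that the shell does not interpret the backticks as command substitution. For example:

--query '`Pango lineage` == "C.37"'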

Note that if you’re using Nextstrain’s metadata.tsv file, then the column should be pango_lineage, with an underscore instead of a space.
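
If the command is assembled in a script rather than typed by hand, the backticks need to reach augur intact, because a POSIX shell treats backticks inside double quotes as command substitution. A minimal sketch using Python's standard `shlex` module follows; the file names are taken from the command earlier in the thread, and the date options are left out for brevity.

```python
import shlex

# The query exactly as pandas should receive it, backticks included.
query = "`Pango lineage` == 'C.37'"

# Build the argument list first, then let shlex quote each argument for the shell.
argv = [
    "augur", "filter",
    "--metadata", "metadata.tsv",
    "--query", query,
    "--output-strains", "teste_filter.txt",
]

command_line = shlex.join(argv)  # shlex.join requires Python 3.8 or newer
print(command_line)
```

Passing `argv` directly to `subprocess.run` without a shell would avoid the quoting question altogether; the quoted string is mainly useful when the command has to be pasted into a terminal or a workflow file.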


Thank you @trs for your explanation

Thanks @trs !
This solved it!
