KeyError: UndefinedVariableError: name is not defined

Hi Nextstrain team,
I ran into a KeyError when running a new build.
When I searched similar threads on this platform, I found that the config file might be the issue, but the builds and config files I created look fine to me. I’d appreciate it if you could take a look.
Many thanks. I’ve copied the error message from the build run below:


   augur filter             --sequences results/combined_sequences_for_subsampling.fasta.xz             --metadata results/combined_metadata.tsv.xz             --exclude-all             --include results/NYU_rec_three/sample-all.txt             --output-sequences results/NYU_rec_three/NYU_rec_three_subsampled_sequences.fasta.xz             --output-metadata results/NYU_rec_three/NYU_rec_three_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_NYU_rec_three.txt

Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/scope.py", line 200, in resolve
    return self.resolvers[key]
  File "/usr/local/lib/python3.7/collections/__init__.py", line 916, in __getitem__
    return self.__missing__(key)            # support subclasses that define __missing__
  File "/usr/local/lib/python3.7/collections/__init__.py", line 908, in __missing__
    raise KeyError(key)
KeyError: 'recombinant'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/scope.py", line 211, in resolve
    return self.temps[key]
KeyError: 'recombinant'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/augur", line 33, in <module>
    sys.exit(load_entry_point('nextstrain-augur', 'console_scripts', 'augur')())
  File "/nextstrain/augur/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/nextstrain/augur/augur/__init__.py", line 75, in run
    return args.__command__.run(args)
  File "/nextstrain/augur/augur/filter.py", line 1370, in run
    include_by,
  File "/nextstrain/augur/augur/filter.py", line 780, in apply_filters
    **filter_kwargs,
  File "/usr/local/lib/python3.7/site-packages/pandas/core/generic.py", line 5430, in pipe
    return com.pipe(self, func, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/common.py", line 471, in pipe
    return func(obj, *args, **kwargs)
  File "/nextstrain/augur/augur/filter.py", line 239, in filter_by_query
    return set(metadata.query(query).index.values)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/frame.py", line 4060, in query
    res = self.eval(expr, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/frame.py", line 4191, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/eval.py", line 348, in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 806, in __init__
    self.terms = self.parse()
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 825, in parse
    return self._visitor.visit(self.expr)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 411, in visit
    return visitor(node, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 417, in visit_Module
    return self.visit(expr, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 411, in visit
    return visitor(node, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 420, in visit_Expr
    return self.visit(node.value, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 411, in visit
    return visitor(node, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 745, in visit_BoolOp
    return reduce(visitor, operands)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 739, in visitor
    rhs = self._try_visit_binop(y)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 734, in _try_visit_binop
    return self.visit(bop)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 411, in visit
    return visitor(node, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 745, in visit_BoolOp
    return reduce(visitor, operands)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 738, in visitor
    lhs = self._try_visit_binop(x)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 734, in _try_visit_binop
    return self.visit(bop)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 411, in visit
    return visitor(node, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 718, in visit_Compare
    return self.visit(binop)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 411, in visit
    return visitor(node, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 532, in visit_BinOp
    op, op_class, left, right = self._maybe_transform_eq_ne(node)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 452, in _maybe_transform_eq_ne
    left = self.visit(node.left, side="left")
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 411, in visit
    return visitor(node, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/expr.py", line 545, in visit_Name
    return self.term_type(node.id, self.env, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/ops.py", line 98, in __init__
    self._value = self._resolve_name()
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/ops.py", line 115, in _resolve_name
    res = self.env.resolve(self.local_name, is_local=self.is_local)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/computation/scope.py", line 216, in resolve
    raise UndefinedVariableError(key, is_local) from err
pandas.core.computation.ops.UndefinedVariableError: name 'recombinant' is not defined
[Fri Mar 18 14:48:17 2022]
Error in rule filter:
    jobid: 47
    output: results/subsampling/filtered.fasta, results/subsampling/filtered_log.tsv
    log: logs/filtered_subsampling.txt (check log file(s) for error message)
    shell:

        augur filter             --sequences results/subsampling/masked.fasta             --metadata results/subsampling/metadata_with_index.tsv             --include defaults/include.txt             --query "(\`North-America\` == 'yes' & _length >= 27000) | (\`recombinant\` == 'yes' & _length >= 15000) | (\`references\` == 'yes' & _length >= 27000) | (\`AY.45\` == 'yes' & _length >= 27000)"             --max-date 2022-03-19             --min-date 2019.74             --exclude-ambiguous-dates-by any             --exclude defaults/exclude.txt results/subsampling/excluded_by_diagnostics.txt             --exclude-where division='USA'            --output results/subsampling/filtered.fasta             --output-log results/subsampling/filtered_log.tsv 2>&1 | tee logs/filtered_subsampling.txt;

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job filter since they might be corrupted:
results/subsampling/filtered_log.tsv

Here is my build.yaml file (thanks!):


# Define inputs
inputs:
  - name: North-America
    metadata: data/rec/hcov_north-america_2022-03-14.tar.gz
    sequences: data/rec/hcov_north-america_2022-03-14.tar.gz
  - name: recombinant
    metadata: data/rec/gisaid_auspice_rec.tar
    sequences: data/rec/gisaid_auspice_rec.tar
  - name: references
    metadata: data/references_metadata.tsv
    sequences: data/references_sequences.fasta
  - name: AY.45
    metadata: data/rec/gisaid_auspice_AY.45_2021-10-1_2022_03_17.tar
    sequences: data/rec/gisaid_auspice_AY.45_2021-10-1_2022_03_17.tar

# Define builds
builds:
  NYU_rec_three:
    region: North America
    country: USA
    subsampling_scheme: focal-contextual

# Define subsampling schemes
  subsampling:
   focal-contextual:
    focal:
      query: --query "gisaid_epi_isl == 'EPI_ISL_10792641'"
    contextual:
      query: --query "gisaid_epi_isl != 'EPI_ISL_10792641'"
      max_sequences: 5000
      priorities:
        type: proximity
        focus: focal

# Define filter for quality
filter:
  recombinant:
    min_length: 24000 
#   skip_diagnostics: True  

files:
  auspice_config: "my_profiles/recomb/my_auspice_config_rec_RD.json"
  description: "my_profiles/recomb/my_description_rec_RD.MD"

Hi Ralf,

Thanks for providing so much output, that’s really helpful! (In the future, the output would be easier to read if you put code fences, i.e. triple backticks ```, around any script or code.)

I notice a few things that may get us closer to a solution:

  1. The first augur filter command takes .xz-compressed sequences and metadata as input. I’m not sure whether this is supported; @jlhudd knows best.
  2. You have two different augur filter commands, the one at the top of your post, and one at the bottom. They are quite different from each other. Which one creates the error?
  3. The error seems to suggest that the key recombinant wasn’t found in your metadata. You provide a relatively complicated query in your second filter command. My first guess would be that there’s an error in your query or that somehow there’s no recombinant column in your metadata. Do you still get an error if you remove the whole --query part?
  4. I don’t quite understand how you got to that query based on your build.yaml file, since no such query mentioning recombinant appears there.
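
To make point 3 concrete, here is a minimal sketch of how pandas behaves when a query names a column that is not in the table. The toy metadata and the helper `missing_columns` are made up for illustration; only the query shape is taken from the failing command above.

```python
import pandas as pd

# Toy metadata (hypothetical): it has a `North-America` column but no `recombinant` column.
metadata = pd.DataFrame(
    {
        "strain": ["a", "b", "c"],
        "North-America": ["yes", "yes", "no"],
        "_length": [29000, 26000, 29500],
    }
).set_index("strain")

# Same shape as the query in the failing command, shortened to two clauses.
query = (
    "(`North-America` == 'yes' & _length >= 27000) | "
    "(`recombinant` == 'yes' & _length >= 15000)"
)


def missing_columns(frame, names):
    """Return the names that are not columns of the frame (hypothetical helper)."""
    return [name for name in names if name not in frame.columns]


try:
    metadata.query(query)
    error_message = None
except NameError as err:
    # pandas' UndefinedVariableError is a subclass of NameError.
    error_message = str(err)

print(error_message)
print(missing_columns(metadata, ["North-America", "recombinant"]))
```

The same kind of check could be run against the real metadata file, for example by reading only its header with `pd.read_csv(path, sep="\t", nrows=0)` and looking at `.columns`.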

Best,

Cornelius

Hi Cornelius,
Thanks for looking into this complicated issue!
I will use the triple backticks next time (sorry for this!).
I was able to run the build now and I’m quickly summarizing what I changed and what I think caused the error:

  1. Following the Nextstrain tutorial video, I first used .xz-compressed sequences and metadata, but I have now decompressed the files to rule this out as a possible cause of the error.
  2. I agree, the filter and query were a bit complicated, but they didn’t cause the error.
  3. It turned out that the input entries caused the error:
inputs:
  - name: North-America
    metadata: data/rec/hcov_north-america_2022-03-14.tar.gz
    sequences: data/rec/hcov_north-america_2022-03-14.tar.gz
  - name: recombinant
    metadata: data/rec/gisaid_auspice_rec.tar
    sequences: data/rec/gisaid_auspice_rec.tar
  - name: references
    metadata: data/references_metadata.tsv
    sequences: data/references_sequences.fasta
  - name: AY.45
    metadata: data/rec/gisaid_auspice_AY.45_2021-10-1_2022_03_17.tar
    sequences: data/rec/gisaid_auspice_AY.45_2021-10-1_2022_03_17.tar

First, it seemed that the more input entries I had, the more likely I was to get an error, so I have limited it to three inputs for now.
Second, it was problematic to have too many sequences from a specific Pango lineage. When I used a more balanced Pango distribution (using a smaller AY.45 data set), the build worked.

So this is resolved! (I have an unrelated smaller issue for which I’m going to open another thread)
Many thanks for your valuable advice!

Ralf


Thanks for explaining how you solved it. I’m not sure how much I contributed. Sometimes writing a problem out is enough to figure it out oneself.

I checked: .xz-compressed inputs and outputs have been supported since v11 of Augur, so that will not have been the problem.

I don’t quite understand how having many inputs caused the problem; it shouldn’t. But maybe there was something problematic in those sequences or metadata, and when you made the dataset smaller, the problematic sequence disappeared.

I fully agree, there was something in this large dataset that caused the error (even though it was downloaded from GISAID).
Thanks for confirming that .xz compressed inputs work!

This could be a bug in our ncov build.

The per-input min length filter is implemented by constructing an augur filter query that conditions on the input name (as set in builds.yaml) as a column:
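
As a rough illustration of that idea only (this is not the ncov workflow's actual code; the function name `build_min_length_query` is hypothetical, and the lengths are copied from the failing command earlier in this thread), a query of that shape could be assembled like this:

```python
def build_min_length_query(min_lengths):
    """Join one clause per input; each clause conditions on a column named after the input."""
    clauses = [
        f"(`{name}` == 'yes' & _length >= {min_length})"
        for name, min_length in min_lengths.items()
    ]
    return " | ".join(clauses)


# Input names and minimum lengths as they appear in the failing augur filter command.
min_lengths = {
    "North-America": 27000,
    "recombinant": 15000,
    "references": 27000,
    "AY.45": 27000,
}

query = build_min_length_query(min_lengths)
print(query)
```

Each input name ends up as a backtick-quoted column reference, so the query only works if the metadata really has a column for every input.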

These columns named after the builds.yaml inputs are made during a metadata combination step:
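
A minimal sketch of what such a combination step might do, assuming each input's metadata is available as a pandas DataFrame. The function `combine_metadata` and the temporary `_origin` column are hypothetical names used for illustration, not the workflow's real implementation:

```python
import pandas as pd


def combine_metadata(inputs):
    """Concatenate per-input metadata and flag each row's origin with yes/no columns."""
    frames = []
    for name, frame in inputs.items():
        tagged = frame.copy()
        tagged["_origin"] = name  # remember which input each row came from
        frames.append(tagged)
    combined = pd.concat(frames, ignore_index=True)

    # Add one column per configured input, 'yes' for rows from that input, 'no' otherwise.
    for name in inputs:
        combined[name] = (combined["_origin"] == name).map({True: "yes", False: "no"})
    return combined.drop(columns="_origin")


# Toy inputs (hypothetical strains).
inputs = {
    "North-America": pd.DataFrame({"strain": ["a", "b"]}),
    "recombinant": pd.DataFrame({"strain": ["c"]}),
}

combined = combine_metadata(inputs)
print(combined)
```

Because this sketch loops over the configured input names rather than over the rows that happen to be present, every input gets its column regardless of how many records it contributes.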

In this case, one of the input datasets is named recombinant, and thus the eventual augur filter query constructed by the first code snippet above expects a generated column in the metadata called recombinant. However, it seems that this column can go missing under some condition. Perhaps subsampling is implicated: for example, none of the sequences from that input happen to make it through?
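
If that hypothesis is right, one possible workaround is to make sure every input name exists as a column before the query runs. The sketch below only illustrates the idea on toy data; `ensure_input_columns` is a hypothetical helper, not part of Augur or the ncov workflow.

```python
import pandas as pd


def ensure_input_columns(metadata, input_names, default="no"):
    """Return a copy of the metadata with a column for every input name."""
    fixed = metadata.copy()
    for name in input_names:
        if name not in fixed.columns:
            # Assumption: a missing column means no record came from that input.
            fixed[name] = default
    return fixed


# Toy metadata in which, by assumption, the `recombinant` column went missing.
metadata = pd.DataFrame(
    {
        "strain": ["a", "b"],
        "North-America": ["yes", "yes"],
        "_length": [29000, 26000],
    }
).set_index("strain")

query = (
    "(`North-America` == 'yes' & _length >= 27000) | "
    "(`recombinant` == 'yes' & _length >= 15000)"
)

fixed = ensure_input_columns(metadata, ["North-America", "recombinant"])
kept = sorted(fixed.query(query).index)
print(kept)
```

With the column filled in, the query evaluates instead of raising, and records are kept or dropped by the remaining clauses.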

Thank you, Thomas, that’s interesting!
Yes, none of the sequences made it through initially, but with subsampling, it somehow worked.
