Pandas error during augur filter: 'NoneType' object has no attribute 'str'

Hi,
I get this error when running augur filter, and I’m not quite sure how to investigate the problem. I run this build every week with Gisaid data, so I suspect there might be some strange strain names or something?

[Thu Aug 10 08:27:43 2023]
Job 16: 
        Subsample all sequences by 'country' scheme for build 'omicron_xbb' with the following parameters:

         - group by: 
         - sequences per group: 
         - subsample max sequences: 
         - min-date: --min-date 2023-01-01
         - max-date: 
         - 
         - exclude: 
         - include: 
         - query: --query '(country == '"'"'Norway'"'"') & (pango_lineage.str.startswith('"'"'XBB'"'"'))'
         - priority: 
        
Reason: Missing output files: results/omicron_xbb/sample-country.txt; Input files updated by another job: results/combined_metadata.tsv.xz


        augur filter             --metadata results/combined_metadata.tsv.xz             --include defaults/include.txt             --exclude defaults/exclude.txt             --min-date 2023-01-01                                                    --query '(country == '"'"'Norway'"'"') & (pango_lineage.str.startswith('"'"'XBB'"'"'))'                                                                                           --output-strains results/omicron_xbb/sample-country.txt 2>&1 | tee logs/subsample_omicron_xbb_country.txt
        
ERROR: Internal Pandas error when applying query:
	'NoneType' object has no attribute 'str'
Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.
[Thu Aug 10 08:27:48 2023]
Error in rule subsample:
    jobid: 16
    input: results/combined_metadata.tsv.xz, defaults/include.txt, defaults/include.txt, defaults/exclude.txt
    output: results/omicron_xbb/sample-country.txt
    log: logs/subsample_omicron_xbb_country.txt (check log file(s) for error details)
    conda-env: /nextstrain/build/.snakemake/conda/82d2d2badedd44ba5b8338b34064ad7d_
    shell:
        
        augur filter             --metadata results/combined_metadata.tsv.xz             --include defaults/include.txt             --exclude defaults/exclude.txt             --min-date 2023-01-01                                                    --query '(country == '"'"'Norway'"'"') & (pango_lineage.str.startswith('"'"'XBB'"'"'))'                                                                                           --output-strains results/omicron_xbb/sample-country.txt 2>&1 | tee logs/subsample_omicron_xbb_country.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Thanks for reporting this issue!

Without the metadata I can’t reproduce this, but my guess is that the pango_lineage field is empty in some of your metadata rows. Incidentally, the most recent data on GISAID currently says “designation in progress”, which might cause the pango_lineage column to be empty.

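I’ve put line breaks in your code so it’s easier to look at the command:

augur filter \
  --metadata results/combined_metadata.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --min-date 2023-01-01 \
  --query '(country == '"'"'Norway'"'"') & (pango_lineage.str.startswith('"'"'XBB'"'"'))' \
  --output-strains results/omicron_xbb/sample-country.txt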

One way to work around it is to pre-filter your data. For example, with tsv-utils (mamba install tsv-utils) you should be able to do:

xzcat results/combined_metadata.tsv.xz \
| tsv-filter -H --not-empty pango_lineage \
>results/filtered_metadata.tsv
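
If tsv-utils is not available, roughly the same pre-filter can be sketched in pandas. This is only an illustration on a tiny made-up table: the helper name drop_empty_lineages and the example rows are hypothetical, and every column is read as a string so empty fields are easy to detect.

```python
import io

import pandas as pd


def drop_empty_lineages(metadata, column="pango_lineage"):
    """Return only the rows whose `column` is non-empty (hypothetical helper)."""
    # With dtype=str, empty fields are read as NaN; turn them back into "".
    values = metadata[column].fillna("")
    return metadata[values != ""]


# Made-up rows standing in for results/combined_metadata.tsv.xz.
example = io.StringIO(
    "strain\tcountry\tpango_lineage\n"
    "s1\tNorway\tXBB.1.5\n"
    "s2\tNorway\t\n"
    "s3\tSweden\tEG.5.1\n"
)
metadata = pd.read_csv(example, sep="\t", dtype=str)
filtered = drop_empty_lineages(metadata)

# Write the filtered table as TSV without an extra trailing newline.
print(filtered.to_csv(sep="\t", index=False), end="")
```

For the real file, passing the .tsv.xz path to pd.read_csv should work as well, since pandas infers xz compression from the file extension.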

Using that .str method is something that I think we don’t officially support. But @victorlin might have some thoughts.

By the way, if your goal is to get all the Norwegian sequences that belong to XBB, the current query will only return a subset of XBB-lineages, as it ignores aliases.

For example EG.5.1 descends from XBB but wouldn’t be returned by your query.

To filter across aliases, you could have a look at the pango_aliasor package that makes it easy to unalias lineages.
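
To make the alias problem concrete, here is a small sketch that does not use pango_aliasor at all. It hard-codes a single entry of the alias key (EG is an alias of XBB.1.9.2) and compares a prefix match on the aliased and the unaliased names. The ALIASES dict, the unalias helper, and the example rows are made up for illustration; the real alias key is much larger.

```python
import pandas as pd

# Illustrative one-entry subset of the Pango alias key (assumption: the
# real key, which pango_aliasor loads, contains many more entries).
ALIASES = {"EG": "XBB.1.9.2"}


def unalias(lineage):
    """Expand the leading alias of a lineage name if it is in ALIASES."""
    prefix, _, rest = lineage.partition(".")
    full_prefix = ALIASES.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix


# Made-up metadata rows.
metadata = pd.DataFrame(
    {
        "strain": ["s1", "s2", "s3"],
        "country": ["Norway", "Norway", "Norway"],
        "pango_lineage": ["XBB.1.5", "EG.5.1", "BA.2"],
    }
)
metadata["pango_lineage_unaliased"] = metadata["pango_lineage"].map(unalias)

# Prefix match on the aliased name misses EG.5.1 ...
aliased_hits = metadata[metadata["pango_lineage"].str.startswith("XBB")]
# ... while the unaliased name (XBB.1.9.2.5.1) is matched.
unaliased_hits = metadata[metadata["pango_lineage_unaliased"].str.startswith("XBB")]

print(aliased_hits["strain"].tolist())
print(unaliased_hits["strain"].tolist())
```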

I’ve written a quick script that adds a new column to the input tsv so you can use your workflow as is. Note that I fill empty pango_lineage fields with ? so that the prefiltering is no longer necessary if you use that python script:

import argparse
import sys

import pandas as pd
from pango_aliasor.aliasor import Aliasor


def add_unaliased_column(tsv_file_path, pango_column='pango_lineage', unaliased_column='pango_lineage_unaliased'):
    aliasor = Aliasor()

    def uncompress_lineage(lineage):
        # Empty or missing lineages become "?" so .str methods still work downstream.
        if not lineage or pd.isna(lineage):
            return "?"
        return aliasor.uncompress(lineage)

    # Read every column as a string so other metadata values are not re-typed
    # (for example, integers turning into floats) when written back out.
    df = pd.read_csv(tsv_file_path, sep='\t', dtype=str)
    df[unaliased_column] = df[pango_column].apply(uncompress_lineage)
    return df


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Add unaliased Pango lineage column to a TSV file.')
    parser.add_argument('--input-tsv', required=True, help='Path to the input TSV file.')
    parser.add_argument('--pango-column', default='pango_lineage', help='Name of the column to use for the Pango lineage.')
    parser.add_argument('--unaliased-column', default='pango_lineage_unaliased', help='Name of the column to use for the unaliased Pango lineage.')
    args = parser.parse_args()
    df = add_unaliased_column(args.input_tsv, args.pango_column, args.unaliased_column)
    # Write the TSV to stdout directly to avoid the extra blank line print() would add.
    df.to_csv(sys.stdout, sep='\t', index=False)

I think it’s fine to use .str since augur filter --query supports arbitrary Pandas query syntax.

The main problem here is that Pandas data type inference is being applied to all metadata columns. For the chunk of metadata being processed, if all values under a column are empty, then it will be a None object which does not support .str.

This is unwanted behavior and would be resolved by reading metadata columns as string type, which is what I’m working on in this PR.
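
The type inference described above can be reproduced with plain pandas, outside of Augur. The sketch below uses a made-up two-row table whose pango_lineage column is entirely empty. It is not Augur's code, and the error message pandas gives here differs from the 'NoneType' one in the original post; it only shows why reading the column as strings avoids this class of problem.

```python
import io

import pandas as pd

# Made-up metadata chunk in which every pango_lineage value is empty.
TSV = "strain\tpango_lineage\ns1\t\ns2\t\n"

# Default behaviour: the all-empty column is inferred as a float column of NaN.
inferred = pd.read_csv(io.StringIO(TSV), sep="\t")
print(inferred["pango_lineage"].dtype)

try:
    inferred["pango_lineage"].str.startswith("XBB")
    str_accessor_failed = False
except AttributeError as error:
    # The .str accessor is only available for string-like columns.
    str_accessor_failed = True
    print(f"AttributeError: {error}")

# Reading everything as strings, and keeping empty fields as "", lets .str work.
as_strings = pd.read_csv(io.StringIO(TSV), sep="\t", dtype=str, na_filter=False)
matches = as_strings["pango_lineage"].str.startswith("XBB")
print(matches.tolist())
```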

Actually, I just realized the reason for this bug is separate from the problem I described above. I’ve written up a summary in a GitHub issue.

This is a new bug as of Augur 22.2.0, so for an immediate fix you can downgrade to an older version. If using the Docker runtime, this is how to use the last Docker image before the upgrade to Augur 22.2.0:

nextstrain build --image nextstrain/base:build-20230720T001758Z

Thanks @victorlin and @corneliusroemer. This was very helpful!
I do want to get all Norwegian XBB strains, but also relevant strains from other countries. And I also need to include some local unpublished data. For the moment I download the entire Gisaid metadata and fasta and pre-filter them using an R-script to get only XBB’s. So I will add a check for empty fields.

For the moment I download the entire Gisaid metadata and fasta and pre-filter them using an R-script to get only XBB’s

Just out of curiosity, does the R-script handle the cases of aliased XBB sublineages?

Victor appears to have fixed the bug with .str and released the fix in Augur 22.3.0.

@jonr let us know if the problem is still there with that Augur version.

@corneliusroemer sorry for the late answer.
Yes, in some cases I read this file and get the abbreviated lineage names based on the full pango nomenclature. And then further subset the gisaid metadata based on the lineage abbreviations.

Awesome that the bug is fixed! I will start a new run this afternoon on the same metadata as last time and see if it works.

Works like a charm now :)
