KeyError: 'Nextclade_pango' in ncov build

Hi there

My ncov builds have recently started failing with:

Error in rule join_metadata_and_nextclade_qc:
    jobid: 83
    output: results/southern_region_recent/metadata_with_nextclade_qc.tsv
    log: logs/join_metadata_and_nextclade_qc_southern_region_recent.txt (check log file(s) for error message)
    shell:
        
        python3 scripts/join-metadata-and-clades.py             results/southern_region_recent/southern_region_recent_subsampled_metadata.tsv.xz             results/southern_region_recent/nextclade_qc.tsv             -o results/southern_region_recent/metadata_with_nextclade_qc.tsv 2>&1 | tee logs/join_metadata_and_nextclade_qc_southern_region_recent.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: Submitted batch job 359420
Logfile logs/join_metadata_and_nextclade_qc_southern_region_recent.txt:
Traceback (most recent call last):
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Nextclade_pango'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/join-metadata-and-clades.py", line 150, in <module>
    main()
  File "scripts/join-metadata-and-clades.py", line 140, in main
    result[col] = result[col].fillna(VALUE_MISSING_DATA)
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'Nextclade_pango'

The complete log is here and the command run is snakemake --profile my_profiles/africa_recent -p. The build in question is this one.

I get the data from ncov-ingest and then filter by date with augur filter - none of that has changed on my side recently, so I’m not sure where this error is coming from.

Hi Peter,

I guess Nextclade_pango is not in your nextclade output. Are you using the latest nextclade dataset? We added this about 2 months ago. Could you check whether the nextclade input file has the column Nextclade_pango?
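For anyone hitting the same error, one quick way to answer that question is to read only the header row of the TSV. The following is a minimal sketch using just the Python standard library; the helper names `read_tsv_header` and `tsv_has_column` are hypothetical, and it assumes a tab-separated file with a single header line, optionally compressed with xz or gzip.

```python
import csv
import gzip
import lzma
from pathlib import Path


def read_tsv_header(path):
    """Return the column names from the first line of a (possibly compressed) TSV."""
    path = Path(path)
    # Pick an opener from the file extension; plain text is the default.
    if path.suffix == ".xz":
        opener = lzma.open
    elif path.suffix == ".gz":
        opener = gzip.open
    else:
        opener = open
    with opener(path, "rt", newline="") as handle:
        # An empty file yields an empty list instead of raising StopIteration.
        return next(csv.reader(handle, delimiter="\t"), [])


def tsv_has_column(path, column):
    """True if `column` appears in the header of the TSV at `path`."""
    return column in read_tsv_header(path)
```

With the paths from the error message, the check would be `tsv_has_column("results/southern_region_recent/nextclade_qc.tsv", "Nextclade_pango")`.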

best,
richard

In addition to what Richard said: besides an outdated nextclade dataset, the other possibility is an outdated Nextclade CLI version.

To get Nextclade_pango you need both: recent software and recent dataset.

My environment is set up as per workflow/envs/nextstrain.yaml so I think I’ve got the latest versions of everything (see output of conda list here).

The input data is from ncov-ingest, filtered with augur filter, so the fields in the metadata file are:

strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy

So indeed Nextclade_pango is not there. Reading the script that failed, it seems it expects the metadata file to have this column. At the point of failure this file contains the columns:

strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex Nextstrain_clade pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy missing_data divergence nonACGTN rare_mutations QC_missing_data QC_mixed_sites QC_rare_mutations QC_snp_clusters QC_frame_shifts QC_stop_codons frame_shifts insertions substitutions aa_substitutions clock_deviation global-open africa_recent

so Nextclade_pango is not among them either.

What step is meant to be adding this column?

Thanks,
Peter

Thanks for the extra info @pvanheus!

I think that link might be incorrect, as it points at Richard’s comment instead of some list of packages:

(see output of conda list here).

The problem looks to be in the ncov-ingest part of your pipeline then. It’s not enough just for the software to be up to date - you also need a recent sars-cov-2 dataset to get that Nextclade_pango column.

It could be that you have an outdated dataset, see the docs here: Nextclade datasets — Nextclade documentation

Whenever there’s a new dataset, we rerun Nextclade on all of the GISAID data with the new Nextclade dataset.

Theoretically, ncov-ingest downloads the most recent dataset as part of the pipeline, but I don’t know how exactly you’re using ingest.

Ingest should output a file called nextclade.tsv. To narrow down the location of the problem, can you locate that file and check its columns? It should contain Nextclade_pango.

Are you using ncov-ingest master? When we added Nextclade_pango, a small change to ingest was necessary to pass that column from nextclade.tsv to metadata.tsv: Merge pull request #291 from nextstrain/feat/nextclade-pango · nextstrain/ncov-ingest@76ee829 · GitHub

Sorry for the incorrect link. I am indeed using master of ncov-ingest. Here is the script that I run to get the latest data:

#!/bin/bash

#SBATCH -c 2
#SBATCH --mem=50G

set -e

cd /usr/people/pvh/ncov-ingest 
source $HOME/miniconda3/bin/activate 
GISAID_API_ENDPOINT=https://www.epicov.org/epi3/3p/hcov-19/export/export.json.bz2
GISAID_USERNAME_AND_PASSWORD=XXXXX:YYYYY 
export GISAID_API_ENDPOINT GISAID_USERNAME_AND_PASSWORD
# this doesn't need pipenv anymore
if [ -f data/gisaid.ndjson.new.bz2 ] ; then
  echo "Deleting old download"
  rm data/gisaid.ndjson.new*
fi

bin/fetch-from-gisaid data/gisaid.ndjson.new 

conda activate ncovingest
bin/transform-gisaid data/gisaid.ndjson.new
conda deactivate

conda activate nextstrain
augur index --sequences data/gisaid/sequences.fasta --output data/gisaid/sequences.fasta.index

The script is based on the instructions here. I don’t have a nextclade.tsv file in my data directory anywhere.

Ah, I see, I think we’re getting there! I think bin/transform-gisaid just separates the FASTA headers out into a metadata.tsv file. To get Nextclade_pango, however, you need to run nextclade on the sequences, and that doesn’t seem to happen here!

So there are two options now, I think.
a) You run full ingest, including nextclade, then you can get Nextclade_pango etc.
b) You edit ncov/join-metadata-and-clades.py at master · nextstrain/ncov · GitHub in your fork so that it doesn’t expect Nextclade_pango, specifically by removing this line: ncov/join-metadata-and-clades.py at 6ff941041e23ee7d4ab18a7e206671b403106d3b · nextstrain/ncov · GitHub
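
For option b, an alternative to deleting the entry would be to make the fill step tolerate absent columns. The following is a minimal sketch, not the actual script: only the name `VALUE_MISSING_DATA` and the `fillna` call appear in the traceback above, while the function name `fill_missing`, the `"?"` value and the column list passed in are assumptions for illustration.

```python
import pandas as pd

VALUE_MISSING_DATA = "?"  # assumed value; only the name appears in the traceback


def fill_missing(result, columns):
    """Fill NA values in each listed column, skipping columns that are absent."""
    result = result.copy()
    for col in columns:
        if col not in result.columns:
            # The column was never produced upstream; skip it instead of
            # letting result[col] raise KeyError.
            continue
        result[col] = result[col].fillna(VALUE_MISSING_DATA)
    return result
```

The trade-off is that a silently skipped column can hide a stale Nextclade dataset, so logging the skipped names may be worthwhile.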

I’m somewhat confused about why the join-metadata script has only started failing now. Where would it get information like "qc.snpClusters.status": "QC_snp_clusters" from if you haven’t run nextclade and so never got the nextclade.tsv information that gets added to metadata.tsv?

I thought the original error was complaining that the Nextclade_pango column does not exist in the file results/southern_region_recent/nextclade_qc.tsv. This file is generated here in line 483, with a fresh nextclade dataset fetched in every tree run.

Then join-metadata-and-clades.py will add what’s in nextclade_qc.tsv into the large metadata table, including the Nextclade_pango column. So this column doesn’t exist in metadata until this step is done.
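
The join described above can be sketched roughly as follows. This is not the actual scripts/join-metadata-and-clades.py (which uses pandas, as the traceback shows) but a standard-library illustration. The `"qc.snpClusters.status": "QC_snp_clusters"` pair is quoted earlier in this thread; the other mapping entries, the `strain` and `seqName` key columns, the `"?"` placeholder and the function names are assumptions. Mapped columns that are absent from the Nextclade output are skipped instead of raising a KeyError.

```python
import csv

# Nextclade column -> metadata column. Only the snpClusters pair is quoted in
# this thread; the other entries are illustrative assumptions.
COLUMN_MAP = {
    "clade": "Nextstrain_clade",
    "qc.snpClusters.status": "QC_snp_clusters",
    "Nextclade_pango": "Nextclade_pango",
}
MISSING = "?"  # assumed placeholder for missing values


def join_metadata_and_clades(metadata_rows, nextclade_rows,
                             meta_key="strain", clade_key="seqName"):
    """Left-join Nextclade columns onto metadata rows (lists of dicts)."""
    nextclade_rows = list(nextclade_rows)
    available = set(nextclade_rows[0]) if nextclade_rows else set()
    # Keep only the mapped columns that Nextclade actually produced.
    usable = {src: dst for src, dst in COLUMN_MAP.items() if src in available}
    by_strain = {row[clade_key]: row for row in nextclade_rows}

    joined = []
    for meta in metadata_rows:
        qc = by_strain.get(meta[meta_key], {})
        out = dict(meta)
        for src, dst in usable.items():
            # Strains without a Nextclade row, or with an empty value,
            # get the placeholder.
            out[dst] = qc.get(src) or MISSING
        joined.append(out)
    return joined


def read_tsv(path):
    """Read an uncompressed TSV file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

With a Nextclade output that lacks Nextclade_pango, the joined rows simply have no such column, which is the situation the original error message points at.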

In case this helps.