KeyError: 'Nextclade_pango' in ncov build

pvanheus · May 1, 2022, 4:58pm

Hi there

My ncov builds have recently started failing with:

Error in rule join_metadata_and_nextclade_qc:
    jobid: 83
    output: results/southern_region_recent/metadata_with_nextclade_qc.tsv
    log: logs/join_metadata_and_nextclade_qc_southern_region_recent.txt (check log file(s) for error message)
    shell:
        
        python3 scripts/join-metadata-and-clades.py             results/southern_region_recent/southern_region_recent_subsampled_metadata.tsv.xz             results/southern_region_recent/nextclade_qc.tsv             -o results/southern_region_recent/metadata_with_nextclade_qc.tsv 2>&1 | tee logs/join_metadata_and_nextclade_qc_southern_region_recent.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: Submitted batch job 359420
Logfile logs/join_metadata_and_nextclade_qc_southern_region_recent.txt:
Traceback (most recent call last):
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Nextclade_pango'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/join-metadata-and-clades.py", line 150, in <module>
    main()
  File "scripts/join-metadata-and-clades.py", line 140, in main
    result[col] = result[col].fillna(VALUE_MISSING_DATA)
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'Nextclade_pango'

The complete log is here and the command run is snakemake --profile my_profiles/africa_recent -p. The build in question is this one.

I get the data from ncov-ingest and then filter by date with augur filter - none of that has changed on my side recently, so I’m not sure where this error is coming from.

rneher · May 2, 2022, 10:35am

Hi Peter,

I guess Nextclade_pango is not in your nextclade output. Are you using the latest nextclade dataset? We added this about 2 months ago. Could you check whether the nextclade input file has the column Nextclade_pango?

best,
richard

corneliusroemer · May 2, 2022, 6:23pm

In addition to what Richard said, the other possibility besides not using the latest nextclade dataset is not using the latest Nextclade CLI version.

To get Nextclade_pango you need both: recent software and recent dataset.

pvanheus · May 3, 2022, 9:14am

My environment is set up as per workflow/envs/nextstrain.yaml so I think I’ve got the latest versions of everything (see output of conda list here).

The input data is from ncov-ingest, filtered with augur filter, so the fields in the metadata file are:

strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy

So indeed Nextclade_pango is not there. Reading the script that failed, it seems it expects the metadata file to have this column. At the point of failure this file contains the columns:

strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex Nextstrain_clade pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy missing_data divergence nonACGTN rare_mutations QC_missing_data QC_mixed_sites QC_rare_mutations QC_snp_clusters QC_frame_shifts QC_stop_codons frame_shifts insertions substitutions aa_substitutions clock_deviation global-open africa_recent

and thus not Nextclade_pango.

What step is meant to be adding this column?

Thanks,
Peter

corneliusroemer · May 3, 2022, 1:15pm

Thanks for the extra info @pvanheus!

I think that link might be incorrect, as it points at Richard’s comment instead of some list of packages:

"(see output of conda list here)."

The problem looks to be in the ncov-ingest part of your pipeline then. It’s not enough just for the software to be up to date - you also need a recent sars-cov-2 dataset to get that Nextclade_pango column.

It could be that you have an outdated dataset, see the docs here: Nextclade datasets — Nextclade documentation

Whenever there’s a new dataset, we rerun Nextclade on all of the GISAID data with the new Nextclade dataset.

Theoretically, ncov-ingest downloads the most recent dataset as part of the pipeline, but I don’t know how exactly you’re using ingest.

Ingest should output a file called nextclade.tsv. To narrow down the location of the problem, can you locate that file and check its columns? It should contain Nextclade_pango.
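
One way to run that check without opening the whole file is to read only the header line. Below is a minimal Python sketch, assuming the file is a tab-separated table that may be xz- or gzip-compressed. The helper names read_tsv_header and has_column are made up for illustration and are not part of ncov or ncov-ingest.

```python
import csv
import gzip
import lzma
from pathlib import Path


def read_tsv_header(path):
    """Return the column names from the first line of a (possibly compressed) TSV."""
    path = Path(path)
    # Pick an opener from the file extension; plain text is the fallback.
    if path.suffix == ".xz":
        opener = lzma.open
    elif path.suffix == ".gz":
        opener = gzip.open
    else:
        opener = open
    with opener(path, "rt", newline="") as handle:
        # Only the first row is read, so this stays fast on very large files.
        return next(csv.reader(handle, delimiter="\t"), [])


def has_column(path, column="Nextclade_pango"):
    """Report whether the TSV at `path` has a column with the given name."""
    return column in read_tsv_header(path)
```

For example, has_column("results/southern_region_recent/nextclade_qc.tsv") could be run against the file named in the failing rule, and the same call works on an ingest nextclade.tsv if one exists.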

Are you using ncov-ingest master? When we added Nextclade_pango a small change to ingest was necessary to pass that column from nextclade.tsv to metadata.tsv: Merge pull request #291 from nextstrain/feat/nextclade-pango · nextstrain/ncov-ingest@76ee829 · GitHub

pvanheus · May 3, 2022, 7:05pm

Sorry for the incorrect link. I am indeed using master of ncov-ingest. Here is the script that I run to get the latest data:

#!/bin/bash

#SBATCH -c 2
#SBATCH --mem=50G

set -e

cd /usr/people/pvh/ncov-ingest 
source $HOME/miniconda3/bin/activate 
GISAID_API_ENDPOINT=https://www.epicov.org/epi3/3p/hcov-19/export/export.json.bz2
GISAID_USERNAME_AND_PASSWORD=XXXXX:YYYYY 
export GISAID_API_ENDPOINT GISAID_USERNAME_AND_PASSWORD
# this doesn't need pipenv anymore
if [ -f data/gisaid.ndjson.new.bz2 ] ; then
  echo "Deleting old download"
  rm data/gisaid.ndjson.new*
fi

bin/fetch-from-gisaid data/gisaid.ndjson.new 

conda activate ncovingest
bin/transform-gisaid data/gisaid.ndjson.new
conda deactivate

conda activate nextstrain
augur index --sequences data/gisaid/sequences.fasta --output data/gisaid/sequences.fasta.index

The script is based on the instructions here. I don’t have a nextclade.tsv file in my data directory anywhere.

corneliusroemer · May 4, 2022, 4:23pm

Ah, I see, I think we’re getting there! I think bin/transform-gisaid just separates the FASTA headers out into a metadata.tsv file. In order to get Nextclade_pango however, you need to run nextclade on the sequences - that doesn’t seem to happen here!

So there are two options now, I think.
a) You run full ingest, including nextclade, then you can get Nextclade_pango etc.
b) You edit https://github.com/nextstrain/ncov/blob/master/scripts/join-metadata-and-clades.py in your fork so that it doesn’t expect Nextclade_pango. Specifically removing this line: https://github.com/nextstrain/ncov/blob/6ff941041e23ee7d4ab18a7e206671b403106d3b/scripts/join-metadata-and-clades.py#L21
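
For option b, a gentler alternative to deleting the line might be to skip any column that is absent. The sketch below shows that guard on plain dictionaries rather than on the pandas DataFrame the real script uses. fill_missing is a hypothetical helper, and the "?" placeholder is an assumption about what VALUE_MISSING_DATA holds. In the pandas script the equivalent test would be `if col in result.columns` before the fillna call.

```python
VALUE_MISSING_DATA = "?"  # assumed placeholder value, not confirmed from the script


def fill_missing(rows, columns, placeholder=VALUE_MISSING_DATA):
    """Fill empty values in `columns`, skipping columns that no row carries.

    Returns the list of requested columns that were skipped, so the caller
    can log a warning instead of crashing with a KeyError.
    """
    # Collect every column name that appears in at least one row.
    present = set()
    for row in rows:
        present.update(row)

    skipped = [col for col in columns if col not in present]
    for row in rows:
        for col in columns:
            if col in present and row.get(col) in (None, ""):
                row[col] = placeholder
    return skipped
```

With rows that lack Nextclade_pango entirely, the function reports that column as skipped and still fills the columns that do exist.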

I’m somewhat confused about how the join-metadata script has only started failing now. Where would it get information like "qc.snpClusters.status": "QC_snp_clusters" from if you haven’t run nextclade and so have no nextclade.tsv information to add to metadata.tsv?

dlu · May 5, 2022, 3:42pm

I thought the original error was complaining that the Nextclade_pango column does not exist in the file results/southern_region_recent/nextclade_qc.tsv. This file is generated here in line 483, with a fresh nextclade dataset fetched in every tree run.

Then join-metadata-and-clades.py will add what’s in nextclade_qc.tsv into the large metadata table, including the Nextclade_pango column. So this column doesn’t exist in metadata until this step is done.
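
To illustrate that step, here is a rough stand-alone sketch of such a join using only the standard library (the real script uses pandas). The three-entry COLUMN_MAP is an illustrative subset, join_clades is a hypothetical name, and keying the Nextclade rows on seqName is an assumption.

```python
import csv
import io

# Illustrative subset of a Nextclade-to-metadata column mapping (assumed, not
# copied from the script).
COLUMN_MAP = {
    "clade": "Nextstrain_clade",
    "Nextclade_pango": "Nextclade_pango",
    "qc.snpClusters.status": "QC_snp_clusters",
}


def join_clades(metadata_tsv, nextclade_tsv, key="strain", nextclade_key="seqName"):
    """Left-join Nextclade columns onto metadata rows.

    Returns the joined rows and the mapped source columns that the Nextclade
    table did not provide.
    """
    nextclade_reader = csv.DictReader(nextclade_tsv, delimiter="\t")
    header = nextclade_reader.fieldnames or []
    available = {src: dst for src, dst in COLUMN_MAP.items() if src in header}
    by_name = {row[nextclade_key]: row for row in nextclade_reader}

    joined = []
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        match = by_name.get(row[key], {})
        for src, dst in available.items():
            # Strains without a Nextclade row get an empty value.
            row[dst] = match.get(src, "")
        joined.append(row)

    missing = [src for src in COLUMN_MAP if src not in available]
    return joined, missing
```

If the Nextclade output comes from a software or dataset version that does not emit Nextclade_pango, the joined table simply never gains that column, which is the situation a later unconditional lookup would trip over.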

In case this helps.

pvanheus · October 5, 2022, 7:44am

Thanks for the tip @corneliusroemer. As the cluster that I use does not have Docker support, I adapted the ncov-ingest instructions to run via Singularity (after having built a Singularity image from the latest ncov-ingest one on Docker Hub):

# export so that the variables reach the singularity process
export SINGULARITYENV_GISAID_API_ENDPOINT=XXXX
export SINGULARITYENV_GISAID_USERNAME_AND_PASSWORD=XXXX:XXXX
singularity exec /tools/containers/nextstrain/ncov-ingest.sif snakemake --configfile config/local_gisaid.yaml

It took a very long time to run the Nextclade step, but I presume this can be sped up by giving snakemake the --cores option, since the “run_nextclade” rule uses up to 64 threads.
