My ncov builds have recently started failing with:
```
Error in rule join_metadata_and_nextclade_qc:
    log: logs/join_metadata_and_nextclade_qc_southern_region_recent.txt (check log file(s) for error message)
    python3 scripts/join-metadata-and-clades.py results/southern_region_recent/southern_region_recent_subsampled_metadata.tsv.xz results/southern_region_recent/nextclade_qc.tsv -o results/southern_region_recent/metadata_with_nextclade_qc.tsv 2>&1 | tee logs/join_metadata_and_nextclade_qc_southern_region_recent.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
cluster_jobid: Submitted batch job 359420
```
```
Traceback (most recent call last):
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/join-metadata-and-clades.py", line 150, in <module>
  File "scripts/join-metadata-and-clades.py", line 140, in main
    result[col] = result[col].fillna(VALUE_MISSING_DATA)
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'Nextclade_pango'
```
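(For context, not from the thread itself:) the traceback is the generic pandas failure mode of indexing a column that was never created. A minimal sketch, assuming a toy two-column frame, reproduces it:

```python
import pandas as pd

# Hypothetical miniature of the joined table; note that Nextclade_pango is absent.
result = pd.DataFrame({"strain": ["A/1", "B/2"],
                       "Nextstrain_clade": ["21K", None]})

VALUE_MISSING_DATA = "?"
missing = []
for col in ["Nextstrain_clade", "Nextclade_pango"]:
    try:
        # result[col] raises KeyError when the column doesn't exist,
        # which is exactly the failure chain in the traceback above.
        result[col] = result[col].fillna(VALUE_MISSING_DATA)
    except KeyError:
        missing.append(col)

print(missing)
```

So the error itself only says "a column the script expects is not in its input"; the interesting question is which upstream step was supposed to produce it.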
The complete log is here and the command run is `snakemake --profile my_profiles/africa_recent -p`. The build in question is this one. I get the data from ncov-ingest and then filter by date with `augur filter` - none of that has changed on my side recently, so I’m not sure where this error is coming from.
`Nextclade_pango` is not in your nextclade output. Are you using the latest nextclade dataset? We added this about two months ago. Could you check whether the nextclade input file has the column `Nextclade_pango`?
In addition to what Richard said, the other possibility besides not using the latest nextclade dataset is not using the latest Nextclade CLI version. To get `Nextclade_pango` you need both: recent software and a recent dataset.
My environment is set up as per `workflow/envs/nextstrain.yaml` so I think I’ve got the latest versions of everything (see output of `conda list` here).
The input data is from ncov-ingest, filtered with `augur filter`, so the fields in the metadata file are:

```
strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy
```

so indeed `Nextclade_pango` is not there. Reading the script that failed, it seems it expects the metadata file to have this column - at the point of failure this file contains the columns:

```
strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex Nextstrain_clade pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy missing_data divergence nonACGTN rare_mutations QC_missing_data QC_mixed_sites QC_rare_mutations QC_snp_clusters QC_frame_shifts QC_stop_codons frame_shifts insertions substitutions aa_substitutions clock_deviation global-open africa_recent
```

thus no `Nextclade_pango`. What step is meant to be adding this column?
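(An aside, not from the thread:) for debugging a case like this, a small helper to print the header of any of the intermediate TSVs, compressed or not, makes it easy to see at which stage a column appears or goes missing. A sketch using only the standard library:

```python
import lzma

def tsv_columns(path):
    """Return the column names from the first line of a TSV file,
    transparently handling .xz-compressed files like the subsampled metadata."""
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt") as fh:
        return fh.readline().rstrip("\n").split("\t")

# Example usage (paths taken from the thread; adjust to your build):
# cols = tsv_columns("results/southern_region_recent/nextclade_qc.tsv")
# print("Nextclade_pango" in cols)
```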
Thanks for the extra info @pvanheus!
I think that link might be incorrect, as it points at Richard’s comment instead of some list of packages: “(see output of `conda list` here).”
The problem looks to be in the ncov-ingest part of your pipeline then. It’s not enough just for the software to be up to date - you also need a recent sars-cov-2 dataset to get that column. It could be that you have an outdated dataset; see the docs here: Nextclade datasets — Nextclade documentation
Whenever there’s a new dataset, we rerun Nextclade on all of the GISAID data with the new Nextclade dataset. ncov-ingest downloads the most recent dataset as part of the pipeline, but I don’t know exactly how you’re using ingest. Ingest should output a file called `nextclade.tsv`. To narrow down the location of the problem, can you locate that file and check its columns? It should contain `Nextclade_pango`.
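(Not from the thread:) the suggested check can be scripted; a minimal sketch, assuming a plain tab-separated `nextclade.tsv`:

```python
import csv

def has_nextclade_pango(path):
    """Return True if the TSV at `path` has a Nextclade_pango column,
    i.e. whether this ingest output would satisfy ncov's join step."""
    with open(path, newline="") as fh:
        header = next(csv.reader(fh, delimiter="\t"))
    return "Nextclade_pango" in header
```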
Are you using ncov-ingest master? When we added `Nextclade_pango`, a small change to ingest was necessary to pass that column into `metadata.tsv`: Merge pull request #291 from nextstrain/feat/nextclade-pango · nextstrain/ncov-ingest@76ee829 · GitHub
Sorry for the incorrect link. I am indeed using master of ncov-ingest. Here is the script that I run to get the latest data:

```sh
#SBATCH -c 2
export GISAID_API_ENDPOINT GISAID_USERNAME_AND_PASSWORD
# this doesn't need pipenv anymore
if [ -f data/gisaid.ndjson.new.bz2 ] ; then
    echo "Deleting old download"
conda activate ncovingest
conda activate nextstrain
augur index --sequences data/gisaid/sequences.fasta --output data/gisaid/sequences.fasta.index
```
The script is based on the instructions here. I don’t have a `nextclade.tsv` file in my data directory anywhere.
Ah, I see, I think we’re getting there! I think `bin/transform-gisaid` just separates the FASTA headers out into a `metadata.tsv` file. In order to get `Nextclade_pango`, however, you need to run nextclade on the sequences - that doesn’t seem to happen here!
So there are two options now, I think.

a) You run full ingest, including nextclade, then you can get `Nextclade_pango`.

b) You edit ncov/join-metadata-and-clades.py at master · nextstrain/ncov · GitHub in your fork so that it doesn’t expect `Nextclade_pango`. Specifically, remove this line: ncov/join-metadata-and-clades.py at 6ff941041e23ee7d4ab18a7e206671b403106d3b · nextstrain/ncov · GitHub
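(Not from the thread:) a gentler variant of option b, sketched here as a hypothetical local patch rather than the project’s actual code, is to make the fill step tolerant of columns that are absent instead of deleting the expectation outright:

```python
import pandas as pd

VALUE_MISSING_DATA = "?"

def fill_missing(result, columns):
    """Defensive variant of the failing loop: fill NA values with a
    placeholder, but only for columns that actually exist in the frame."""
    for col in columns:
        if col in result.columns:
            result[col] = result[col].fillna(VALUE_MISSING_DATA)
    return result

# Toy frame lacking Nextclade_pango, as in the failing build:
demo = pd.DataFrame({"Nextstrain_clade": ["21K", None]})
demo = fill_missing(demo, ["Nextstrain_clade", "Nextclade_pango"])
print(demo["Nextstrain_clade"].tolist())
```

The trade-off is that this silently hides a genuinely missing upstream column, so option a (running full ingest) is the cleaner fix.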
I’m somewhat confused about how the join-metadata script has only started failing now. Where would it get information like `"qc.snpClusters.status": "QC_snp_clusters",` from if you haven’t run nextclade and hence don’t have the `nextclade.tsv` information that gets added to the metadata?
I thought the original error is complaining about the `Nextclade_pango` column not existing in the file `results/southern_region_recent/nextclade_qc.tsv`. This file is generated here at line 483, with a fresh `nextclade dataset get` on every run.
`join-metadata-and-clades.py` will add what’s in `nextclade_qc.tsv` to the large metadata table, including the `Nextclade_pango` column. So this column doesn’t exist in the metadata until this step is done.
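(An illustration, not the actual script:) the join described above amounts to a left merge of the QC table onto the metadata, followed by the fill step. A simplified sketch, assuming the join key is the strain name:

```python
import pandas as pd

VALUE_MISSING_DATA = "?"

# Toy stand-in for the two inputs of join-metadata-and-clades.py:
metadata = pd.DataFrame({"strain": ["A/1", "B/2"],
                         "date": ["2022-01-01", "2022-01-02"]})
nextclade_qc = pd.DataFrame({"strain": ["A/1"],
                             "Nextclade_pango": ["BA.1"]})

# Left merge keeps every metadata row; strains nextclade didn't cover get NA,
# which the fill step then replaces with the missing-data placeholder.
result = metadata.merge(nextclade_qc, on="strain", how="left")
result["Nextclade_pango"] = result["Nextclade_pango"].fillna(VALUE_MISSING_DATA)
print(result["Nextclade_pango"].tolist())
```

This also shows why the column cannot exist before this rule runs: it is created by the merge itself.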
In case this helps.