My ncov builds have recently started failing with:
Error in rule join_metadata_and_nextclade_qc:
jobid: 83
output: results/southern_region_recent/metadata_with_nextclade_qc.tsv
log: logs/join_metadata_and_nextclade_qc_southern_region_recent.txt (check log file(s) for error message)
shell:
python3 scripts/join-metadata-and-clades.py results/southern_region_recent/southern_region_recent_subsampled_metadata.tsv.xz results/southern_region_recent/nextclade_qc.tsv -o results/southern_region_recent/metadata_with_nextclade_qc.tsv 2>&1 | tee logs/join_metadata_and_nextclade_qc_southern_region_recent.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
cluster_jobid: Submitted batch job 359420
Logfile logs/join_metadata_and_nextclade_qc_southern_region_recent.txt:
Traceback (most recent call last):
File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Nextclade_pango'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scripts/join-metadata-and-clades.py", line 150, in <module>
main()
File "scripts/join-metadata-and-clades.py", line 140, in main
result[col] = result[col].fillna(VALUE_MISSING_DATA)
File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/usr/people/pvh/miniconda3/envs/nextstrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 'Nextclade_pango'
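The KeyError comes from pandas column lookup: `fillna` is being called on a column that was never created in the joined frame. A minimal reproduction, plus a defensive variant that creates the column instead of crashing (the sentinel value and the column list here are made up for illustration, not taken from the actual script):

```python
import pandas as pd

VALUE_MISSING_DATA = "?"  # illustrative placeholder; the script's real sentinel may differ

# A joined frame that, like the failing build, lacks the Nextclade_pango column.
result = pd.DataFrame({"strain": ["A", "B"], "Nextstrain_clade": ["20A", None]})

for col in ["Nextstrain_clade", "Nextclade_pango"]:
    if col not in result.columns:
        # Defensive variant: create the missing column rather than raising KeyError.
        result[col] = VALUE_MISSING_DATA
    else:
        result[col] = result[col].fillna(VALUE_MISSING_DATA)
```

Calling `result["Nextclade_pango"].fillna(...)` directly, as the script does, triggers exactly the `__getitem__` → `get_loc` → `KeyError` chain in the traceback above.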
The complete log is here and the command run is snakemake --profile my_profiles/africa_recent -p. The build in question is this one.
I get the data from ncov-ingest and then filter by date with augur filter - none of that has changed on my side recently, so I’m not sure where this error is coming from.
I guess Nextclade_pango is not in your nextclade output. Are you using the latest nextclade dataset? We added this column about 2 months ago. Could you check whether the nextclade input file has the column Nextclade_pango?
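A quick way to check, reading only the header row so even a large (possibly xz-compressed) TSV is inspected instantly (the function name and paths are illustrative):

```python
import csv
import lzma

def has_column(path, column, delimiter="\t"):
    """Read just the header row of a TSV (xz-compressed or plain) and check for a column."""
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt") as fh:
        header = next(csv.reader(fh, delimiter=delimiter))
    return column in header

# e.g. has_column("results/southern_region_recent/nextclade_qc.tsv", "Nextclade_pango")
```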
My environment is set up as per workflow/envs/nextstrain.yaml, so I think I’ve got the latest versions of everything (see output of conda list here).
The input data is from ncov-ingest, filtered with augur filter, so the fields in the metadata file are: strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy. So indeed, Nextclade_pango is not there. Reading the script that failed, it seems to expect the metadata file to have this column. At the point of failure this file contains the columns: strain virus gisaid_epi_isl genbank_accession sra_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex Nextstrain_clade pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted sampling_strategy missing_data divergence nonACGTN rare_mutations QC_missing_data QC_mixed_sites QC_rare_mutations QC_snp_clusters QC_frame_shifts QC_stop_codons frame_shifts insertions substitutions aa_substitutions clock_deviation global-open africa_recent, thus not Nextclade_pango.
The problem looks to be in the ncov-ingest part of your pipeline then. It’s not enough just for the software to be up to date - you also need a recent sars-cov-2 dataset to get that Nextclade_pango column.
Whenever there’s a new dataset, we rerun Nextclade on all of the GISAID data with the new Nextclade dataset.
Theoretically, ncov-ingest downloads the most recent dataset as part of the pipeline, but I don’t know how exactly you’re using ingest.
Ingest should output a file called nextclade.tsv. To narrow down the location of the problem, can you locate that file and check its columns? It should contain Nextclade_pango.
Ah, I see, I think we’re getting there! I think bin/transform-gisaid just separates the FASTA headers out into a metadata.tsv file. To get Nextclade_pango, however, you need to run nextclade on the sequences, and that doesn’t seem to happen here!
I’m somewhat confused about how the join-metadata script has only started failing now. Where would it get information like "qc.snpClusters.status": "QC_snp_clusters" from, if you haven’t run nextclade and thus don’t have the nextclade.tsv information that gets added to metadata.tsv?
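That "qc.snpClusters.status": "QC_snp_clusters" pair looks like an entry from a column rename table. A sketch of how such a table is typically applied with pandas (the mapping shown is an illustrative subset, not the script’s actual table):

```python
import pandas as pd

# Illustrative subset of the kind of rename table suggested by the entry quoted above:
# nextclade's dotted QC column names mapped to the metadata's flat names.
RENAME_MAP = {
    "qc.snpClusters.status": "QC_snp_clusters",
    "qc.missingData.status": "QC_missing_data",
}

nextclade_qc = pd.DataFrame({
    "seqName": ["A"],
    "qc.snpClusters.status": ["good"],
    "qc.missingData.status": ["mediocre"],
})
nextclade_qc = nextclade_qc.rename(columns=RENAME_MAP)
```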
I thought the original error is complaining that the Nextclade_pango column does not exist in the file results/southern_region_recent/nextclade_qc.tsv. This file is generated here in line 483, with a fresh nextclade dataset fetched in every tree run.
Then join-metadata-and-clades.py adds what’s in nextclade_qc.tsv to the large metadata table, including the Nextclade_pango column. So this column doesn’t exist in the metadata until this step is done.
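In other words, the column only appears after the metadata is joined against nextclade_qc.tsv. Roughly (the key columns and a left join are assumptions about how the script works, not taken from it):

```python
import pandas as pd

metadata = pd.DataFrame({"strain": ["A", "B"], "date": ["2022-01-01", "2022-01-02"]})
nextclade_qc = pd.DataFrame({"seqName": ["A"], "Nextclade_pango": ["BA.2"]})

# Left join keeps every metadata row; strains without nextclade output get NaN,
# which a later step would fill with the script's missing-data placeholder.
result = metadata.merge(
    nextclade_qc, how="left", left_on="strain", right_on="seqName"
)
```

If nextclade_qc.tsv itself lacks Nextclade_pango, the merged frame lacks it too, and the later fillna step fails exactly as in the traceback.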
Thanks for the tip @corneliusroemer. As the cluster that I use does not have Docker support, I adapted the ncov-ingest instructions to run via Singularity (after having built a Singularity image from the latest ncov-ingest one on Docker Hub). It took a very long time to run the Nextclade step, but I presume that this can be sped up by giving snakemake the --cores option, since the “run_nextclade” rule uses up to 64 threads.