Error in combining ndjson-files in mpox pipeline

Hi,
I’m trying to use the steps listed in the “ingest” pipeline for Mpox. I have my own FASTA files that I managed to convert to an NDJSON file using the “fasta-to-ndjson” script. However, after combining the GenBank dataset with my own dataset, I get an error from what I think is the “transform-field-names” script, but it’s difficult to understand exactly why it fails:

Error in rule curate:
    jobid: 1
    input: data/sequences.ndjson, data/all-geolocation-rules.tsv, defaults/annotations.tsv
    output: data/metadata_raw.tsv, results/sequences.fasta
    log: logs/curate.txt (check log file(s) for error details)
    shell:
        
        (cat data/sequences.ndjson \
            | ./vendored/transform-field-names \
                --field-map "collected"="date" "submitted"="date_submitted" "genbank_accession"="accession" "submitting_organization"="institution" \
            | augur curate normalize-strings \
            | ./vendored/transform-strain-names \
                --strain-regex ^.+$ \
                --backup-fields accession \
            | augur curate format-dates \
                --date-fields date date_submitted \
                --expected-date-formats %Y %Y-%m %Y-%m-%d %Y-%m-%dT%H:%M:%SZ \
            | ./vendored/transform-genbank-location \
            | augur curate titlecase \
                --titlecase-fields region country division location \
                --articles and d de del des di do en l la las le los nad of op sur the y \
                --abbreviations USA \
            | ./vendored/transform-authors \
                --authors-field authors \
                --default-value ? \
                --abbr-authors-field abbr_authors \
            | ./vendored/apply-geolocation-rules \
                --geolocation-rules data/all-geolocation-rules.tsv \
            | ./vendored/merge-user-metadata \
                --annotations defaults/annotations.tsv \
                --id-field accession \
            | ./bin/ndjson-to-tsv-and-fasta \
                --metadata-columns accession genbank_accession_rev strain date region country division location host date_submitted sra_accession abbr_authors reverse authors institution \
                --metadata data/metadata_raw.tsv \
                --fasta results/sequences.fasta \
                --id-field accession \
                --sequence-field sequence ) 2>> logs/curate.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job curate since they might be corrupted:
data/metadata_raw.tsv, results/sequences.fasta
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-19T120905.859379.snakemake.log

My sequences in NDJSON format look like this:

{"strain":"202401196","reference":"NC_063383.1","location":"Norway","collected":"2024-01-01","sequence":"ATTTTACTATTTTATTTAG...."

Hi @jonr,

The log file should include more error details. Can you share the logs/curate.txt file?

In general, the method to add your own sequences to the dataset only works if your NDJSON includes all of the fields in the GenBank dataset. You can see all of the fields in the GenBank dataset by inspecting the data/genbank.ndjson file generated by the ingest workflow. For example, you can run the following within the mpox repository:

$ nextstrain build ingest data/genbank.ndjson
[...wait for workflow to complete...]
$ head -n 1 ingest/data/genbank.ndjson | jq 'keys_unsorted'
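
If some of those keys are missing from your own NDJSON, that is the most likely reason the curate rule fails. As a rough sketch (assuming your own records live in data/own.ndjson; adjust the path to wherever your file actually is), you can compare the two key sets with jq:

$ # data/own.ndjson is an assumed path for your own records
$ genbank_keys="$(head -n 1 ingest/data/genbank.ndjson | jq -c 'keys_unsorted')"
$ own_keys="$(head -n 1 data/own.ndjson | jq -c 'keys_unsorted')"
$ jq -n --argjson genbank "$genbank_keys" --argjson own "$own_keys" '$genbank - $own'

Any field names printed by the last command are missing from your records; adding them (empty strings should usually be enough) before concatenating with the GenBank NDJSON should let the curate steps run.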

Thanks for your answer.
I see. I was manipulating the FASTA headers to include only the collection date and country. But I think it’s actually easier to create a separate metadata.tsv file for my sequences and join that with the NCBI data, something like the sketch below. Is the dataset downloaded from NCBI via the ingest pipeline the same as what can be downloaded from this link in the phylogenetic pipeline?
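
Roughly what I have in mind (untested, and assuming my own metadata.tsv has the same columns in the same order as the NCBI metadata; the file names below are just placeholders):

$ # ncbi_metadata.tsv and my_metadata.tsv are placeholder names
$ cp ncbi_metadata.tsv merged_metadata.tsv
$ tail -n +2 my_metadata.tsv >> merged_metadata.tsv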

Yes, the final outputs of the ingest workflow are available at that link.

Thanks! I got this working now :grinning: