Error in combining ndjson-files in mpox pipeline

Hi,
I’m trying to use the steps listed in the “ingest” pipeline for Mpox. I have my own FASTA files that I managed to convert to an NDJSON file using the “fasta-to-ndjson” script. However, after combining the GenBank dataset with my own dataset, I get an error from what I think is the “transform-field-names” script, but it’s difficult to understand exactly why it fails:

Error in rule curate:
    jobid: 1
    input: data/sequences.ndjson, data/all-geolocation-rules.tsv, defaults/annotations.tsv
    output: data/metadata_raw.tsv, results/sequences.fasta
    log: logs/curate.txt (check log file(s) for error details)
    shell:
        
        (cat data/sequences.ndjson \
            | ./vendored/transform-field-names \
                --field-map "collected"="date" "submitted"="date_submitted" "genbank_accession"="accession" "submitting_organization"="institution" \
            | augur curate normalize-strings \
            | ./vendored/transform-strain-names \
                --strain-regex ^.+$ \
                --backup-fields accession \
            | augur curate format-dates \
                --date-fields date date_submitted \
                --expected-date-formats %Y %Y-%m %Y-%m-%d %Y-%m-%dT%H:%M:%SZ \
            | ./vendored/transform-genbank-location \
            | augur curate titlecase \
                --titlecase-fields region country division location \
                --articles and d de del des di do en l la las le los nad of op sur the y \
                --abbreviations USA \
            | ./vendored/transform-authors \
                --authors-field authors \
                --default-value ? \
                --abbr-authors-field abbr_authors \
            | ./vendored/apply-geolocation-rules \
                --geolocation-rules data/all-geolocation-rules.tsv \
            | ./vendored/merge-user-metadata \
                --annotations defaults/annotations.tsv \
                --id-field accession \
            | ./bin/ndjson-to-tsv-and-fasta \
                --metadata-columns accession genbank_accession_rev strain date region country division location host date_submitted sra_accession abbr_authors reverse authors institution \
                --metadata data/metadata_raw.tsv \
                --fasta results/sequences.fasta \
                --id-field accession \
                --sequence-field sequence ) 2>> logs/curate.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job curate since they might be corrupted:
data/metadata_raw.tsv, results/sequences.fasta
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-19T120905.859379.snakemake.log

My sequences in NDJSON format look like this:

{"strain":"202401196","reference":"NC_063383.1","location":"Norway","collected":"2024-01-01","sequence":"ATTTTACTATTTTATTTAG...."

Hi @jonr,

The log file should include more error details. Can you share the logs/curate.txt file?

In general, the method to add your own sequences to the dataset only works if your NDJSON includes all of the fields in the GenBank dataset. You can see all of the fields in the GenBank dataset by inspecting the data/genbank.ndjson file generated by the ingest workflow. For example, you can run the following within the mpox repository:

$ nextstrain build ingest data/genbank.ndjson
[...wait for workflow to complete...]
$ head -n 1 ingest/data/genbank.ndjson | jq 'keys_unsorted'
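
If some of those keys are missing from your own NDJSON, that is the most likely reason the curate rule fails. As a rough sketch (assuming your own records live in data/own.ndjson; adjust the path to wherever your file actually is), you can compare the two key sets with jq:

$ # data/own.ndjson is an assumed path for your own records
$ genbank_keys="$(head -n 1 ingest/data/genbank.ndjson | jq -c 'keys_unsorted')"
$ own_keys="$(head -n 1 data/own.ndjson | jq -c 'keys_unsorted')"
$ jq -n --argjson genbank "$genbank_keys" --argjson own "$own_keys" '$genbank - $own'

Any field names printed by the last command are missing from your records; adding them (empty strings should usually be enough) before concatenating with the GenBank NDJSON should let the curate steps run.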

Thanks for your answer.
I see. I was manipulating the FASTA headers to include only the collection date and country. But I think it’s actually easier to create a separate metadata.tsv file for my sequences and join that with the NCBI data, something like the sketch below. Is the dataset downloaded from NCBI via the ingest pipeline the same as what can be downloaded from this link in the phylogenetic pipeline?
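
Roughly what I have in mind (untested, and assuming my own metadata.tsv has the same columns in the same order as the NCBI metadata; the file names below are just placeholders):

$ # ncbi_metadata.tsv and my_metadata.tsv are placeholder names
$ cp ncbi_metadata.tsv merged_metadata.tsv
$ tail -n +2 my_metadata.tsv >> merged_metadata.tsv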

Yes, the final outputs of the ingest workflow are available at that link.

Thanks! I got this working now :grinning: