Ingest workflow: NCBI Datasets, Entrez, or something else?

Hello! I am very new to Nextstrain and looking for some guidance/recommendations regarding the development of an ingest workflow. I am ultimately looking to create a Nextclade dataset of a new viral pathogen (a reovirus).

Here’s my conundrum: If I use the NCBI Datasets CLI tool to download viral sequences and associated metadata, I’m missing important data from fields that NCBI Datasets does not parse (segment, references, strain, CDS note, etc.). However, the alternative, the NCBI Entrez tool, looks quite a bit more complicated, given that I would have to write my own script to parse the GenBank file into a flat JSON Lines/NDJSON file. My other thought was to use the NCBI Datasets tool and then merge in the missing metadata later with the ingest/vendored/merge-user-metadata script. However, I would have to manually curate the metadata before merging, and this would get tedious since I hope to rerun the ingest workflow periodically to update the dataset.

Does anyone have any suggestions for the best approach to take? Any help would be much appreciated. Thanks so much for reading!

Hi @eam,

I ran into the exact same issue when ingesting avian-flu H5N1 sequences: the metadata did not include segment. I ended up using a combination of NCBI Datasets and NCBI Virus. You can find the workflow at avian-flu/ingest/build-configs/ncbi/rules/fetch_from_ncbi.smk (master branch of nextstrain/avian-flu on GitHub).
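
To illustrate the general idea of combining two sources, here is a minimal Python sketch. It is not the code from the linked workflow. It assumes you already have NDJSON records from one source and a tab-separated table from a second source that carries the missing fields, and that both share an `accession` column. The function name `merge_extra_fields`, the column names, and the example accessions are all made up for illustration.

```python
import csv
import io
import json


def merge_extra_fields(ndjson_lines, extra_tsv, key="accession"):
    """Yield NDJSON records with extra columns joined on by `key`.

    Hypothetical helper: `ndjson_lines` and `extra_tsv` are any
    iterables of text lines (for example, open file handles).
    """
    # Index the supplementary table by accession for fast lookup.
    extra = {row[key]: row for row in csv.DictReader(extra_tsv, delimiter="\t")}
    for line in ndjson_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Only fill fields that the primary record lacks or left empty,
        # so values from the primary source are never overwritten.
        for field, value in extra.get(record.get(key), {}).items():
            if not record.get(field):
                record[field] = value
        yield record


# Made-up example data standing in for the two downloads.
primary_ndjson = io.StringIO(
    '{"accession": "AB000001", "strain": ""}\n'
    '{"accession": "AB000002", "strain": "X"}\n'
)
secondary_tsv = io.StringIO(
    "accession\tsegment\tstrain\n"
    "AB000001\tS1\tA/example\n"
    "AB000002\tL2\tY\n"
)

merged = list(merge_extra_fields(primary_ndjson, secondary_tsv))
for record in merged:
    print(json.dumps(record))
```

Because the join is keyed on accession and runs without manual curation, a step like this can be rerun each time the ingest workflow updates.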

Best,
Jover
