Ingest workflow: NCBI Datasets, Entrez, or something else?

eam · June 28, 2024, 3:23pm

Hello! I am very new to Nextstrain and looking for some guidance/recommendations regarding the development of an ingest workflow. I am ultimately looking to create a Nextclade dataset of a new viral pathogen (a reovirus).

Here’s my conundrum: If I use the NCBI Datasets CLI tool to download viral sequences and associated metadata, I’m missing important data from fields that NCBI Datasets does not parse (segment, references, strain, CDS note, etc.). The alternative, NCBI Entrez, looks quite a bit more complicated, since I would have to write my own script to parse the GenBank file into a flat JSON Lines/NDJSON file. My other thought was to use NCBI Datasets and then merge in the missing metadata later with the ingest/vendored/merge-user-metadata script. However, I would have to manually curate the metadata before merging, and this would get tedious since I hope to rerun the ingest workflow periodically to update the dataset.

Does anyone have any suggestions of the best approach to take? Any help would be much appreciated. Thanks so much for reading!
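
For reference, the Entrez route described above (parsing a GenBank flat file into flat NDJSON) might look roughly like the sketch below. It is a minimal illustration only, not part of any Nextstrain workflow: the function name `genbank_source_records`, the sample record, and the accession `XX000001.1` are all made up, it reads only the qualifiers of the `source` feature (so the CDS note is not captured), and a repeated qualifier such as `db_xref` keeps only its last value. A real script would more likely use an established GenBank parser.

```python
import json
import re

# Qualifier lines are indented 21 columns: /key="value" or a bare /flag.
QUALIFIER = re.compile(r'^ {21}/(\w+)(?:=(.*))?$')


def genbank_source_records(text):
    """Yield one flat dict per GenBank record: accession plus source qualifiers."""
    record, section, feature, key = {}, None, None, None
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            if record:
                yield record
            record, section, feature, key = {}, None, None, None
        elif line[:1].strip():  # top-level keyword starting in column 1
            section, feature, key = line.split()[0], None, None
            if section == "VERSION":
                record["accession"] = line.split()[1]
        elif section == "FEATURES":
            if line[5:6].strip():  # feature keys start in column 6
                feature, key = line.split()[0], None
            elif feature == "source":
                match = QUALIFIER.match(line)
                if match and match.group(2) is None:
                    record[match.group(1)] = True  # bare flag, no value
                    key = None
                elif match:
                    key = match.group(1)
                    record[key] = match.group(2).strip('"')
                elif key and line.strip():
                    # continuation of a value wrapped onto the next line
                    record[key] = (record[key] + " " + line.strip()).rstrip('"')


# Made-up record, for illustration only (not a real accession).
SAMPLE = "\n".join([
    "LOCUS       XX000001                1141 bp    RNA     linear   VRL 01-JAN-2020",
    "VERSION     XX000001.1",
    "FEATURES             Location/Qualifiers",
    "     source          1..1141",
    " " * 21 + '/organism="Example reovirus"',
    " " * 21 + '/strain="EX-1"',
    " " * 21 + '/segment="S1"',
    "     CDS             10..1100",
    " " * 21 + '/note="hypothetical protein"',
    "ORIGIN",
    "        1 atgc",
    "//",
])

# One JSON object per line (NDJSON).
for record in genbank_source_records(SAMPLE):
    print(json.dumps(record))
```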

nextstrain-team-bot · June 28, 2024, 8:10pm

Hi @eam,

I ran into the exact same issue when ingesting avian-flu H5N1 sequences: the metadata did not include the segment. I ended up using a combination of NCBI Datasets and NCBI Virus. You can find the workflow at avian-flu/ingest/build-configs/ncbi/rules/fetch_from_ncbi.smk on the master branch of nextstrain/avian-flu on GitHub.

Best,
Jover
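
To illustrate the general idea of combining two metadata sources, here is a minimal sketch that joins two tab-separated tables on accession and fills in fields the primary table lacks. It is not the avian-flu workflow itself: the function name `merge_by_accession`, the column names (`accession`, `strain`, `date`, `segment`), and the sample rows are all assumptions made up for the example.

```python
import csv
import io
import json


def merge_by_accession(primary_tsv, extra_tsv, key="accession"):
    """Yield primary rows, filling missing or empty fields from the extra table."""
    extra = {
        row[key]: row
        for row in csv.DictReader(io.StringIO(extra_tsv), delimiter="\t")
    }
    for row in csv.DictReader(io.StringIO(primary_tsv), delimiter="\t"):
        merged = dict(row)
        # Primary values win; the extra table only fills gaps.
        for field, value in extra.get(row[key], {}).items():
            if not merged.get(field):
                merged[field] = value
        yield merged


# Made-up tables standing in for the two downloads (hypothetical columns).
DATASETS_TSV = (
    "accession\tstrain\tdate\n"
    "XX000001.1\t\t2020-01-01\n"
    "XX000002.1\tEX-2\t2021-05-05\n"
)
VIRUS_TSV = (
    "accession\tstrain\tsegment\n"
    "XX000001.1\tEX-1\tS1\n"
)

# One JSON object per line (NDJSON).
for row in merge_by_accession(DATASETS_TSV, VIRUS_TSV):
    print(json.dumps(row))
```

Because the join is keyed on accession and needs no hand curation, it can be rerun every time the ingest workflow is updated.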
