Hello! I am very new to Nextstrain and am looking for guidance on developing an ingest workflow. My ultimate goal is to create a Nextclade dataset for a new viral pathogen (a reovirus).
Here’s my conundrum: if I use the NCBI Datasets CLI tool to download viral sequences and their associated metadata, I lose important fields that NCBI Datasets does not parse (segment, references, strain, CDS note, etc.). The alternative, the NCBI Entrez tool, looks quite a bit more complicated, since I would have to write my own script to parse the GenBank file into a flat JSON Lines/NDJSON file. My other thought was to use the NCBI Datasets tool and then merge in the missing metadata later with the ingest/vendored/merge-user-metadata script. However, I would have to manually curate that metadata before merging, which would get tedious because I hope to rerun the ingest workflow periodically to update the dataset.
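To make the Entrez route concrete, here is a minimal sketch of the kind of parser I think I would need to write. It is only an illustration under several assumptions: the names `parse_genbank`, `to_ndjson`, and `WANTED` are my own, the sample record is made up, and it uses only the Python standard library with regular expressions rather than a proper GenBank parser such as Biopython. It also simply takes the first matching qualifier anywhere in a record, which may not be the right one for every file.

```python
import json
import re

# Hypothetical choice of qualifiers to keep; extend as needed.
WANTED = ("segment", "strain")


def parse_genbank(text):
    """Yield one flat dict per record in a GenBank flat-file string."""
    # Records are terminated by a line containing only "//".
    for raw in text.split("\n//"):
        accession = re.search(r"^ACCESSION\s+(\S+)", raw, re.MULTILINE)
        if accession is None:
            continue  # skip trailing whitespace or chunks without an accession
        record = {"accession": accession.group(1)}
        for key in WANTED:
            # Assumption: the first /key="value" in the record is the one on
            # the source feature. Values may wrap across lines, so collapse
            # internal whitespace after matching.
            match = re.search(r'/%s="([^"]*)"' % re.escape(key), raw)
            record[key] = " ".join(match.group(1).split()) if match else ""
        yield record


def to_ndjson(records):
    """Serialize an iterable of dicts as NDJSON (one JSON object per line)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Made-up record for illustration only; not a real accession.
SAMPLE = """LOCUS       XX000001                1200 bp    RNA     linear   VRL
ACCESSION   XX000001
FEATURES             Location/Qualifiers
     source          1..1200
                     /organism="Example reovirus"
                     /strain="X1"
                     /segment="S1"
//
"""

if __name__ == "__main__":
    print(to_ndjson(parse_genbank(SAMPLE)))
```

Something like this could in principle run on every ingest without manual curation, but it would need extending to cover references and CDS notes, and I am not sure how robust a regex approach is against real GenBank records.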
Does anyone have suggestions on the best approach to take? Any help would be much appreciated. Thanks so much for reading!