Metadata.tsv: building custom nextclade dataset

Hi! I’m in the process of building a custom dataset in the nextclade_data repo, and have now reached the point of tree building with the Snakefile provided in example-workflow. I want to hand-select the genomes that go into the tree (since they have varying taxon IDs in NCBI), and I’m not sure how to construct the metadata.tsv file to accompany the sequences. Is there a general script or workflow people have used to create this metadata.tsv file? It seems like an awful lot of work to do manually, at least if there is an alternative already in place.

I haven’t done this myself, but I believe the Nextstrain group creates an ingest subdirectory in every build, with a fairly standardized workflow. One example is mumps/ingest, whose README.md refers to a vendored/README.md. The basic idea is that the repo nextstrain/shared (aka nextstrain/ingest) provides shared scripts that are used with some config files (like mumps/ingest/defaults/config.yaml for mumps). That approach makes a lot of sense if you are going to maintain a bunch of different virus builds, as Nextstrain does. There’s a bit of a learning curve, but it’s still probably less work than constructing the file manually!

And now for the shameless plug: I am working on a tool called viral_usher that automates downloading sequences and metadata from GenBank and building a phylogenetic tree with UShER. If you already have a recent version of Python and Docker, it’s easy to install (pip install viral_usher) and easy to run. First run viral_usher init and follow the prompts, starting with the name of the species you’re working on. If the species-search results don’t encompass your taxa, you might have to resort to supplying a higher-level Taxonomy ID on the command line. The init step will then suggest a viral_usher build --config ... command that downloads sequences and metadata and builds a tree.

You probably won’t have much use for the UShER tree generated by viral_usher, since you are already building your own tree. As a byproduct, though, viral_usher creates two files that you can use as Nextstrain/Augur inputs (starting with the augur align step) with just a little preprocessing: genbank.fasta.xz and metadata.tsv.gz. Then again, you already have your own FASTA of hand-selected GenBank sequences, so metadata.tsv.gz is probably the only output file that you need. You’d want to grep for your hand-picked accessions and add -XX suffixes to any incomplete dates (for example, 2020-03 becomes 2020-03-XX).

If you would like to give viral_usher a try, let me know and I’d be glad to answer any questions and provide grep and awk commands to select lines of metadata and add the -XX date suffixes.
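A minimal sketch of the selection and date-padding steps described above, not the viral_usher author’s own commands. It assumes the decompressed metadata is a tab-separated file with a header row containing columns literally named accession and date (check your header and adjust the names), that dates are written as YYYY, YYYY-MM, or YYYY-MM-DD, and that your hand-picked accessions sit one per line in a non-empty file. The function names select_accessions and pad_dates and the file name include.txt are hypothetical.

```shell
# Keep the header plus every row whose "accession" column is listed in the
# file given as $1 (one accession per line). Metadata is read from stdin.
# Assumption: the list file is non-empty and the column is named "accession".
select_accessions() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { keep[$1] = 1; next }   # first input: the accession list
    FNR == 1  {                        # second input: locate the column, keep header
      for (i = 1; i <= NF; i++) if ($i == "accession") col = i
      print; next
    }
    col && ($col in keep)              # print only the listed accessions
  ' "$1" -
}

# Pad incomplete dates in the "date" column so that Augur accepts them:
# YYYY becomes YYYY-XX-XX and YYYY-MM becomes YYYY-MM-XX.
# Assumption: the column is named "date".
pad_dates() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "date") col = i; print; next }
    col && $col ~ /^[0-9][0-9][0-9][0-9]$/             { $col = $col "-XX-XX" }
    col && $col ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]$/  { $col = $col "-XX" }
    { print }
  '
}

# Example use (file names are placeholders):
#   gzip -dc metadata.tsv.gz | select_accessions include.txt | pad_dates > metadata.tsv
```

Matching on the accession column with awk, rather than a plain grep over whole lines, avoids keeping rows where an accession merely appears inside another field.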

Hello! Feel free to borrow any of the existing workflows for creating metadata.tsv and sequences.fasta files. I’m not sure which virus you’re working on, but you can swap in your own “ncbi_taxon_id” in the config.yaml file. That way you should be able to run:

# Clone one of the existing pathogen workflows, using mumps as an example
git clone https://github.com/nextstrain/mumps.git
cd mumps
# change ncbi_taxon_id to your virus of interest in ingest/defaults/config.yaml
nextstrain build ingest
ls ingest/results/metadata.tsv
ls ingest/results/sequences.fasta

Or better yet, we have some documentation on creating your own ingest workflow here

When building a tree for nextclade_data, you can specify the NCBI GenBank entries to keep by listing their accessions, one per line, in an include.txt file:

augur filter \
  --sequences ingest/results/sequences.fasta \
  --metadata ingest/results/metadata.tsv \
  --metadata-id-columns accession \
  --exclude-all \
  --include include.txt \
  --output-sequences filtered.fasta \
  --output-metadata filtered.tsv
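
If you already have a FASTA of hand-selected sequences, include.txt can be derived from its headers rather than typed by hand. The following is a minimal sketch under stated assumptions: each header starts with the GenBank accession as its first whitespace-delimited token (as in NCBI downloads), and the accession column in the ingest metadata holds unversioned accessions, so a trailing version such as .1 is stripped. If your metadata uses versioned accessions, drop the sub() call. The function name fasta_to_include and the file name selected.fasta are hypothetical.

```shell
# Print one accession per FASTA record, sorted and de-duplicated.
# Assumption: the first token of each ">" header is the accession,
# optionally followed by a ".N" version that should be removed.
fasta_to_include() {
  awk '/^>/ {
    id = substr($1, 2)            # drop the leading ">"
    sub(/\.[0-9]+$/, "", id)      # strip a trailing version like ".1"
    if (id != "") print id
  }' "$1" | sort -u
}

# Example use (file names are placeholders):
#   fasta_to_include selected.fasta > include.txt
```

It is worth comparing the number of lines in include.txt with the number of rows in filtered.tsv afterwards, since an accession that is missing from the metadata is an easy mismatch to overlook.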