Hi Albert,
This is a great question! We have a small tutorial for this: Preparing Your Metadata — Augur 23.1.1 documentation
For Augur you generally need a separate sequences.fasta file and a metadata.tsv file, where the sequence headers from the FASTA also appear in a column of the metadata so that the two can be linked.
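If you want to sanity-check that linkage before running anything, a small script along these lines can help. This is only a sketch and not part of Augur: the function names are made up for illustration, and it assumes the ID column in your metadata is called strain (adjust id_column if yours differs).

```python
import csv
import io


def fasta_ids(fasta_text):
    """Collect the ID (first word after '>') of every FASTA record."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def unmatched_ids(fasta_text, metadata_tsv, id_column="strain"):
    """Return (IDs only in the FASTA, IDs only in the metadata).

    The column name "strain" is an assumption; pass id_column to change it.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    metadata_ids = {row[id_column] for row in reader}
    sequence_ids = set(fasta_ids(fasta_text))
    return sorted(sequence_ids - metadata_ids), sorted(metadata_ids - sequence_ids)


# Made-up example data: OS124 has no metadata row, OS999 has no sequence.
fasta_text = ">OS123\nACGT\n>OS124\nACGT\n"
metadata_tsv = (
    "strain\tdate\tcountry\n"
    "OS123\t2023-10-01\tAustralia\n"
    "OS999\t2023-09-01\tFiji\n"
)
only_in_fasta, only_in_metadata = unmatched_ids(fasta_text, metadata_tsv)
print("Missing from metadata:", only_in_fasta)
print("Missing from FASTA:", only_in_metadata)
```

For real files you would read the text with open(...).read() instead of using inline strings. Both lists being empty means every sequence has a metadata row and vice versa.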
Often, GenBank and GISAID sequence exports contain metadata inside the FASTA sequence headers, e.g. OS123|2023-10-01|Australia. Augur offers the augur parse command as a convenience to split that into the required FASTA and metadata TSV.
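To make it concrete what that split amounts to, here is a rough pure-Python sketch of the idea. It is not Augur's implementation, just an illustration: the function name is hypothetical, and the field names strain, date and country are assumptions about what the three parts of the example header mean.

```python
import csv
import io


def split_fasta_headers(fasta_text, fields, separator="|"):
    """Split delimited FASTA headers into a plain FASTA and a metadata TSV.

    The first field is kept as the sequence ID so both outputs can be linked.
    """
    fasta_lines = []
    rows = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            values = line[1:].strip().split(separator)
            if len(values) != len(fields):
                raise ValueError(f"Expected {len(fields)} fields in header: {line}")
            rows.append(dict(zip(fields, values)))
            fasta_lines.append(">" + values[0])
        elif line.strip():
            fasta_lines.append(line.strip())

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return "\n".join(fasta_lines) + "\n", buffer.getvalue()


# The header from the example above; the field names are assumed.
example = ">OS123|2023-10-01|Australia\nACGTACGT\n"
fasta_out, metadata_out = split_fasta_headers(example, ["strain", "date", "country"])
print(fasta_out)
print(metadata_out)
```

In practice augur parse does this for you: you give it the input FASTA, the list of field names in header order, and the two output paths. Check augur parse --help for the exact options in your version.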
It might do the job just fine for a start. At Nextstrain, we often do further processing of the metadata in so-called ingest workflows to make it less annoying to work with, e.g. renaming fields and changing values. But this isn’t strictly necessary.
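As a flavour of what that kind of clean-up looks like, here is a minimal sketch that renames columns and trims stray whitespace from values. It is not taken from any Nextstrain workflow: the source column names in RENAME are hypothetical, so replace them with whatever your export actually contains.

```python
import csv
import io

# Hypothetical mapping from export column names to the names you want.
RENAME = {
    "Accession": "strain",
    "Collection_Date": "date",
    "Geo_Location": "country",
}


def tidy_metadata(metadata_tsv, rename=RENAME):
    """Rename columns and strip surrounding whitespace from every value."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    in_fields = reader.fieldnames
    out_fields = [rename.get(name, name) for name in in_fields]

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(out_fields)
    for row in reader:
        # Missing values come back as None for short rows; treat them as empty.
        writer.writerow([(row[name] or "").strip() for name in in_fields])
    return buffer.getvalue()


# Made-up raw export with untidy column names and padded values.
raw = "Accession\tCollection_Date\tGeo_Location\nOS123\t 2023-10-01 \tAustralia\n"
tidied = tidy_metadata(raw)
print(tidied)
```

Columns that are not listed in the mapping pass through unchanged, so you can start with a short mapping and extend it as needed.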
@joverlee has written a lot of tooling to make such data massaging easier with the augur curate command. There are also various scripts we’ve developed in various repos that you could use for inspiration.
Some repositories for inspiration:
- Mpox ingest that takes data from Genbank/NCBI and produces sequences.fasta and metadata.tsv: mpox/ingest at master · nextstrain/mpox · GitHub
- General template to use as a starting point for “ingest”: pathogen-repo-template/ingest at main · nextstrain/pathogen-repo-template · GitHub
- hepatitisB ingest: hepatitisB/ingest at 7bd1b05e55a7fe0179195d13d44abaf40755d1ef · nextstrain/hepatitisB · GitHub
- Dengue ingest: dengue/ingest/README.md at f513d319055c706a11370b76a95fdad729edc1cc · nextstrain/dengue · GitHub
I hope this helps! Feel free to make a new post if you hit particular challenges!
Best,
Cornelius