How to download avian influenza fasta and metadata files from GISAID or GenBank in a compatible format?

Hi Albert,

This is a great question! We have a small tutorial for this: Preparing Your Metadata — Augur 23.1.1 documentation

For Augur you generally need a separate sequences.fasta file and a metadata.tsv file, where the sequence headers from the FASTA also appear in a column of the metadata so that the two can be linked.
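
To make the linkage concrete, here is a minimal Python sketch that checks whether every FASTA header has a matching metadata row. The ID column name `strain`, the helper names, and the sample records are all assumptions for illustration; substitute whatever ID column your metadata actually uses.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Return the header (without the leading '>') of each FASTA record."""
    return [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]


def missing_from_metadata(fasta_text, metadata_tsv, id_column="strain"):
    """Return FASTA headers that have no matching value in the ID column.

    `id_column` is an assumed column name, not something Augur mandates here.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    known = {row[id_column] for row in reader}
    return [seq_id for seq_id in fasta_ids(fasta_text) if seq_id not in known]


# Made-up example data: OS124 has a sequence but no metadata row.
fasta = ">OS123\nACGT\n>OS124\nGGCA\n"
metadata = "strain\tdate\tcountry\nOS123\t2023-10-01\tAustralia\n"

print(missing_from_metadata(fasta, metadata))  # ['OS124']
```

An empty list means every sequence can be linked to a metadata row.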

Often, GenBank and GISAID sequence exports contain metadata inside the FASTA sequence headers, e.g. OS123|2023-10-01|Australia. Augur offers the augur parse command as a convenience to split that into the required FASTA and metadata TSV.
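
To show what that split amounts to, here is a small Python sketch of the idea. It is not Augur's implementation, only an illustration: the function name `split_fasta` and the field names `strain`, `date`, and `country` are assumptions chosen to match the example header above. For real data, use augur parse itself and check `augur parse --help` for its actual options.

```python
def split_fasta(fasta_text, fields, separator="|"):
    """Split annotated FASTA headers into sequences and metadata records.

    The first field is used as the sequence ID that links the two outputs.
    Returns (sequences, metadata): a dict of ID -> sequence, and a list of
    dicts with one entry per record.
    """
    sequences = {}
    metadata = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: split on the separator and name each part.
            values = line[1:].split(separator)
            record = dict(zip(fields, values))
            current = record[fields[0]]
            metadata.append(record)
            sequences[current] = ""
        elif current is not None:
            # Sequence line: append to the current record (handles wrapping).
            sequences[current] += line
    return sequences, metadata


# Made-up record using the header format from the example above.
raw = ">OS123|2023-10-01|Australia\nACGT\nTTGA\n"
seqs, meta = split_fasta(raw, ["strain", "date", "country"])

print(seqs)  # {'OS123': 'ACGTTTGA'}
print(meta)  # [{'strain': 'OS123', 'date': '2023-10-01', 'country': 'Australia'}]
```

The new FASTA would then use only the ID as its header, and the metadata records would be written out as rows of a TSV with that same ID in one column.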

It might do the job just fine for a start. At Nextstrain, we often do further processing of the metadata in so-called ingest workflows to make it less annoying to work with, e.g. renaming fields and changing values. But this isn’t strictly necessary.

@joverlee has written a lot of tooling to make such data massaging easier with the augur curate command. There are also scripts we’ve developed across various repos that you could use for inspiration.

Some repositories for inspiration:

I hope this helps! Feel free to make a new post if you hit particular challenges!

Best,

Cornelius
