How to download avian influenza fasta and metadata files from GISAID or GenBank in a compatible format?

Hello,

I am trying to download my FASTA file and metadata file from GISAID in a format compatible with the Augur pipeline to create my own database. However, it seems this is only possible from the SARS-CoV-2 database. Do you have any advice on how I can do this with avian influenza data as well? Similarly, I am wondering how to download data from GenBank in a compatible format as well.

Thank you very much,
Albert


Hi Albert,

This is a great question! We have a small tutorial for this: Preparing Your Metadata — Augur 23.1.1 documentation

For Augur you generally need a sequences.fasta file and a separate metadata.tsv file, where each sequence header in the FASTA also appears in a column of the metadata so that the two can be linked.
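To illustrate the linking requirement, here is a minimal Python sketch that compares FASTA headers against a metadata column and reports names present in only one of the two. The function names are made up for this example, and it assumes the ID column is called `strain` and that the whole header line after `>` is the sequence name; adjust both to match your files.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Collect sequence names from FASTA header lines (everything after '>')."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            if name:
                ids.append(name)
    return ids


def metadata_ids(tsv_text, id_column="strain"):
    """Collect the values of the ID column from tab-separated metadata."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[id_column] for row in reader]


def compare_ids(fasta_text, tsv_text, id_column="strain"):
    """Return (names missing from metadata, names missing from FASTA)."""
    seq = set(fasta_ids(fasta_text))
    meta = set(metadata_ids(tsv_text, id_column))
    return sorted(seq - meta), sorted(meta - seq)
```

Calling `compare_ids` on the contents of your two files returns two lists. If both are empty, every sequence has a metadata row and vice versa.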

Often, GenBank and GISAID sequence exports contain metadata inside the FASTA sequence headers, e.g. OS123|2023-10-01|Australia. Augur offers the augur parse command as a convenience to split such headers into the required FASTA and metadata TSV files.
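To show the idea behind splitting such headers, here is a rough Python sketch. It is not how augur parse is implemented, just an illustration of the transformation. The function name and the field names `strain`, `date` and `country` are assumptions chosen to match the example header above; the first field is taken as the sequence name.

```python
import csv
import io


def split_headers(fasta_text, fields, separator="|"):
    """Split delimited FASTA headers into (new FASTA text, metadata TSV text).

    The first entry in `fields` is used as the new sequence name.
    """
    fasta_out = []
    rows = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            values = line[1:].strip().split(separator)
            if len(values) != len(fields):
                raise ValueError(f"Expected {len(fields)} fields in header: {line}")
            record = dict(zip(fields, values))
            rows.append(record)
            # Keep only the name in the FASTA header.
            fasta_out.append(">" + record[fields[0]])
        else:
            # Sequence lines pass through unchanged.
            fasta_out.append(line)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return "\n".join(fasta_out) + "\n", buffer.getvalue()
```

For real data, augur parse is likely the better choice, since it is maintained alongside the rest of the pipeline. The sketch is mainly useful for seeing what the two output files should look like.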

It might do the job just fine for a start. At Nextstrain, we often process the metadata further in so-called ingest workflows to make it easier to work with, e.g. by renaming fields and changing values. But this isn’t strictly necessary.

@joverlee has written a lot of tooling to make such data massaging easier with the augur curate command. There are also scripts we’ve developed across several repos that you could use for inspiration.

Some repositories for inspiration:

I hope this helps! Feel free to make a new post if you hit particular challenges!

Best,

Cornelius
