All I need is the seqName and Nextclade_pango columns (ideally in TSV format) for recent sequences from GISAID, ideally covering the last few months (currently July 2022 onwards by submission date) and topped up every few days.
This seems like Mission Impossible: a complex data-wrangling challenge (for a newbie) plus endless processing. Are there any shortcuts I’m missing?
Thanks. But to get to your starting position of having a complete Nextclade TSV output, do I have to download the entire GISAID dataset and process it through Nextclade? It seems like such a huge effort (in processing and time).
@mike_honey I’m sorry we can’t offer the metadata.tsv due to GISAID’s data-sharing terms of use. You can apply for API access with GISAID, then you’d get all the sequences and could run Nextclade yourself - it should take ~10 hours on a 10-CPU machine.
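Once a Nextclade TSV exists, cutting it down to the two columns asked for is the easy part. Here is a minimal sketch in Python, assuming the output is a tab-separated file with a header row that includes seqName and Nextclade_pango. The function name and file paths are hypothetical, and this is not necessarily how anyone in this thread did it.

```python
import csv


def extract_columns(src_path, dst_path, columns=("seqName", "Nextclade_pango")):
    """Copy only the named columns from a tab-separated file into a new TSV.

    Returns the number of data rows written. The default column names are
    the ones mentioned in this thread; the paths are whatever you pass in.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")

        # Fail early if the header does not contain the expected columns.
        header = reader.fieldnames or []
        missing = [name for name in columns if name not in header]
        if missing:
            raise KeyError(f"columns not found in {src_path}: {missing}")

        # extrasaction="ignore" drops every column we did not ask for.
        writer = csv.DictWriter(
            dst,
            fieldnames=list(columns),
            delimiter="\t",
            extrasaction="ignore",
            lineterminator="\n",
        )
        writer.writeheader()

        rows = 0
        for row in reader:  # streams row by row, so large files are fine
            writer.writerow(row)
            rows += 1
    return rows
```

For example, `extract_columns("nextclade.tsv", "seqname_pango.tsv")` would write a two-column file, where both file names are placeholders. It reads one row at a time, so memory use should stay flat even on a very large input.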
Hi Cornelius, thanks for your guidance on this.
The metadata.tsv file I can download from GISAID does not have a Nextclade_pango column, as you expected.
I made a little repo of the scripts I’ve used to achieve this and some process notes, in case it’s useful for anyone else wanting to achieve something similar: