All I need is the seqName and Nextclade_pango columns (ideally in tsv format) for the recent sequences from GISAID. Ideally for the last few months e.g. currently July 2022 onwards by Submission date, topped up every few days.
This seems like Mission Impossible: a complex data-wrangling challenge (for a newbie) and endless processing. Are there any shortcuts I’m missing?
Could seqkit be used to filter the GISAID FASTA download?
Here is what I would do:
- open Nextclade TSV output (or Nextstrain’s metadata.tsv*) in Excel or any other spreadsheet software
- sort rows by date column (Sort data in a table - Microsoft Support)
- filter rows starting from the date you need (“recent”)
- select headers of columns you want to remove and press “Delete” key
- rearrange column order as needed
For the command line there are several tools that can do the same thing with a single command:
And you could write a Python script using Pandas if you need more complex processing.
* can be downloaded from GISAID; requires an account and potentially some additional permissions
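The spreadsheet steps above could be sketched in Pandas roughly as follows. This is only a minimal sketch: the `date_submitted` column name and the tiny in-memory table are assumptions for illustration (a plain Nextclade TSV may not carry a date column at all, and Nextstrain-style metadata may name the sequence column differently), so adjust the names to whatever your file actually contains.

```python
import io

import pandas as pd


def recent_pango(tsv, since, date_col="date_submitted",
                 columns=("seqName", "Nextclade_pango")):
    """Keep rows dated on or after `since` and return only `columns`.

    `date_col` and `columns` are assumed names; change them to match your file.
    """
    df = pd.read_csv(tsv, sep="\t", dtype=str)
    # Unparseable or missing dates become NaT and are dropped by the comparison.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    keep = dates >= pd.Timestamp(since)
    return df.loc[keep, list(columns)].reset_index(drop=True)


# Made-up sample standing in for a real TSV file.
sample = io.StringIO(
    "seqName\tdate_submitted\tclade\tNextclade_pango\n"
    "seqA\t2022-06-30\t21L\tBA.2\n"
    "seqB\t2022-07-01\t22B\tBA.5.1\n"
    "seqC\t2022-08-15\t22B\tBA.5.2\n"
)
result = recent_pango(sample, "2022-07-01")
print(result.to_csv(sep="\t", index=False))
```

With a real file you would pass its path instead of `sample`, and pass a file name to `to_csv` to write the two-column TSV to disk.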
Thanks. But to get to your starting position of having a complete Nextclade TSV output, I have to download and process the entire GISAID dataset through nextclade? It seems such a huge effort (processing and time).
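One possible way to cut down the processing, sketched here under assumptions, is to filter the FASTA download to recent sequences first and run Nextclade only on that subset. The sketch assumes you already have a set of wanted sequence names (for example, taken from the GISAID metadata filtered by submission date) and that each FASTA header matches such a name exactly; real GISAID headers may need extra parsing. `filter_fasta` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
import io


def filter_fasta(fasta_in, fasta_out, wanted):
    """Copy only the records whose header (text after '>') is in `wanted`.

    Streams line by line, so the whole file never has to fit in memory.
    Returns the number of records kept.
    """
    keep = False
    kept = 0
    for line in fasta_in:
        if line.startswith(">"):
            # Assumption: the full header line is the sequence name.
            keep = line[1:].strip() in wanted
            if keep:
                kept += 1
        if keep:
            fasta_out.write(line)
    return kept


# Made-up records standing in for a real FASTA download.
fasta = io.StringIO(">seqA\nACGT\n>seqB\nAC\nGT\n>seqC\nTTTT\n")
subset = io.StringIO()
n_kept = filter_fasta(fasta, subset, {"seqB", "seqC"})
print(n_kept)
print(subset.getvalue())
```

With real files you would pass open file handles instead of the `StringIO` objects.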
@mike_honey On GISAID you can download metadata.tsv. It is a result of the nextstrain/ncov pipeline and should contain the columns you need.
@ivan-aksamentov @mike_honey Unfortunately I don’t think it’s true that GISAID hosts the
metadata.tsv we produce in
If they do (I can’t find it, it doesn’t seem to be available to me at least) then it may not have the Nextclade_pango column. Can you check?
Here’s all I can download:
@mike_honey I’m sorry we can’t offer the
Hi Cornelius, thanks for your guidance on this.
The metadata.tsv file I can download from GISAID does not have a Nextclade_pango column, as you expected.
I made a little repo of the scripts I’ve used to achieve this and some process notes, in case it’s useful for anyone else wanting to achieve something similar: