Nextclade CLI - shortcuts to get just seqName and Nextclade_pango for all recent GISAID samples

All I need is the seqName and Nextclade_pango columns (ideally in TSV format) for the recent sequences from GISAID. Ideally this would cover the last few months (currently July 2022 onwards, by Submission date) and be topped up every few days.

This seems like Mission Impossible: a complex data wrangling challenge (for a newbie) and endless processing. Are there any shortcuts I’m missing?

Perhaps seqkit to filter the GISAID FASTA download?

Here is what I would do:

  • open Nextclade TSV output (or Nextstrain’s metadata.tsv*) in Excel or any other spreadsheet software
  • sort rows by date column (Sort data in a table - Microsoft Support)
  • filter rows starting from the date you need (“recent”)
  • select headers of columns you want to remove and press “Delete” key
  • rearrange column order as needed

On the command line, several tools can do the same thing with a single command.

And you could write a Python script using Pandas if you need more complex processing.


* - can be downloaded from GISAID; requires an account and potentially some additional permissions
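
For anyone who prefers a script to a spreadsheet, the steps above (filter by date, sort, keep only the wanted columns) could look roughly like the sketch below. It uses Python's standard `csv` module rather than Pandas to stay dependency-free. The function name `extract_recent` and the date column name passed to it are assumptions for illustration; check which date column your TSV actually contains. Rows with incomplete dates (for example just a year) are skipped.

```python
import csv
from datetime import date


def extract_recent(in_path, out_path, date_column, since,
                   keep_columns=("seqName", "Nextclade_pango")):
    """Write only `keep_columns` for rows dated on or after `since` (YYYY-MM-DD).

    `date_column` is whatever date column the input TSV has; its name is an
    assumption supplied by the caller, not something this sketch can know.
    """
    cutoff = date.fromisoformat(since)
    kept = []
    with open(in_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src, delimiter="\t"):
            try:
                row_date = date.fromisoformat(row[date_column])
            except (ValueError, TypeError):
                # Incomplete or missing date (e.g. "2022"): skip the row.
                continue
            if row_date >= cutoff:
                kept.append((row_date, [row[c] for c in keep_columns]))

    # Oldest first, mirroring the "sort rows by date column" step.
    kept.sort(key=lambda item: item[0])

    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(keep_columns)
        writer.writerows(values for _, values in kept)
    return len(kept)
```

A call might look like `extract_recent("nextclade.tsv", "recent.tsv", "date_submitted", "2022-07-01")`, where both file names and the `date_submitted` column are placeholders.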

Thanks. But to get to your starting position of having a complete Nextclade TSV output, I have to download and process the entire GISAID dataset through nextclade? It seems such a huge effort (processing and time).

@mike_honey On GISAID you can download metadata.tsv. It is the result of the nextstrain/ncov pipeline and should contain the columns you need.

@ivan-aksamentov @mike_honey Unfortunately I don’t think it’s true that GISAID hosts the metadata.tsv we produce in ncov-ingest.

If they do (I can’t find it, it doesn’t seem to be available to me at least) then it may not have the Nextclade_pango column. Can you check?

Here’s all I can download:

@mike_honey I’m sorry we can’t offer the metadata.tsv due to GISAID’s terms of use on data sharing. You can apply for API access with GISAID, then you’d get all the sequences and could run Nextclade yourself. It should take ~10 hours on a 10 CPU machine.
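
If only recent samples matter, one way to cut that processing time might be to trim the FASTA to the wanted sequences before running Nextclade on it. The sketch below is a plain-Python stand-in for that kind of filtering, not a Nextclade or seqkit feature. It assumes you already have the set of wanted sequence names (for example from the GISAID metadata, filtered by Submission date) and that each name matches the full FASTA header line after the `>`; `filter_fasta` is a hypothetical helper name.

```python
def filter_fasta(in_path, out_path, wanted_names):
    """Copy only the FASTA records whose header is in `wanted_names`.

    Assumption: a record's name is the entire header line after ">",
    and the names in `wanted_names` use exactly the same form.
    """
    wanted = set(wanted_names)
    written = 0
    keep = False
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith(">"):
                # New record: decide once whether to keep it.
                keep = line[1:].strip() in wanted
                written += keep
            if keep:
                # Header and all following sequence lines of a kept record.
                dst.write(line)
    return written
```

The file is streamed line by line, so memory use stays small even for a very large download. The returned count can be compared against the number of wanted names to spot headers that did not match.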

Hi Cornelius, thanks for your guidance on this.
The metadata.tsv file I can download from GISAID does not have a Nextclade_pango column, as you expected.

I made a little repo of the scripts I’ve used to achieve this and some process notes, in case it’s useful for anyone else wanting to achieve something similar:
