Nextclade CLI - shortcuts to get just seqName and Nextclade_pango for all recent GISAID samples

All I need are the seqName and Nextclade_pango columns (ideally in TSV format) for recent sequences from GISAID. Ideally this would cover the last few months (currently July 2022 onwards, by submission date) and be topped up every few days.

This seems like Mission Impossible: a complex data-wrangling challenge (for a newbie) and endless processing. Are there any shortcuts I’m missing?

Could seqkit be used to filter the GISAID FASTA download?

Here is what I would do:

  • open Nextclade TSV output (or Nextstrain’s metadata.tsv*) in Excel or any other spreadsheet software
  • sort rows by the date column (see “Sort data in a table” on Microsoft Support)
  • filter rows starting from the date you need (“recent”)
  • select the headers of the columns you want to remove and press the “Delete” key
  • rearrange column order as needed

On the command line, there are several tools that can do the same thing with a single command.

And you could write a Python script using Pandas if you need more complex processing.
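As a rough illustration of the Pandas route, here is a minimal sketch that mirrors the spreadsheet steps above: keep rows from a given date, sort them, and write out only the two columns. The function name extract_recent and the file paths are hypothetical. The column names seqName and Nextclade_pango are taken from the question, while date_submitted is only an assumed name for the date column, so check the header of your own file and adjust the parameters.

```python
import pandas as pd


def extract_recent(tsv_path, out_path, since="2022-07-01",
                   name_col="seqName", lineage_col="Nextclade_pango",
                   date_col="date_submitted"):
    """Write the name and lineage columns for rows dated on or after `since`.

    `date_col` is an assumed column name; change it to match your file.
    """
    # Load only the three columns we need, as plain strings, to save memory.
    df = pd.read_csv(tsv_path, sep="\t", dtype=str,
                     usecols=[name_col, lineage_col, date_col])

    # Missing or unparseable dates become NaT and fail the comparison below,
    # so those rows are dropped.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    recent = df[df[date_col] >= pd.Timestamp(since)].sort_values(date_col)

    # Keep just the two requested columns and save as TSV.
    result = recent[[name_col, lineage_col]]
    result.to_csv(out_path, sep="\t", index=False)
    return result


if __name__ == "__main__":
    # Hypothetical file names; replace with your own.
    extract_recent("nextclade.tsv", "recent_pango.tsv", since="2022-07-01")
```

For the periodic top-up, the same script can simply be rerun on a fresh download, optionally with a later value for since.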


* - can be downloaded from GISAID; requires an account and potentially some additional permissions