Nextclade cli - shortcuts to get just seqName and Nextclade_pango for all recent GISAID samples

mike_honey · October 15, 2022, 12:28pm

All I need is the seqName and Nextclade_pango columns (ideally in tsv format) for the recent sequences from GISAID. Ideally for the last few months e.g. currently July 2022 onwards by Submission date, topped up every few days.

This seems like Mission Impossible: a complex data wrangling challenge (for a newbie) and endless processing. Are there any shortcuts I’m missing?

mike_honey · October 26, 2022, 7:28am

seqkit to filter the GISAID fasta download?

ivan-aksamentov · November 14, 2022, 9:58am

Here is what I would do:

open Nextclade TSV output (or Nextstrain’s metadata.tsv*) in Excel or any other spreadsheet software
sort rows by date column (Sort data in a table - Microsoft Support)
filter rows starting from the date you need (“recent”)
select headers of columns you want to remove and press “Delete” key
rearrange column order as needed

On the command line, there are several tools that can do the same thing with a single command:

GitHub - BurntSushi/xsv: A fast CSV command line toolkit written in Rust.
csvkit 1.0.7 — csvkit 1.0.7 documentation
GitHub - johnkerl/miller: Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

And you could write a Python script using Pandas if you need more complex processing.

–
* - can be downloaded on GISAID; requires account and potentially some additional permissions
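
The sort/filter/drop-columns steps above can also be sketched as a short script. This is only an illustration, not a tested pipeline: it uses Python's standard `csv` module instead of Pandas so that a very large TSV can be streamed row by row, and the function name `extract_recent`, the file names, and the `date_submitted` column name are assumptions. Only `seqName` and `Nextclade_pango` are taken from the question, and a Nextclade TSV may not contain a date column at all, in which case the date filter should be left off.

```python
import csv
from datetime import date


def extract_recent(in_path, out_path, keep=("seqName", "Nextclade_pango"),
                   date_col=None, since=None):
    """Stream a TSV and write only the `keep` columns.

    If both `date_col` and `since` are given, rows whose date is earlier
    than `since`, missing, or not a full YYYY-MM-DD date are skipped.
    Returns the number of data rows written.
    """
    written = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE: treat quote characters in TSV fields as ordinary text.
        reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.DictWriter(dst, fieldnames=list(keep), delimiter="\t",
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        for row in reader:
            if date_col is not None and since is not None:
                try:
                    if date.fromisoformat(row[date_col]) < since:
                        continue
                except (KeyError, ValueError, TypeError):
                    continue  # missing or incomplete date: skip the row
            writer.writerow(row)
            written += 1
    return written


if __name__ == "__main__":
    # Hypothetical file and column names; adjust to the actual TSV header.
    count = extract_recent("nextclade.tsv", "recent_pango.tsv",
                           date_col="date_submitted", since=date(2022, 7, 1))
    print(f"wrote {count} rows")
```

Streaming keeps memory use flat regardless of file size; if sorting or joins are needed, loading the reduced output into Pandas afterwards is probably easier than loading the full file.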

mike_honey · December 20, 2022, 9:21pm

Thanks. But to get to your starting position of having a complete Nextclade TSV output, I have to download and process the entire GISAID dataset through nextclade? It seems such a huge effort (processing and time).

ivan-aksamentov · December 22, 2022, 1:18pm

@mike_honey On GISAID you can download metadata.tsv. It is a result of the nextstrain/ncov pipeline and should contain the columns you need.

corneliusroemer · December 22, 2022, 2:12pm

@ivan-aksamentov @mike_honey Unfortunately I don’t think it’s true that GISAID hosts the metadata.tsv we produce in ncov-ingest.

If they do (I can’t find it, it doesn’t seem to be available to me at least) then it may not have the Nextclade_pango column. Can you check?

Here’s all I can download:

@mike_honey I’m sorry we can’t offer the metadata.tsv due to GISAID’s terms of use of data sharing. You can apply for API access with GISAID, then you’d get all the sequences and could run Nextclade yourself - it should take ~10hr on a 10 CPU machine.

mike_honey · December 27, 2022, 9:28pm

Hi Cornelius, thanks for your guidance on this.
The metadata.tsv file I can download from GISAID does not have a Nextclade_pango column, as you expected.

mike_honey · January 14, 2023, 11:42pm

I made a little repo of the scripts I’ve used to achieve this and some process notes, in case it’s useful for anyone else wanting to achieve something similar:
