seqName format different between GISAID FASTA All sequences package vs search results

mike_honey · January 5, 2023, 8:44am

I noticed the seqName column in the tsv output from nextclade CLI seems to come in two different formats:

the output from GISAID’s FASTA “All sequences” package has the format: [Virus name]|[Collection date]|[Submission date]
the output from GISAID search results downloaded in FASTA format shows [Virus name]||[Accession ID]||[Collection date]

I’m guessing this might be an artefact of how the data comes out of GISAID? If so it would be good to confirm, so others are aware and don’t waste time on this. My starting assumption was seqName would be a consistent key to the GISAID record, regardless of source.
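For anyone who needs to handle both layouts in one pipeline, the two formats described above can be told apart by looking at the middle field. The following is only a minimal sketch: the function name `parse_gisaid_seq_name` is hypothetical, and it assumes that GISAID accession IDs start with `EPI_ISL_` and that virus names never contain a `|` character.

```python
def parse_gisaid_seq_name(seq_name):
    """Split a GISAID-style seqName into labelled fields.

    Assumptions (not confirmed by GISAID or Nextclade):
    - fields are separated by "|" or "||"
    - accession IDs start with "EPI_ISL_"
    """
    # Dropping empty pieces lets "a||b||c" and "a|b|c" be handled the same way.
    parts = [p.strip() for p in seq_name.split("|") if p.strip()]
    if len(parts) != 3:
        return {"layout": "unknown", "virus_name": parts[0] if parts else ""}

    virus_name, second, third = parts
    if second.startswith("EPI_ISL_"):
        # Search-results download: name, accession ID, collection date.
        return {
            "layout": "search_results",
            "virus_name": virus_name,
            "accession_id": second,
            "collection_date": third,
        }
    # "All sequences" package: name, collection date, submission date.
    return {
        "layout": "all_sequences",
        "virus_name": virus_name,
        "collection_date": second,
        "submission_date": third,
    }
```

Only the search-results layout carries an accession ID, so the "All sequences" layout still needs a metadata lookup to get a stable key.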

ivan-aksamentov · January 7, 2023, 11:59pm

Nextclade passes sequence names through as is. We never attempt to parse them, because they are a complete and utter mess, as you see.

That’s definitely a problem with the fasta source. These kinds of discrepancies happen on GISAID and in other genomic databases.

Note also that the sequence name, no matter how it is laid out, is not guaranteed to be unique. There can be duplicate names (in practice, there are many). Internally, Nextclade relies on the index of the entry in the fasta file to identify samples uniquely and to ensure that rows/entries/elements in the output tsv, fasta and json files relate to the same thing, no matter how it is named.

But this does not hold automatically if you want to correlate inputs and outputs of Nextclade yourself. Due to parallel processing, the order of outputs can differ from the order of inputs. If you suspect there are problems with uniqueness, and you want to correlate inputs and outputs, it is a good idea to add the --in-order flag so that Nextclade preserves the same order in outputs as in inputs (with a small runtime performance penalty).
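As a rough illustration of matching by position rather than by name, here is a minimal sketch. It assumes the run used --in-order, that the tsv has exactly one row per input sequence, and that seqName holds the full fasta header line; the helper names `read_fasta_names` and `pair_by_position` are made up for this example.

```python
import csv
import io


def read_fasta_names(handle):
    """Return the fasta header lines (without the leading '>') in file order."""
    return [line[1:].strip() for line in handle if line.startswith(">")]


def pair_by_position(fasta_handle, tsv_handle):
    """Pair each input sequence with the output row at the same position.

    Only meaningful when the output order matches the input order
    (assumed here: the run used --in-order).
    """
    names = read_fasta_names(fasta_handle)
    rows = list(csv.DictReader(tsv_handle, delimiter="\t"))
    if len(names) != len(rows):
        raise ValueError(f"{len(names)} input sequences but {len(rows)} output rows")

    paired = []
    for position, (name, row) in enumerate(zip(names, rows)):
        # A mismatch means the order was not preserved, so stop early.
        if row["seqName"] != name:
            raise ValueError(f"order mismatch at position {position}")
        paired.append((position, name, row))
    return paired


# Duplicate names are still paired correctly because position is the key.
# The "clade" values are placeholders, not real Nextclade output.
demo_fasta = io.StringIO(">dup\nACGT\n>dup\nACGA\n>other\nAC\n")
demo_tsv = io.StringIO("seqName\tclade\ndup\tA\ndup\tB\nother\tC\n")
demo_pairs = pair_by_position(demo_fasta, demo_tsv)
```

The name check acts as a guard: if the outputs were written in a different order, the function fails loudly instead of silently pairing the wrong rows.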

mike_honey · January 14, 2023, 11:40pm

Thanks for confirming, Ivan.
In case someone else stumbles into this issue, I wrote a script to “fix” them by reference to a GISAID metadata.tsv file:

github.com

Mike-Honey/nextclade-tools/blob/main/nextclade-fix-seqName.py

import datetime
import os
import subprocess
import pandas


def main():
    """
    Main - program execute
    Fixes the seqName column in nextclade output. For output sourced from GISAID's "All sequences" FASTA file (output as sequences.tsv), the format is different.
    Look up the consistent value using the metadata.tsv file (from GISAID), assuming that it is at least as up-to-date as your sequences.tsv file.
    """
    print (str(datetime.datetime.now()) + ' Starting ...')
    datadir = 'C:/Dev/nextclade/'
    filename_metadata = 'C:/Dev/sars-cov-2-genomes/metadata.tsv'
    needed_columns_metadata = ['Virus name', 'Accession ID', 'Collection date', 'Submission date']

    print (str(datetime.datetime.now()) + ' Reading and deriving columns from: ' + filename_metadata)
    df_metadata_in = pandas.read_csv(filename_metadata, usecols = needed_columns_metadata, sep="\t")
    df_metadata_in['seqName new'] = df_metadata_in['Virus name'] + '|' + df_metadata_in["Accession ID"] + '|' + df_metadata_in["Collection date"]

This file has been truncated.
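The rest of the script is not shown here. As a rough idea of the kind of lookup the docstring describes, and not the author's actual code, the sketch below uses only the standard library. It assumes the "All sequences" seqName is `Virus name|Collection date|Submission date` and reuses the metadata column names from the excerpt; the function names `build_seq_name_lookup` and `fix_seq_name` are hypothetical.

```python
import csv
import io


def build_seq_name_lookup(metadata_handle):
    """Map "All sequences" style names to the name|accession|date form.

    Returns (lookup, ambiguous). Old-style names that match more than one
    metadata row are left out of the lookup, since they cannot be resolved.
    """
    lookup = {}
    ambiguous = set()
    for row in csv.DictReader(metadata_handle, delimiter="\t"):
        old = "|".join([row["Virus name"], row["Collection date"], row["Submission date"]])
        new = "|".join([row["Virus name"], row["Accession ID"], row["Collection date"]])
        if old in lookup and lookup[old] != new:
            ambiguous.add(old)
        lookup[old] = new
    for key in ambiguous:
        del lookup[key]
    return lookup, ambiguous


def fix_seq_name(seq_name, lookup):
    """Return the consistent name if one is known, else the original name."""
    return lookup.get(seq_name, seq_name)
```

Because sequence names are not guaranteed to be unique, the ambiguous set is worth inspecting before trusting the rewritten column.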
