I noticed that the seqName column in the TSV output from the Nextclade CLI seems to come in two different formats:
the output from GISAID’s FASTA “All sequences” package has the format: [Virus name]|[Collection date]|[Submission date]
the output from GISAID search results downloaded in FASTA format shows [Virus name]||[Accession ID]||[Collection date]
I’m guessing this might be an artefact of how the data comes out of GISAID? If so it would be good to confirm, so others are aware and don’t waste time on this. My starting assumption was seqName would be a consistent key to the GISAID record, regardless of source.
Nextclade passes sequence names through as is. We never attempt to parse them, because they are a complete and utter mess, as you see.
That’s definitely a problem with the FASTA source. Discrepancies of this kind happen on GISAID and in other genomic databases.
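If you need the individual fields anyway, you can split the names yourself after Nextclade has run. Below is a minimal sketch that handles only the two layouts described in the question. The function name `parse_gisaid_seq_name` and the field keys are made up for illustration, the sample headers are placeholders, and real GISAID names may well contain layouts this does not cover.

```python
from typing import Dict, Optional


def parse_gisaid_seq_name(seq_name: str) -> Dict[str, Optional[str]]:
    """Best-effort split of a seqName into named fields (hypothetical helper).

    Assumes only the two layouts mentioned in this thread:
      [Virus name]|[Collection date]|[Submission date]
      [Virus name]||[Accession ID]||[Collection date]
    """
    fields: Dict[str, Optional[str]] = {
        "virus_name": None,
        "accession_id": None,
        "collection_date": None,
        "submission_date": None,
    }
    if "||" in seq_name:
        # Search-result download: double-pipe separators.
        parts = seq_name.split("||")
        keys = ["virus_name", "accession_id", "collection_date"]
    else:
        # "All sequences" package: single-pipe separators.
        parts = seq_name.split("|")
        keys = ["virus_name", "collection_date", "submission_date"]
    for key, value in zip(keys, parts):
        fields[key] = value.strip() or None
    return fields


# Placeholder headers, not real records.
package_style = parse_gisaid_seq_name("hCoV-19/Country/XYZ-1/2021|2021-01-05|2021-01-20")
search_style = parse_gisaid_seq_name("hCoV-19/Country/XYZ-1/2021||EPI_ISL_0000000||2021-01-05")
```

Both calls give the same `virus_name` and `collection_date`, so those two fields could serve as a rough join key across the two sources, keeping in mind that names are not guaranteed to be unique.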
Note also that a sequence name, no matter how it is laid out, is not guaranteed to be unique. There can be duplicate names (in fact, there are many). Internally, Nextclade relies on the index of the entry in the FASTA file to identify samples uniquely and to ensure that rows, entries and elements in the output TSV, FASTA and JSON files refer to the same sample, no matter how it is named.
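To check whether your own input has duplicate names, something like the following could be used. This is a sketch of the idea only, not Nextclade's actual code: it assigns each FASTA record a zero-based position and reports names that occur more than once. The helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def index_fasta_headers(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (position, name) for every '>' header line, in file order."""
    records: List[Tuple[int, str]] = []
    for line in lines:
        if line.startswith(">"):
            # The position is simply how many headers came before this one.
            records.append((len(records), line[1:].strip()))
    return records


def duplicated_names(records: List[Tuple[int, str]]) -> Dict[str, int]:
    """Map each name that appears more than once to its number of occurrences."""
    counts = Counter(name for _, name in records)
    return {name: n for name, n in counts.items() if n > 1}


# Tiny in-memory example; for a real file, pass an open file handle instead.
example_lines = [">a", "ACGT", ">b", "AC", ">a", "GG"]
example_records = index_fasta_headers(example_lines)
example_duplicates = duplicated_names(example_records)
```

Here the name `a` appears at positions 0 and 2, so only the position tells the two records apart.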
However, this does not help by itself if you want to correlate the inputs and outputs of Nextclade. Due to parallel processing, the order of outputs can differ from the order of inputs. If you suspect there are problems with uniqueness and you want to correlate inputs and outputs, it is a good idea to add the --in-order flag so that Nextclade preserves the same order in outputs as in inputs (with a small runtime performance penalty).
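As a rough illustration of correlating by position, the sketch below pairs input names with TSV rows one by one and fails loudly if they disagree. It assumes the run used --in-order, that the TSV has one row per input sequence, and that the seqName column holds the name exactly as it appears in the FASTA header. The function name is hypothetical and the sample TSV is made up.

```python
import csv
import io
from typing import Dict, List, Tuple


def pair_inputs_with_rows(
    input_names: List[str], tsv_text: str
) -> List[Tuple[int, str, Dict[str, str]]]:
    """Pair the i-th input name with the i-th TSV row, checking that names agree."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if len(rows) != len(input_names):
        raise ValueError(
            f"expected {len(input_names)} rows, found {len(rows)}"
        )
    paired = []
    for position, (name, row) in enumerate(zip(input_names, rows)):
        if row["seqName"] != name:
            # Order was not preserved, or the names were altered somewhere.
            raise ValueError(
                f"row {position}: seqName {row['seqName']!r} != input {name!r}"
            )
        paired.append((position, name, row))
    return paired


# Made-up TSV content with a duplicated name; the 'clade' values are placeholders.
example_tsv = "seqName\tclade\na\tX\nb\tY\na\tZ\n"
example_pairs = pair_inputs_with_rows(["a", "b", "a"], example_tsv)
```

Because the pairing is positional, the two records named `a` stay distinct even though their names collide.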