What happens when 2 genomes have the same name but different sequences?

dlu · June 28, 2021, 8:11pm

My understanding is that Nextstrain removes duplicates from the input datasets by comparing both the sequence names and the sequences themselves. What happens if 2 records have the same name but different sequences? Will the run fail, or just throw a warning?

A related question: Nextstrain strips the “hCoV-19/” prefix from sequence names. Does that happen for all input data, and does it happen before de-duplication?

We use the latest master branch.

Thank you for such a useful tool and great support to the users!

dlu · June 29, 2021, 8:07pm

Answering myself: if 2 sequences have both an identical name and an identical sequence, the duplicate is removed quietly. If 2 sequences have an identical name but different sequences, only the first one is kept, and if the user sets error_on_duplicates=True, the names are written into a record.
Reference: ncov code
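
For anyone who wants the logic spelled out, here is a minimal sketch of the behavior described above. It is not the ncov code itself: the function name `drop_duplicates`, the exception `DuplicateNameError`, and the choice to surface the recorded names by raising an exception are all assumptions made for illustration. Only the `error_on_duplicates` flag comes from the answer above.

```python
import hashlib


class DuplicateNameError(Exception):
    """Hypothetical exception carrying names that map to different sequences."""


def drop_duplicates(records, error_on_duplicates=False):
    """Keep the first record per name and note names whose later copies differ.

    `records` is an iterable of (name, sequence) string pairs.
    Returns (kept_records, conflicting_names).
    """
    hash_by_name = {}
    kept = []
    conflicting = []

    for name, sequence in records:
        digest = hashlib.sha256(sequence.encode("utf-8")).hexdigest()

        if name in hash_by_name:
            # Same name seen before: never keep the later copy.
            # If the sequence differs from the first one, record the name.
            if hash_by_name[name] != digest and name not in conflicting:
                conflicting.append(name)
            continue

        hash_by_name[name] = digest
        kept.append((name, sequence))

    # Assumption: the recorded names are reported by raising an exception.
    if conflicting and error_on_duplicates:
        raise DuplicateNameError(", ".join(conflicting))

    return kept, conflicting


records = [
    ("A", "ACGT"),
    ("A", "ACGT"),  # exact duplicate: dropped quietly
    ("B", "GGCC"),
    ("B", "GGCA"),  # same name, different sequence: first copy kept
]
kept, conflicting = drop_duplicates(records)
print(kept)         # [('A', 'ACGT'), ('B', 'GGCC')]
print(conflicting)  # ['B']
```

Hashing the sequences means only a short digest per name is held in memory for the comparison, which matters for large inputs. Comparing the raw strings would give the same result.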
