Error in augur tree: "Duplicated sequence name"

sacundim · January 31, 2022, 8:11am

Hi there! I’m running a build that I’ve more or less copy-pasted and adapted from ncov/open (= GenBank), using the prebuilt intermediate aligned.fasta.xz file. My builds.yaml is here:

covid-19-puerto-rico-nextstrain/builds.yaml at master · sacundim/covid-19-puerto-rico-nextstrain · GitHub

This build ran fine on January 23, and I haven’t made any changes to my repo since then, but I nevertheless get a failure during the augur tree step now (see below). I also tried applying the latest commits from the upstream ncov repo master branch since my build (I was up to commit 983f7953), but it didn’t help either.

Here is some of the log output from my nextstrain build command. There’s of course a lot more; I’ve tried, a bit blindly, to pick out the most obviously relevant bits. If there’s something else I need to look for, I can do it:

[batch]         augur tree             --alignment results/puerto-rico/filtered.fasta             --tree-builder-args '-ninit 10 -n 4'             --exclude-sites defaults/sites_ignored_for_tree_topology.txt             --output results/puerto-rico/tree_raw.nwk             --nthreads 4 2>&1 | tee logs/tree_puerto-rico.txt
[batch]         
[batch]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] Logfile logs/tree_puerto-rico.txt:
[batch] ERROR: Shell exited 2 when running: iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4 > results/puerto-rico/masked_filtered-delim.iqtree.log
[batch] Command output was:
[batch]   ERROR: Please rename sequences listed above!
[batch] 7 masking sites read from defaults/sites_ignored_for_tree_topology.txt
[batch] Building a tree via:
[batch]         iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4 > results/puerto-rico/masked_filtered-delim.iqtree.log
[batch]         Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
[batch]         Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300
[batch] ERROR: TREE BUILDING FAILED
[batch] Please see the log file for more details: results/puerto-rico/masked_filtered-delim.iqtree.log

And here’s an excerpt from the masked_filtered-delim.iqtree.log file. It goes on for some 45 more lines with more “Duplicated sequence name” errors:

IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Oct 22 2020
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    5138927a076046b4919b3790bfc7761b-2470140894 (AVX512, FMA3, 15 GB RAM)
Command: iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4
Seed:    282218 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Mon Jan 31 07:39:26 2022
Kernel:  AVX+FMA - 4 threads (4 CPU cores detected)

Reading alignment file results/puerto-rico/masked_filtered-delim.fasta ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
WARNING: 157 sites contain only gaps or ambiguous characters.
Alignment has 4349 sequences with 29903 columns, 20404 distinct patterns
3683 parsimony-informative, 3733 singleton sites, 22487 constant sites
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_AZ-CDC-LC0471031_DELIM-MSFKQCUMHEHDTGBYGIOI_2021
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_CT-CDC-LC0465544_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_CT-CDC-LC0467878_DELIM-MSFKQCUMHEHDTGBYGIOI_2021
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DC-CDC-LC0462811_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DC-CDC-LC0464641_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DE-CDC-LC0461229_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
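For anyone reading these names: the `_DELIM-..._` token appears to stand in for the `/` in the original strain name (compare `USA/DE-CDC-LC0461229/2022`, which shows up in the raw FASTA further down the thread). Here is a minimal Python sketch that pulls the duplicated names out of the IQ-TREE log and maps them back. The regex and the function name `duplicated_names` are my own assumptions inferred from the log lines above, not part of augur or IQ-TREE.

```python
import re

# Assumption: the placeholder always looks like "_DELIM-<uppercase letters>_"
# and always replaces a "/". This is inferred from the log lines in this
# thread, not taken from augur's documentation.
DELIM = re.compile(r"_DELIM-[A-Z]+_")
PREFIX = "ERROR: Duplicated sequence name "


def duplicated_names(log_lines):
    """Return the strain names reported as duplicated, with "/" restored."""
    names = []
    for line in log_lines:
        line = line.strip()
        if line.startswith(PREFIX):
            names.append(DELIM.sub("/", line[len(PREFIX):]))
    return names
```

Feeding it the lines of results/puerto-rico/masked_filtered-delim.iqtree.log should give a list of names that can be searched for directly in the metadata or the FASTA.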

trs · January 31, 2022, 6:30pm

Hi @sacundim, welcome! The errors you pulled out of the logs are probably the issue here. The first error contains:

ERROR: Please rename sequences listed above!

which I believe refers to the ERROR: Duplicated sequence name … errors in the other log file.

This build ran fine on January 23, and I haven’t made any changes to my repo since then, but I nevertheless get a failure during the augur tree step now

I suspect you’re seeing these errors now either because the input data changed (and now includes duplicate sequence names) or because the current duplicates happened not to be included in the previous run due to sampling.

sacundim · January 31, 2022, 8:21pm

The input data changes every day indeed, but the thing is that my input data is this:

inputs:
  - name: "open"
    metadata: "https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz"
    aligned: "https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz"
    skip_sanitize_metadata: true

…which is the same dataset that the canonical ncov/open builds use, e.g. this one:

auspice

…as documented here:

Overview of remote nCoV files (intermediate build assets) — SARS-CoV-2 Workflow documentation

So I don’t believe it’s the input data, it’s gotta be something with my own build’s configuration. (Which I mostly copy-pasted and modified from the canonical ones, but obviously I’ve got something wrong when I modified.)

trs · February 3, 2022, 5:37pm

I looked into this a bit more. There are actual duplicate sequence names, the ones you’re running into, in a copy of https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz that I downloaded. One example:

$ xzcat -T4 < aligned.fasta.xz | grep '^>' | grep -nF DE-CDC-LC0461229
3454031:>USA/DE-CDC-LC0461229/2022
3469494:>USA/DE-CDC-LC0461229/2022

$ sha256sum aligned.fasta.xz 
7137341cdd75befc5d36eb3fade7fcb6c00ea077ac5ce383b66d1e9b72b98cac  aligned.fasta.xz
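The grep above checks one name at a time. To list every duplicated name in one pass, something like the following Python sketch should work. It uses only the standard library; the function names `duplicate_fasta_names` and `duplicate_names_in_xz` are made up for this example, and it assumes the name is the first whitespace-separated token on each `>` line.

```python
import lzma
from collections import Counter


def duplicate_fasta_names(lines):
    """Return {name: count} for names that occur on more than one header line."""
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        # Skip sequence lines and bare ">" headers with no name.
        if line.startswith(">") and len(line.strip()) > 1
    )
    return {name: n for name, n in counts.items() if n > 1}


def duplicate_names_in_xz(path):
    """Scan an xz-compressed FASTA without decompressing it to disk."""
    # Text mode lets us iterate over the decompressed file line by line.
    with lzma.open(path, "rt") as handle:
        return duplicate_fasta_names(handle)
```

Calling `duplicate_names_in_xz("aligned.fasta.xz")` on the downloaded file would be the equivalent of the check above for all names at once, though it will take a while on a file of that size.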

If those sequences are selected during subsampling, then augur filter will dutifully pass all copies through. I tested this because I wasn’t sure what augur filter would do:

$ cat tmp.fasta
>one
A
>two
T
>three
C
>one
G
>four
N

$ cat meta.csv 
strain,
one,
two,
three,
four,

$ augur filter --sequences tmp.fasta --metadata meta.csv --exclude-all --include <(echo one; echo two) --output-sequences out.fasta
2 strains were dropped during filtering
        4 of these were dropped by `--exclude-all`
        2 strains were added back because they were in /dev/fd/63
2 strains passed all filters

$ cat out.fasta 
>one
A
>two
T
>one
G

So the issue is in the upstream data, but the workflow could maybe more gracefully handle this as well.
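As a rough idea of what more graceful handling could look like, here is a Python sketch that keeps the first record for each name and warns when a later copy carries a different sequence. To be clear, this is not the ncov workflow's code; `read_fasta` and `keep_first_by_name` are hypothetical helpers written for this post, and keeping the first copy is an assumption about the desired policy.

```python
import hashlib
import sys


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # The name is the first token of the header (empty if missing).
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_first_by_name(records):
    """Yield the first record seen for each name and skip later copies."""
    seen = {}
    for name, seq in records:
        digest = hashlib.sha256(seq.encode()).hexdigest()
        if name not in seen:
            seen[name] = digest
            yield name, seq
        elif seen[name] != digest:
            # Same name, different sequence: worth flagging to the user.
            print(f"WARNING: {name} repeated with a different sequence; "
                  "keeping the first copy", file=sys.stderr)
```

Run over the tmp.fasta example above, this would keep `one` with sequence `A`, drop the second `one` with sequence `G`, and print a warning about the mismatch.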

trs · February 3, 2022, 6:02pm

I’ve opened an issue in our ncov-ingest repo, which is what produces the aligned.fasta.xz used above.

corneliusroemer · February 4, 2022, 12:32am

Do we not use a seq dedup script in ncov?

We use this in ncov-simple and it works well. It should serve as a workaround for OP until we clean up the open sequences.

github.com

neherlab/ncov-simple/blob/master/scripts/combine-and-dedup-fastas.py

import argparse
from Bio import SeqIO
import hashlib
import sys
import textwrap


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Combine and dedup FASTAs",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument('--input', type=str,  nargs="+", metavar="FASTA", required=True, help="input FASTAs")
    parser.add_argument('--output', type=str, metavar="FASTA", required=True, help="output FASTA")
    args = parser.parse_args()

    sequence_hash_by_name = {}
    duplicate_strains = set()

This file has been truncated.

sacundim · February 4, 2022, 3:17am

Looks to me like the ncov workflow has similar code but it’s only used on the multiple inputs code path, if the comment in line 216 isn’t lying:

github.com

nextstrain/ncov/blob/44223bcd5249e53eb9a67fb2cc9f2efef77ff530/workflow/snakemake_rules/main_workflow.smk#L216

      
        
            # Check format strings that haven't been resolved.
            if re.search(r'\{.+\}', value):
                raise Exception(f"The parameters for the subsampling scheme '{wildcards.subsample}' of build '{wildcards.build_name}' reference build attributes that are not defined in the configuration file: '{value}'. Add these build attributes to the appropriate configuration file and try again.")

        return value

    return _get_setting


rule combine_sequences_for_subsampling:
    # Similar to rule combine_input_metadata, this rule should only be run if multiple inputs are being used (i.e. multiple origins)
    message:
        """
        Combine and deduplicate aligned FASTAs from multiple origins in preparation for subsampling.
        """
    input:
        lambda w: [_get_path_for_input("aligned", origin) for origin in config.get("inputs", {})]
    output:
        "results/combined_sequences_for_subsampling.fasta.xz"
    benchmark:
        "benchmarks/combine_sequences_for_subsampling.txt"
The function that looks relevant in sanitize_sequences.py:

github.com

nextstrain/ncov/blob/44223bcd5249e53eb9a67fb2cc9f2efef77ff530/scripts/sanitize_sequences.py#L38

      
        
            
            
        # The name field stores the same information for a simple FASTA input, so we need to override its value, too.
        sequence.name = sequence.id

        # Do not keep additional information that follows the sequence identifier.
        sequence.description = ""

        yield sequence


def drop_duplicate_sequences(sequences, error_on_duplicates=False):
    """Identify and drop duplicate sequences from the given iterator.

    Parameters
    ----------
    sequences : Iterator

    Yields
    ------
    Bio.SeqIO.Seq :
        Unique sequence records

Your ncov-simple’s subsampling.smk has a similar comment on line 140.

sacundim · February 4, 2022, 9:10am

Well, I’ve found a workaround that forces the execution of the rule combine_sequences_for_subsampling, and now my build is past augur tree where it has been failing.

It doesn’t look like it’d be hard for somebody who isn’t stumbling around this toolchain and codebase like I am to add a config parameter to switch the deduplication on and off. The combine_sequences_for_subsampling rule on the ncov/open/aligned.fasta.xz file took 33 minutes in an AWS Fargate X86 container with 4 vCPUs, for reference.

sacundim · February 9, 2022, 4:42am

After the work described in the GitHub ticket above, my job now succeeds without a workaround. Thanks guys!
