Error in augur tree: "Duplicated sequence name"

Hi there! I’m running a build that I’ve more or less copy-pasted and adapted from ncov/open (= GenBank), using the prebuilt intermediate aligned.fasta.xz file. My build.yaml is here:

This build ran fine on January 23, and I haven’t made any changes to my repo since then, but I nevertheless get a failure during the augur tree step now (see below). I also tried applying the latest commits from the upstream ncov repo’s master branch that have landed since my build (I was up to commit 983f7953), but that didn’t help either.

Here’s some of the log output from my nextstrain build command. There’s of course a lot more; I’ve tried, somewhat blindly, to pick out the most obviously relevant bits. If there’s something else I should look for, I can dig it up:

[batch]         augur tree             --alignment results/puerto-rico/filtered.fasta             --tree-builder-args '-ninit 10 -n 4'             --exclude-sites defaults/sites_ignored_for_tree_topology.txt             --output results/puerto-rico/tree_raw.nwk             --nthreads 4 2>&1 | tee logs/tree_puerto-rico.txt
[batch]         
[batch]         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] Logfile logs/tree_puerto-rico.txt:
[batch] ERROR: Shell exited 2 when running: iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4 > results/puerto-rico/masked_filtered-delim.iqtree.log
[batch] Command output was:
[batch]   ERROR: Please rename sequences listed above!
[batch] 7 masking sites read from defaults/sites_ignored_for_tree_topology.txt
[batch] Building a tree via:
[batch]         iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4 > results/puerto-rico/masked_filtered-delim.iqtree.log
[batch]         Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
[batch]         Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300
[batch] ERROR: TREE BUILDING FAILED
[batch] Please see the log file for more details: results/puerto-rico/masked_filtered-delim.iqtree.log

And here’s an excerpt from the masked_filtered-delim.iqtree.log file; it goes on for some 45 more lines of “Duplicated sequence name” errors:

IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Oct 22 2020
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    5138927a076046b4919b3790bfc7761b-2470140894 (AVX512, FMA3, 15 GB RAM)
Command: iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4
Seed:    282218 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Mon Jan 31 07:39:26 2022
Kernel:  AVX+FMA - 4 threads (4 CPU cores detected)

Reading alignment file results/puerto-rico/masked_filtered-delim.fasta ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
WARNING: 157 sites contain only gaps or ambiguous characters.
Alignment has 4349 sequences with 29903 columns, 20404 distinct patterns
3683 parsimony-informative, 3733 singleton sites, 22487 constant sites
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_AZ-CDC-LC0471031_DELIM-MSFKQCUMHEHDTGBYGIOI_2021
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_CT-CDC-LC0465544_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_CT-CDC-LC0467878_DELIM-MSFKQCUMHEHDTGBYGIOI_2021
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DC-CDC-LC0462811_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DC-CDC-LC0464641_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DE-CDC-LC0461229_DELIM-MSFKQCUMHEHDTGBYGIOI_2022

Hi @sacundim, welcome! The errors you pulled out of the logs are probably the issue here. The first error contains:

ERROR: Please rename sequences listed above!

which I believe refers to the ERROR: Duplicated sequence name … errors in the other log file.

This build ran fine on January 23, and I haven’t made any changes to my repo since then, but I nevertheless get a failure during the augur tree step now

I suspect you’re seeing these errors now either because the input data changed (and now includes duplicate sequence names) or because the current duplicates somehow weren’t included in the previous run due to subsampling.

The input data does indeed change every day, but the thing is that my input is this:

inputs:
  - name: "open"
    metadata: "https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz"
    aligned: "https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz"
    skip_sanitize_metadata: true

…which is the same dataset the canonical ncov/open builds use, e.g. this one:

…as documented here:

So I don’t believe it’s the input data; it’s got to be something in my own build’s configuration. (Which I mostly copy-pasted and adapted from the canonical ones, but evidently I got something wrong when I modified it.)

I looked into this a bit more. There are actual duplicate sequence names, the ones you’re running into, in a copy of https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz that I downloaded. One example:

$ xzcat -T4 < aligned.fasta.xz | grep '^>' | grep -nF DE-CDC-LC0461229
3454031:>USA/DE-CDC-LC0461229/2022
3469494:>USA/DE-CDC-LC0461229/2022

$ sha256sum aligned.fasta.xz 
7137341cdd75befc5d36eb3fade7fcb6c00ea077ac5ce383b66d1e9b72b98cac  aligned.fasta.xz
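If anyone wants to enumerate all of the affected names rather than spot-check a single one, the same grep can be piped a bit further (this is just standard sort/uniq over the headers, nothing workflow-specific):

$ xzcat -T4 < aligned.fasta.xz | grep '^>' | sort | uniq -d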

If those sequences are selected during subsampling, then augur filter will dutifully pass all copies through. I tested this because I wasn’t sure what augur filter would do:

$ cat tmp.fasta
>one
A
>two
T
>three
C
>one
G
>four
N

$ cat meta.csv 
strain,
one,
two,
three,
four,

$ augur filter --sequences tmp.fasta --metadata meta.csv --exclude-all --include <(echo one; echo two) --output-sequences out.fasta
2 strains were dropped during filtering
        4 of these were dropped by `--exclude-all`
        2 strains were added back because they were in /dev/fd/63
2 strains passed all filters

$ cat out.fasta 
>one
A
>two
T
>one
G

So the issue is in the upstream data, but the workflow could perhaps handle this more gracefully as well.
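In the meantime, a rough stopgap for anyone blocked by this: strip duplicate records out of the downloaded alignment locally before handing it to the workflow. The sketch below is plain awk over the FASTA, keeping the first record for each header; it’s not part of the ncov workflow, so treat it accordingly:

# keep only the first occurrence of each header (and its sequence lines)
$ xzcat -T4 < aligned.fasta.xz \
    | awk '/^>/ { keep = !seen[$0]++ } keep' \
    | xz -T4 > aligned.dedup.fasta.xz

The awk part flips keep to 0 whenever a header it has already seen comes around again, so the second copy of a record (header plus its sequence lines) is dropped.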


I’ve opened an issue in our ncov-ingest repo, which is what produces the aligned.fasta.xz used above.


Do we not use a seq dedup script in ncov?

We use this in ncov-simple and it works well; it should be a workaround for the OP until we clean up the open sequences.

Looks to me like the ncov workflow has similar code, but it’s only used on the multiple-inputs code path, if the comment on line 216 isn’t lying:

The function that looks relevant in sanitize_sequences.py:

Your ncov-simple’s subsampling.smk has a similar comment on line 140.

Well, I’ve found a workaround that forces the execution of the rule combine_sequences_for_subsampling, and now my build is past the augur tree step where it had been failing.

It doesn’t look like it would be hard, for somebody who isn’t stumbling around this toolchain and codebase the way I am, to add a config parameter to switch the deduplication on and off. For reference, the combine_sequences_for_subsampling rule took 33 minutes on the ncov/open/aligned.fasta.xz file in an AWS Fargate x86 container with 4 vCPUs.
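
One more note in case it helps anyone hitting the same thing: before re-running augur tree, a quick sanity check that the filtered alignment no longer contains duplicate names is just grep/sort/uniq over the headers (path taken from the augur tree command above); it should print nothing if the deduplication worked:

$ grep '^>' results/puerto-rico/filtered.fasta | sort | uniq -d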


After the work described in the GitHub ticket above, my job now succeeds without a workaround. Thanks guys!
