Hi there! I’m running a build that I’ve more or less copy-pasted and adapted from ncov/open (= GenBank), using the prebuilt intermediate aligned.fasta.xz file. My build.yaml is here:
This build ran fine on January 23, and I haven’t made any changes to my repo since then, but I nevertheless get a failure during the augur tree step now (see below). I also tried applying the latest commits from the upstream ncov repo master branch since my build (I was up to commit 983f7953), but it didn’t help either.
Here is some of the log output from my nextstrain build command. There’s of course a lot more; I’ve tried, a bit blindly, to guess the most obviously relevant bits. If there’s something else I should look for, I can do that:
[batch] augur tree --alignment results/puerto-rico/filtered.fasta --tree-builder-args '-ninit 10 -n 4' --exclude-sites defaults/sites_ignored_for_tree_topology.txt --output results/puerto-rico/tree_raw.nwk --nthreads 4 2>&1 | tee logs/tree_puerto-rico.txt
[batch]
[batch] (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[batch] Logfile logs/tree_puerto-rico.txt:
[batch] ERROR: Shell exited 2 when running: iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4 > results/puerto-rico/masked_filtered-delim.iqtree.log
[batch] Command output was:
[batch] ERROR: Please rename sequences listed above!
[batch] 7 masking sites read from defaults/sites_ignored_for_tree_topology.txt
[batch] Building a tree via:
[batch] iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4 > results/puerto-rico/masked_filtered-delim.iqtree.log
[batch] Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
[batch] Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300
[batch] ERROR: TREE BUILDING FAILED
[batch] Please see the log file for more details: results/puerto-rico/masked_filtered-delim.iqtree.log
And here’s an excerpt from the masked_filtered-delim.iqtree.log file; it goes on for some 45 more lines with more “Duplicated sequence name” errors:
IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Oct 22 2020
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.
Host: 5138927a076046b4919b3790bfc7761b-2470140894 (AVX512, FMA3, 15 GB RAM)
Command: iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/puerto-rico/masked_filtered-delim.fasta -m GTR -ninit 10 -n 4
Seed: 282218 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Mon Jan 31 07:39:26 2022
Kernel: AVX+FMA - 4 threads (4 CPU cores detected)
Reading alignment file results/puerto-rico/masked_filtered-delim.fasta ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
WARNING: 157 sites contain only gaps or ambiguous characters.
Alignment has 4349 sequences with 29903 columns, 20404 distinct patterns
3683 parsimony-informative, 3733 singleton sites, 22487 constant sites
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_AZ-CDC-LC0471031_DELIM-MSFKQCUMHEHDTGBYGIOI_2021
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_CT-CDC-LC0465544_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_CT-CDC-LC0467878_DELIM-MSFKQCUMHEHDTGBYGIOI_2021
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DC-CDC-LC0462811_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DC-CDC-LC0464641_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
ERROR: Duplicated sequence name USA_DELIM-MSFKQCUMHEHDTGBYGIOI_DE-CDC-LC0461229_DELIM-MSFKQCUMHEHDTGBYGIOI_2022
Hi @sacundim, welcome! The errors you pulled out of the logs are probably the issue here. The first error contains:
ERROR: Please rename sequences listed above!
which I believe refers to the ERROR: Duplicated sequence name … errors in the other log file.
This build ran fine on January 23, and I haven’t made any changes to my repo since then, but I nevertheless get a failure during the augur tree step now
I suspect you’re seeing these errors now either because the input data changed (and now includes duplicate sequence names) or because the current duplicates happened not to be included in the previous run due to sampling.
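One way to check whether an input file really contains duplicate sequence names is to count the FASTA headers directly. The following is a minimal sketch, not part of augur or the ncov workflow: the helper names `open_fasta` and `duplicate_names` are made up for illustration, and it assumes the sequence name is the first whitespace-delimited token after `>`.

```python
import lzma
from collections import Counter


def open_fasta(path):
    """Open a FASTA file as text, handling .xz compression (hypothetical helper)."""
    if str(path).endswith(".xz"):
        return lzma.open(path, "rt")
    return open(path, "rt")


def duplicate_names(lines):
    """Return {name: count} for every sequence name that appears more than once.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return {name: n for name, n in counts.items() if n > 1}


# Toy input with one repeated name.
toy = [">one", "A", ">two", "T", ">three", "C", ">one", "G", ">four", "N"]
print(duplicate_names(toy))  # {'one': 2}
```

For a real file, something like `with open_fasta("aligned.fasta.xz") as handle: print(duplicate_names(handle))` would stream through the headers without loading the sequences into memory.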
So I don’t believe it’s the input data, it’s gotta be something with my own build’s configuration. (Which I mostly copy-pasted and modified from the canonical ones, but obviously I’ve got something wrong when I modified.)
If those sequences are selected during subsampling, then augur filter will dutifully pass all copies through. I tested this because I wasn’t sure what augur filter would do:
$ cat tmp.fasta
>one
A
>two
T
>three
C
>one
G
>four
N
$ cat meta.csv
strain,
one,
two,
three,
four,
$ augur filter --sequences tmp.fasta --metadata meta.csv --exclude-all --include <(echo one; echo two) --output-sequences out.fasta
2 strains were dropped during filtering
4 of these were dropped by `--exclude-all`
2 strains were added back because they were in /dev/fd/63
2 strains passed all filters
$ cat out.fasta
>one
A
>two
T
>one
G
So the issue is in the upstream data, but the workflow could perhaps handle this more gracefully as well.
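As an illustration of what handling this more gracefully could look like, here is a minimal sketch that drops repeated records from a FASTA stream before tree building. It is not the workflow's own deduplication code; `drop_duplicate_records` is a hypothetical name, and keeping the first copy is an arbitrary assumption (in the toy example above the two `one` records carry different sequences, so nothing here decides which copy is correct).

```python
def drop_duplicate_records(lines):
    """Yield FASTA lines, skipping any record whose name was already seen.

    Assumptions: the name is the first whitespace-delimited token after '>',
    and the first occurrence of each name is the one to keep.
    """
    seen = set()
    keep = True  # lines before the first header are passed through unchanged
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            keep = name not in seen
            seen.add(name)
        if keep:
            yield line


# Same toy records as the tmp.fasta example: 'one' appears twice.
toy = [">one", "A", ">two", "T", ">three", "C", ">one", "G", ">four", "N"]
deduplicated = list(drop_duplicate_records(toy))
print(deduplicated)
```

Because the function works on any iterable of lines, it can be applied to an open file handle and the result written out line by line, so the full alignment never has to fit in memory.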
Well, I’ve found a workaround that forces the execution of the rule combine_sequences_for_subsampling, and now my build is past augur tree where it has been failing.
It doesn’t look like it’d be hard for somebody who isn’t stumbling around this toolchain and codebase like I am to add a config parameter to switch the deduplication on and off. The combine_sequences_for_subsampling rule on the ncov/open/aligned.fasta.xz file took 33 minutes in an AWS Fargate X86 container with 4 vCPUs, for reference.