Hello all,

I’m building a tree with global data from GISAID, but got an error during the tree building step:

Error in rule tree:

jobid: 5

output: results/spike_global/tree_raw.nwk

log: logs/tree_spike_global.txt (check log file(s) for error message)

shell:`augur tree --alignment results/spike_global/aligned.fasta --tree-builder-args '-ninit 10 -n 4' --output results/spike_global/tree_raw.nwk --nthreads 4 2>&1 | tee logs/tree_spike_global.txt`

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/tree_spike_global.txt:

Building a tree via:

iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/spike_global/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/spike_global/aligned-delim.iqtree.log

Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.

Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300

ERROR: TREE BUILDING FAILED

Please see the log file for more details: results/spike_global/aligned-delim.iqtree.log

Building original tree took 4101.029722690582 seconds

Removing output files of failed job tree since they might be corrupted:

results/spike_global/tree_raw.nwk

Shutting down, this might take some time.

Exiting because a job execution failed. Look above for error message

However, when I look at the results/spike_global/aligned-delim.iqtree.log file, it looks like the tree was built:

Alignment was printed to results/spike_global/aligned-delim.fasta.uniqueseq.phy

For your convenience alignment with unique sequences printed to results/spike_global/aligned-delim.fasta.uniqueseq.phy

Create initial parsimony tree by phylogenetic likelihood library (PLL)… 28.085 seconds

NOTE: 392 MB RAM (0 GB) is required!

Estimate model parameters (epsilon = 0.500)

- Initial log-likelihood: -60288.255
- Current log-likelihood: -52884.715

Optimal log-likelihood: -52884.294

Rate parameters: A-C: 0.36331 A-G: 0.86950 A-T: 0.21617 C-G: 0.39216 C-T: 2.05273 G-T: 1.00000

Base frequencies: A: 0.279 C: 0.209 G: 0.206 T: 0.305

Parameters optimization took 2 rounds (70.383 sec)

Computing ML distances based on estimated model parameters… 2837.491 sec

Computing BIONJ tree…

1474.574 seconds

Log-likelihood of BIONJ tree: -52448.971

INITIALIZING CANDIDATE TREE SET

Generating 8 parsimony trees… 224.761 second

Computing log-likelihood of 8 initial trees … 33.582 seconds

Current best score: -52448.971

Do NNI search on 2 best initial trees

Estimate model parameters (epsilon = 0.500)

BETTER TREE FOUND at iteration 1: -52310.505

Finish initializing candidate tree set (12)

Current best tree score: -52310.505 / CPU time: 395.445

Number of iterations: 2

OPTIMIZING CANDIDATE TREE SET

TREE SEARCH COMPLETED AFTER 4 ITERATIONS / Time: 1h:23m:56s

FINALIZING TREE SEARCH

Performs final model parameters optimization

Estimate model parameters (epsilon = 0.050)

- Initial log-likelihood: -52310.505

Optimal log-likelihood: -52310.497

Rate parameters: A-C: 0.36195 A-G: 0.89554 A-T: 0.20011 C-G: 0.33955 C-T: 2.09522 G-T: 1.00000

Base frequencies: A: 0.279 C: 0.209 G: 0.206 T: 0.305

Parameters optimization took 1 rounds (11.039 sec)

BEST SCORE FOUND : -52310.497

Total tree length: 1.322

Total number of iterations: 4

CPU time used for tree search: 602.458 sec (0h:10m:2s)

Wall-clock time used for tree search: 605.088 sec (0h:10m:5s)

Total CPU time used: 5036.391 sec (1h:23m:56s)

Total wall-clock time used: 5048.890 sec (1h:24m:8s)

Analysis results written to:

IQ-TREE report: results/spike_global/aligned-delim.fasta.iqtree

Maximum-likelihood tree: results/spike_global/aligned-delim.fasta.treefile

Likelihood distances: results/spike_global/aligned-delim.fasta.mldist

Screen log file: results/spike_global/aligned-delim.fasta.log

Date and Time: Mon Apr 5 13:34:14 2021

Earlier in the same log file there’s quite a bit of text that I’m not sure is relevant. It relates to renaming problematic strain names, e.g.

Lu’an_DELIM-QHMJMXEJNNOYGPIHOOFN_133_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_1069210 → Lu_an_DELIM-QHMJMXEJNNOYGPIHOOFN_133_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_1069210

…followed by additional output about highly gapped sequences (this appears for quite a few strains; just one is listed here):

Gap/Ambiguity Composition p-value

1 Netherlands_DELIM-QHMJMXEJNNOYGPIHOOFN_Oss_1363500_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_413581 87.23% failed 0.00%

…and finally text that lists duplicate strains (just one listed here for brevity):

NOTE: Netherlands_DELIM-QHMJMXEJNNOYGPIHOOFN_Tilburg_1363354_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_413586 is identical to Netherlands_DELIM-QHMJMXEJNNOYGPIHOOFN_Oss_1363500_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_413581 but kept for subsequent analysis

The warnings related to gaps are expected; I’m trying to create a spike-specific build that works by masking the majority of the sequence. The resulting .phy alignment file looks fine, and the `.iqtree` file looks like it contains tree info, including a tree in Newick format. But I’m not sure why the snakemake rule fails. Any ideas? Thanks!
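
In case it helps with debugging, here is a minimal sketch of how the tree file named in the IQ-TREE log could be sanity-checked. The helper names (`newick_looks_complete`, `check_tree_file`) are made up for illustration and are not part of augur or IQ-TREE. The check is deliberately crude: it only tests for a trailing semicolon and balanced parentheses, and it assumes no quoted labels containing parentheses.

```python
from pathlib import Path


def newick_looks_complete(text: str) -> bool:
    """Crude Newick check: ends with ';' and parentheses are balanced.

    Assumption: labels are unquoted and contain no parentheses.
    """
    text = text.strip()
    if not text.endswith(";"):
        return False
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


def check_tree_file(path: str) -> str:
    """Return 'missing', 'empty', 'malformed', or 'ok' for a tree file."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    text = p.read_text()
    if not text.strip():
        return "empty"
    return "ok" if newick_looks_complete(text) else "malformed"


if __name__ == "__main__":
    # Path copied from the "Maximum-likelihood tree" line of the IQ-TREE log.
    print(check_tree_file("results/spike_global/aligned-delim.fasta.treefile"))
```

A result of `ok` would only show that IQ-TREE wrote a complete-looking tree; it would not explain why the snakemake rule reports a failure.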