Hello all,
I’m building a tree with global data from GISAID, but I get an error during the tree-building step:
Error in rule tree:
jobid: 5
output: results/spike_global/tree_raw.nwk
log: logs/tree_spike_global.txt (check log file(s) for error message)
shell:augur tree --alignment results/spike_global/aligned.fasta --tree-builder-args '-ninit 10 -n 4' --output results/spike_global/tree_raw.nwk --nthreads 4 2>&1 | tee logs/tree_spike_global.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/tree_spike_global.txt:
Building a tree via:
iqtree -ninit 2 -n 2 -me 0.05 -nt 4 -s results/spike_global/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/spike_global/aligned-delim.iqtree.log
Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
Mol. Biol. Evol., 32:268-274.
ERROR: TREE BUILDING FAILED
Please see the log file for more details: results/spike_global/aligned-delim.iqtree.log
Building original tree took 4101.029722690582 seconds
Removing output files of failed job tree since they might be corrupted:
results/spike_global/tree_raw.nwk
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
However when I look at the results/spike_global/aligned-delim.iqtree.log file, it looks like the tree was built:
Alignment was printed to results/spike_global/aligned-delim.fasta.uniqueseq.phy
For your convenience alignment with unique sequences printed to results/spike_global/aligned-delim.fasta.uniqueseq.phy
Create initial parsimony tree by phylogenetic likelihood library (PLL)… 28.085 seconds
NOTE: 392 MB RAM (0 GB) is required!
Estimate model parameters (epsilon = 0.500)
- Initial log-likelihood: -60288.255
- Current log-likelihood: -52884.715
Optimal log-likelihood: -52884.294
Rate parameters: A-C: 0.36331 A-G: 0.86950 A-T: 0.21617 C-G: 0.39216 C-T: 2.05273 G-T: 1.00000
Base frequencies: A: 0.279 C: 0.209 G: 0.206 T: 0.305
Parameters optimization took 2 rounds (70.383 sec)
Computing ML distances based on estimated model parameters… 2837.491 sec
Computing BIONJ tree… 1474.574 seconds
Log-likelihood of BIONJ tree: -52448.971
INITIALIZING CANDIDATE TREE SET
Generating 8 parsimony trees… 224.761 second
Computing log-likelihood of 8 initial trees … 33.582 seconds
Current best score: -52448.971
Do NNI search on 2 best initial trees
Estimate model parameters (epsilon = 0.500)
BETTER TREE FOUND at iteration 1: -52310.505
Finish initializing candidate tree set (12)
Current best tree score: -52310.505 / CPU time: 395.445
Number of iterations: 2
OPTIMIZING CANDIDATE TREE SET
TREE SEARCH COMPLETED AFTER 4 ITERATIONS / Time: 1h:23m:56s
FINALIZING TREE SEARCH
Performs final model parameters optimization
Estimate model parameters (epsilon = 0.050)
- Initial log-likelihood: -52310.505
Optimal log-likelihood: -52310.497
Rate parameters: A-C: 0.36195 A-G: 0.89554 A-T: 0.20011 C-G: 0.33955 C-T: 2.09522 G-T: 1.00000
Base frequencies: A: 0.279 C: 0.209 G: 0.206 T: 0.305
Parameters optimization took 1 rounds (11.039 sec)
BEST SCORE FOUND : -52310.497
Total tree length: 1.322
Total number of iterations: 4
CPU time used for tree search: 602.458 sec (0h:10m:2s)
Wall-clock time used for tree search: 605.088 sec (0h:10m:5s)
Total CPU time used: 5036.391 sec (1h:23m:56s)
Total wall-clock time used: 5048.890 sec (1h:24m:8s)
Analysis results written to:
IQ-TREE report: results/spike_global/aligned-delim.fasta.iqtree
Maximum-likelihood tree: results/spike_global/aligned-delim.fasta.treefile
Likelihood distances: results/spike_global/aligned-delim.fasta.mldist
Screen log file: results/spike_global/aligned-delim.fasta.log
Date and Time: Mon Apr 5 13:34:14 2021
Earlier in the same log file there is quite a bit of text that I’m not sure is relevant. Some of it relates to renaming problematic strain names, e.g.
Lu’an_DELIM-QHMJMXEJNNOYGPIHOOFN_133_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_1069210 → Lu_an_DELIM-QHMJMXEJNNOYGPIHOOFN_133_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_1069210
…followed by similar output about highly gapped sequences (this appears for quite a few strains; I’m listing just one here):
Gap/Ambiguity Composition p-value
1 Netherlands_DELIM-QHMJMXEJNNOYGPIHOOFN_Oss_1363500_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_413581 87.23% failed 0.00%
…and finally text that lists duplicate strains (just one listed here for brevity):
NOTE: Netherlands_DELIM-QHMJMXEJNNOYGPIHOOFN_Tilburg_1363354_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_413586 is identical to Netherlands_DELIM-QHMJMXEJNNOYGPIHOOFN_Oss_1363500_DELIM-QHMJMXEJNNOYGPIHOOFN_2020_DELIM-QHMJMXEJNNOYGPIHOOFN_EPI_ISL_413581 but kept for subsequent analysis
The warnings related to gaps are expected; I’m trying to create a spike-specific build that works by masking the majority of the sequence. The resulting .phy alignment file looks fine, and the .iqtree file looks like it contains tree info, including a tree in Newick format. But I’m not sure why the snakemake rule fails. Any ideas? Thanks!
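In case it helps with diagnosis, here is a minimal sketch of the sanity check I have in mind for the IQ-TREE outputs. It only tests whether the files named under "Analysis results written to" exist and are non-empty, and whether the treefile is structurally plausible Newick. The function names (`check_outputs`, `newick_looks_complete`) are hypothetical helpers, not part of augur or IQ-TREE, and the check says nothing about why augur itself reports a failure.

```python
from pathlib import Path


def newick_looks_complete(text: str) -> bool:
    """Cheap structural check: ends with ';' and has balanced parentheses.

    Assumption: taxon labels contain no parentheses (quoted labels that do
    would confuse this naive check).
    """
    text = text.strip()
    if not text.endswith(";"):
        return False
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing parenthesis without a matching opener
                return False
    return depth == 0


def check_outputs(prefix: str) -> dict:
    """Report which output files exist and are non-empty for a given prefix.

    The suffixes are the ones listed in the log excerpt above.
    """
    report = {}
    for suffix in (".iqtree", ".treefile", ".mldist", ".log"):
        path = Path(prefix + suffix)
        report[suffix] = path.is_file() and path.stat().st_size > 0
    return report


if __name__ == "__main__":
    # Prefix taken from the log above; adjust for a different build.
    prefix = "results/spike_global/aligned-delim.fasta"
    print(check_outputs(prefix))
    treefile = Path(prefix + ".treefile")
    if treefile.is_file():
        print("treefile looks complete:", newick_looks_complete(treefile.read_text()))
```

If every file is present and the treefile passes, the failure is more likely in what happens after IQ-TREE returns than in the tree search itself, though this check cannot confirm that.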