Hi! I am trying to make a build on Nextstrain starting from the GISAID-provided nextmeta and nextfasta files, i.e. with the whole ~300k sequences available. I successfully ran the getting_started datasets both on my laptop and on a cluster, but I can’t get the workflow for my own build running. In both cases it gets stuck in the alignment phase with MAFFT.
On my laptop, the error is:
Error in rule align:
jobid: 24
output: results/aligned.fasta
log: logs/align.txt (check log file(s) for error message)
shell:
mafft --auto --thread 2 --keeplength --addfragments results/prefiltered.fasta defaults/reference_seq.fasta > results/aligned.fasta 2> logs/align.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
On the cluster, the error is:
[Mon Dec 28 16:40:19 2020]
Error in rule align:
jobid: 28
output: results/aligned.fasta
log: logs/align.txt (check log file(s) for error message)
shell:
mafft --auto --thread 12 --keeplength --addfragments results/prefiltered.fasta defaults/reference_seq.fasta > results/aligned.fasta 2> logs/align.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/align.txt:
grep: write error
cat: write error: No space left on device
bioconda3_env/nextstrain/bin/mafft: line 1211: [: -eq: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1229: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1234: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1239: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1244: [: -lt: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1249: [: -lt: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1256: [: -lt: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1263: [: -lt: unary operator expected
expr: syntax error
bioconda3_env/nextstrain/bin/mafft: line 1323: [: -: integer expression expected
bioconda3_env/nextstrain/bin/mafft: line 1331: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1334: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1975: [: -gt: unary operator expected
The --keeplength and --mapout options are supported
only with --add, --addfragments or --addlong.
Removing output files of failed job align since they might be corrupted:
results/aligned.fasta
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ncov/.snakemake/log/2020-12-28T164016.142116.snakemake.log
The error on my laptop suggests that the larger dataset leads to a very long alignment (nlen=59752), which sounds problematic. Could that be the cause of the failure in both instances?
Hi @tuomas, welcome. Apologies that this post slipped our notice (and thanks for the bump, @AlexS).
On your laptop, the output shows the signs of an out-of-memory condition. One clue in particular stands out from experience: Cannot allocate [N] character vector in the log file. The full builds require much more memory than is available on a typical laptop.
On your cluster, the output more clearly points at an out-of-disk-space condition with these lines in the log file:
grep: write error
cat: write error: No space left on device
The builds need quite a bit of disk space while running (although not a ridiculous amount).
I’d suggest checking what resources are available on your cluster and whether you can increase the disk space allotted to your job.
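Before resubmitting, it can help to confirm what the job actually sees. A quick sanity check with standard Linux tools, run from the directory holding the build (these are generic commands, not part of the ncov workflow):

```shell
# Free disk space on the filesystem that holds the build's working directory
df -h .
# Total and available RAM in GiB (Linux)
free -g
```

If your scheduler places jobs on a separate scratch filesystem, run the same check from inside a submitted job, since the login node may see different mounts.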
For reference, we run our daily global + regional builds with 96 CPU cores, 180 GiB of RAM, and ~200 GiB of disk available. You don’t strictly need this much if you don’t parallelize the builds as aggressively or if you further subset the data involved.
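For example, on a SLURM cluster a resource request mirroring those numbers might look like the sketch below (the flag choices and the profile path are illustrative, not something the ncov repo ships; check your site’s documentation):

```shell
#!/bin/bash
# Hypothetical SLURM job script; adjust to your scheduler and site policy.
#SBATCH --cpus-per-task=96   # CPU cores for parallel rules
#SBATCH --mem=180G           # RAM for the whole job
#SBATCH --tmp=200G           # local scratch disk for intermediate files
snakemake --profile my_profiles/example --cores 96
```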
> Error in rule align:
> jobid: 22
> output: results/aligned.fasta
> log: logs/align.txt (check log file(s) for error message)
> shell:
>
> mafft --auto --thread 10 --keeplength --addfragments results/prefiltered.fasta defaults/reference_seq.fasta > results/aligned.fasta 2> logs/align.txt
>
> (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
> Logfile logs/align.txt:
> nadd = 424033
> npair = 424033
> nseq = 424034
> nlen = 34692
> use ktuples, size=6!
> nadd = 424033
> ppenalty_ex = -10
> nthread = 10
> blosum 62 / kimura 200
> sueff_global = 0.100000
> norg = 1
> njobc = 2
> generating a scoring matrix for nucleotide (dist=200) ... done
>
>
> Making a distance matrix ..
> /home/fernando_hayashi/miniconda3/envs/nextstrain/bin/mafft: line 2747: 35443 Killed "$prefix/addsingle" -Q 100 $legacygapopt -W $tuplesize -O $outnum $addsinglearg $addarg $add2ndhalfarg -C $numthreads $memopt $weightopt $treeinopt $treeoutopt $distoutopt $seqtype $model -f "-"$gop -h $aof $param_fft $localparam $algopt $treealg $scoreoutarg < infile > /dev/null 2>> "$progressfile"
>
>
> Removing output files of failed job align since they might be corrupted:
> results/aligned.fasta
> Shutting down, this might take some time.
> Exiting because a job execution failed. Look above for error message
> Complete log: /home/fernando_hayashi/Documentos/sars-cov-2/ncov/.snakemake/log/2021-01-29T092954.195478.snakemake.log
Do you think it is a RAM problem? I am running my analysis on an Intel® Xeon® W-2235 CPU @ 3.80GHz × 12 with 64 GB of RAM. Because of this problem I previously asked about the minimum specs for running Nextstrain, and you answered that my computer was fine for it.
The error you’re seeing is almost certainly because MAFFT ran out of memory. The mafft: line 2747: 35443 Killed line is the clue.
I may have been mistaken when I said that 64 GB of RAM was enough for the full GISAID dataset (~424k sequences in your case above). Looking at the benchmarks for our production builds, I see that while the 7 instances of the “tree” and “refine” steps together sum to less than 64 GB of RAM, the shared “align” step uses more than 120 GB all by itself.
You could further subset the data before aligning to fit within 64 GB, or possibly adjust the memory strategy MAFFT uses. augur align passes MAFFT’s --nomemsave option, which significantly increases memory requirements but makes the alignment much faster. Disabling this might let the alignment fit in 64 GB. We also have another aligner in the works which I believe would reduce memory requirements, but I’m not sure when that will be ready.
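For anyone trying the second route: one way is to run the alignment step by hand with MAFFT’s --memsave mode in place of the --nomemsave that augur align passes. A hedged sketch of what that command might look like — the flag substitution is a suggestion, not something the ncov workflow supports out of the box, so the command is echoed here rather than executed (it assumes the workflow’s results/ and defaults/ files exist):

```shell
# Hypothetical manual re-run of the align rule with MAFFT's memory-saving
# algorithm; when running it for real, redirect output as the rule does:
#   $cmd > results/aligned.fasta 2> logs/align.txt
cmd="mafft --auto --thread 10 --memsave --keeplength --addfragments results/prefiltered.fasta defaults/reference_seq.fasta"
echo "$cmd"
```

Expect the memory-saving mode to be noticeably slower on an alignment of this size; it trades time for RAM.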