Augur alignment failing - problem with mafft

Hi! I am trying to make a build on Nextstrain starting with the GISAID-provided nextmeta and nextfasta files, i.e., with all ~300k available sequences. I successfully ran the getting_started datasets both on my own laptop and on a cluster, but I can't get the workflow for my own build running. In both cases it gets stuck in the alignment phase with MAFFT.

On the laptop, the error is:

Error in rule align:
jobid: 24
output: results/aligned.fasta
log: logs/align.txt (check log file(s) for error message)
shell:

    mafft --auto --thread 2 --keeplength --addfragments results/prefiltered.fasta defaults/reference_seq.fasta > results/aligned.fasta 2> logs/align.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/align.txt:
nadd = 287170
npair = 287170
nseq = 287171
nlen = 59753
use ktuples, size=6!
nadd = 287170
ppenalty_ex = -10
nthread = 2
blosum 62 / kimura 200
sueff_global = 0.100000
norg = 1
njobc = 2
Cannot allocate 239013 character vector.

While on the cluster:

[Mon Dec 28 16:40:19 2020]
Error in rule align:
jobid: 28
output: results/aligned.fasta
log: logs/align.txt (check log file(s) for error message)
shell:

    mafft --auto --thread 12 --keeplength --addfragments results/prefiltered.fasta defaults/reference_seq.fasta > results/aligned.fasta 2> logs/align.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/align.txt:
grep: write error
cat: write error: No space left on device
bioconda3_env/nextstrain/bin/mafft: line 1211: [: -eq: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1229: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1234: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1239: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1244: [: -lt: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1249: [: -lt: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1256: [: -lt: unary operator expected
bioconda3_env/nextstrain/bin/mafft: line 1263: [: -lt: unary operator expected
expr: syntax error
bioconda3_env/nextstrain/bin/mafft: line 1323: [: -: integer expression expected
bioconda3_env/nextstrain/bin/mafft: line 1331: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1334: [: too many arguments
bioconda3_env/nextstrain/bin/mafft: line 1975: [: -gt: unary operator expected

The --keeplength and --mapout options are supported
only with --add, --addfragments or --addlong.

Removing output files of failed job align since they might be corrupted:
results/aligned.fasta
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /ncov/.snakemake/log/2020-12-28T164016.142116.snakemake.log

The error on my laptop suggests that the larger dataset leads to a very long alignment (nlen = 59753), which sounds problematic. Would that be the cause of the failure in both instances?

Looking around, this seems to be a common problem with MAFFT - is there a simple way to solve it?

I am running into the same problem. Did you get it resolved?

Hi @tuomas, welcome. Apologies that this post slipped our notice (and thanks for the bump, @AlexS).

On your laptop, the output shows the signs of an out-of-memory condition. One clue in particular stands out from experience: the Cannot allocate [N] character vector line in the log file. The full builds require much more memory than is available on a typical laptop.

On your cluster, the output more clearly points at an out-of-disk-space condition with these lines in the log file:

grep: write error
cat: write error: No space left on device

The builds need quite a bit of disk space while running (although not a ridiculous amount).

I’d suggest checking what resources are available on your cluster and if you can increase the disk space available to your job.
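As a rough pre-flight sketch (assuming a Linux node; the 200 GiB figure is illustrative, not a hard requirement), you could check the job's disk and memory headroom before launching:

```shell
# Check free disk in the working directory and total RAM before launching.
# Thresholds are illustrative; adjust for your build.
need_disk_gb=200
avail_kb=$(df -Pk . | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))
echo "available disk: ${avail_gb} GiB (want >= ${need_disk_gb} GiB)"
total_ram_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo 2>/dev/null || echo 0)
echo "total RAM: ${total_ram_gb} GiB"
```

On a shared cluster, your scheduler's job limits (not the node totals) are usually what matters, so check those too.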

For reference, we run our daily global + regional builds with 96 CPU cores, 180 GiB of RAM, and ~200 GiB of disk available. You don’t strictly need this much if you don’t parallelize the builds as aggressively or if you further subset the data involved.
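If you do need to shrink the input, one way to subset before alignment is augur filter's subsampling options. A sketch, where the file paths, grouping columns, and the 50,000-sequence cap are all assumptions to adapt to your build:

```shell
# Hypothetical downsampling step before alignment. The command is built as a
# string and echoed so you can inspect it before running.
SUBSAMPLE_CMD="augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --group-by region year month \
  --subsample-max-sequences 50000 \
  --output results/subsampled.fasta"
echo "$SUBSAMPLE_CMD"
# eval "$SUBSAMPLE_CMD"  # uncomment once the paths match your build
```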


Hi, @trs.
I am having a similar problem:

> Error in rule align:
>     jobid: 22
>     output: results/aligned.fasta
>     log: logs/align.txt (check log file(s) for error message)
>     shell:
>         
>         mafft --auto --thread 10 --keeplength --addfragments results/prefiltered.fasta defaults/reference_seq.fasta > results/aligned.fasta 2> logs/align.txt
>         
>         (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
> Logfile logs/align.txt:
> nadd =  424033
> npair =  424033
> nseq =  424034
> nlen =  34692
> use ktuples, size=6!
> nadd = 424033
> ppenalty_ex = -10
> nthread = 10
> blosum 62 / kimura 200
> sueff_global = 0.100000
> norg = 1
> njobc = 2
> generating a scoring matrix for nucleotide (dist=200) ... done
> 
> 
> Making a distance matrix ..
> /home/fernando_hayashi/miniconda3/envs/nextstrain/bin/mafft: line 2747: 35443 Killed                  "$prefix/addsingle" -Q 100 $legacygapopt -W $tuplesize -O $outnum $addsinglearg $addarg $add2ndhalfarg -C $numthreads $memopt $weightopt $treeinopt $treeoutopt $distoutopt $seqtype $model -f "-"$gop -h $aof $param_fft $localparam $algopt $treealg $scoreoutarg < infile > /dev/null 2>> "$progressfile"
> 
> 
> Removing output files of failed job align since they might be corrupted:
> results/aligned.fasta
> Shutting down, this might take some time.
> Exiting because a job execution failed. Look above for error message
> Complete log: /home/fernando_hayashi/Documentos/sars-cov-2/ncov/.snakemake/log/2021-01-29T092954.195478.snakemake.log

Do you think it is a RAM problem? I am running my analysis on an Intel® Xeon® W-2235 CPU @ 3.80GHz × 12 with 64 GB of RAM. Because of this problem I asked earlier about the minimum specs for using Nextstrain, and you answered that my computer was fine for it.

The error you’re seeing is almost certainly because MAFFT ran out of memory. The mafft: line 2747: 35443 Killed line is the clue.

I may have been mistaken when I said that 64GB of RAM was enough for the full GISAID dataset (~424k seqs in your case above). Looking at the benchmarks for our production builds:

[screenshot: resource benchmarks from our production builds]

I see that while the 7 instances of the “tree” and “refine” steps sum to less than 64 GB of RAM, the shared “align” step uses more than 120 GB all by itself.

You could further subset the data before aligning so it fits within 64 GB, or possibly adjust the memory strategy MAFFT uses. augur align runs MAFFT with its --nomemsave option, which significantly increases memory requirements but makes the alignment much faster. Disabling this might let the alignment fit in 64 GB. We also have another aligner in the works which I believe would reduce memory requirements, but I'm not sure when that'll be ready.
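As a sketch of the second option, you could re-run the failed align step by hand with MAFFT's --memsave flag in place of --nomemsave. The paths and thread count below mirror the log above and are assumptions about your setup:

```shell
# Hypothetical manual re-run of the align step with --memsave. The command is
# built as a string and echoed so you can inspect it before running.
ALIGN_CMD="mafft --memsave --auto --thread 10 --keeplength \
  --addfragments results/prefiltered.fasta \
  defaults/reference_seq.fasta"
echo "$ALIGN_CMD"
# eval "$ALIGN_CMD" > results/aligned.fasta 2> logs/align.txt  # uncomment to run
```

Expect a much slower alignment in exchange for the lower memory footprint.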


We have now released our reference aligner, nextalign, which uses much less memory.

You can download it in the download section here:

The ncov workflow has been updated to use this, but it is currently an opt-in feature. See here for the rule:

I wrote a quick summary on how to switch to nextalign: