Avoiding large memory requirements by switching to `nextalign`

By default, Nextstrain’s pipeline to analyze SARS-CoV-2 uses mafft with the options --addfragments --keeplength to align all input sequences to the reference genome. For large data sets (>500k) this requires a lot of memory.

To keep running the pipeline on the entire data set, we now use nextalign instead of mafft. The reduces the memory requirements to a few Gb and is faster than mafft.

To use nextalign in your own workflow with the ncov workflow, you need to set two variables in your parameters.yaml or builds.yaml file (the config files used in the profiles):

use_nextalign: true
genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]

In our profile, this is happening here:

By defining the genes, you instruct nextalign to output translations of these genes along with the nucleotide alignment. These translations will be used to annotate the tree with amino acid changes since the default codon-by-codon translation of augur has trouble with out-of-frame gaps. In addition, nextalign will favor the opening of gaps at the beginning of a codon for the genes specified. This will result in gaps that respect codon boundaries when there is an ambiguity, i.e. if the gap can be shifted left or right without resulting in additional mismatches. The genes themselves are defined in the genome annotation. Our selection of genes above is motivated by the fact that they cover most of the genome and avoid overlapping ORFs, which would complicate the gap placement choice.

In addition to these changes in your workflow, you need to install nextalign and there are several ways to get it

  • nextalign is included in the latest conda environment
  • nextalign is part of the docker image
  • you can download and install it following these instructions. But make sure to copy the binary to a location that is in your PATH

A quick note on what nextalign does and its limitations:

  • nextalign first matches seeds across the genome to infer rough placement and amount of indels
  • then performs a banded alignment with an affine gap penalty (gap extension 0 by default) where the band width is estimated from the seed matches.
  • extracts the nucleotide sequences corresponding to genes, translates them, and aligns the peptide sequence to the reference translation.
  • the seed matching is currently not very robust and might fail if there is repeated sequence or if the sequences are too diverged. But this is not a problem for SARS-CoV-2 data.