By default, Nextstrain’s pipeline to analyze SARS-CoV-2 uses mafft
with the options --addfragments --keeplength
to align all input sequences to the reference genome. For large data sets (>500k) this requires a lot of memory.
To keep running the pipeline on the entire data set, we now use nextalign
instead of mafft
. The reduces the memory requirements to a few Gb and is faster than mafft
.
To use nextalign
in your own workflow with the ncov
workflow, you need to set two variables in your parameters.yaml
or builds.yaml
file (the config files used in the profiles):
use_nextalign: true
genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]
In our profile, this is happening here:
nextstrain_profiles/nextstrain/builds.yaml
By defining the genes, you instruct nextalign
to output translations of these genes along with the nucleotide alignment. These translations will be used to annotate the tree with amino acid changes since the default codon-by-codon translation of augur
has trouble with out-of-frame gaps. In addition, nextalign
will favor the opening of gaps at the beginning of a codon for the genes specified. This will result in gaps that respect codon boundaries when there is an ambiguity, i.e. if the gap can be shifted left or right without resulting in additional mismatches. The genes themselves are defined in the genome annotation. Our selection of genes above is motivated by the fact that they cover most of the genome and avoid overlapping ORFs, which would complicate the gap placement choice.
In addition to these changes in your workflow, you need to install nextalign
and there are several ways to get it
-
nextalign
is included in the latest conda environment -
nextalign
is part of the docker image - you can download and install it following these instructions. But make sure to copy the binary to a location that is in your
PATH
A quick note on what nextalign does and its limitations:
-
nextalign
first matches seeds across the genome to infer rough placement and amount of indels - then performs a banded alignment with an affine gap penalty (gap extension 0 by default) where the band width is estimated from the seed matches.
- extracts the nucleotide sequences corresponding to genes, translates them, and aligns the peptide sequence to the reference translation.
- the seed matching is currently not very robust and might fail if there is repeated sequence or if the sequences are too diverged. But this is not a problem for SARS-CoV-2 data.