By default, Nextstrain’s pipeline to analyze SARS-CoV-2 uses
mafft with the options
--addfragments --keeplength to align all input sequences to the reference genome. For large data sets (>500k) this requires a lot of memory.
To keep running the pipeline on the entire data set, we now use
nextalign instead of
mafft. The reduces the memory requirements to a few Gb and is faster than
nextalign in your own workflow with the
ncov workflow, you need to set two variables in your
builds.yaml file (the config files used in the profiles):
use_nextalign: true genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]
In our profile, this is happening here:
By defining the genes, you instruct
nextalign to output translations of these genes along with the nucleotide alignment. These translations will be used to annotate the tree with amino acid changes since the default codon-by-codon translation of
augur has trouble with out-of-frame gaps. In addition,
nextalign will favor the opening of gaps at the beginning of a codon for the genes specified. This will result in gaps that respect codon boundaries when there is an ambiguity, i.e. if the gap can be shifted left or right without resulting in additional mismatches. The genes themselves are defined in the genome annotation. Our selection of genes above is motivated by the fact that they cover most of the genome and avoid overlapping ORFs, which would complicate the gap placement choice.
In addition to these changes in your workflow, you need to install
nextalign and there are several ways to get it
nextalignis included in the latest conda environment
nextalignis part of the docker image
- you can download and install it following these instructions. But make sure to copy the binary to a location that is in your
A quick note on what nextalign does and its limitations:
nextalignfirst matches seeds across the genome to infer rough placement and amount of indels
- then performs a banded alignment with an affine gap penalty (gap extension 0 by default) where the band width is estimated from the seed matches.
- extracts the nucleotide sequences corresponding to genes, translates them, and aligns the peptide sequence to the reference translation.
- the seed matching is currently not very robust and might fail if there is repeated sequence or if the sequences are too diverged. But this is not a problem for SARS-CoV-2 data.