Alignments in Nextstrain

NyxMoiren · December 15, 2023, 11:20am

Hello!

We try to build our custom nextstrain build using bacterial genomes of 5.2MB per size.

Currently, we use a vcf file generated by parsnp from harvesttools for the alignment ( for 800 genomes, its usually finished in 4 hrs on a 200 GB RAM machine with 80 cores available, but we used 10-20 normally).
The problem is, that when adding newer genomes, we have to run the entire dataset from scratch.

augur align should be able to add genomes to an existing alignment (which in turn, would suite for cumulative data analysis).
When testing augur align with 4 genomes (and planning to add a fifth to the alignment) we ran out of memory in the initial alignment of the 4 genomes (or so we think).

When adding a genome to an existing alignment from parsnp we had to stop the process as soon as we hit our RAM limit of 200 GB.

We saw the forum post Link regarding memory issues and the possible solution of using augur nextalign, however as the post already stated, the “seed matching is currently not very robust and might fail if there is repeated sequence or if the sequences are too diverged.” which is the case for our data set.

This is the code we have used for running our alignment file using only augur:

augur align -s input.dir -o output.fasta --nthreads 70 --reference-sequence ref.fasta

Do we have a problem with our code?
Is there a possibility to limit the ram usage of the augur align command?

corneliusroemer · December 15, 2023, 3:25pm

Hi! Thanks for reaching out with such a well thought through question.

As a caveat, I don’t know much about bacterial genomics, so my answer will come from the perspective of someone treating the genome essentially as a huge virus that’s 20 times longer than mpox virus.

Part 1 - How to add to an existing alignment without having to realign previous already aligned genomes

Currently, we use a vcf file generated by parsnp from harvesttools for the alignment ( for 800 genomes, its usually finished in 4 hrs on a 200 GB RAM machine with 80 cores available, but we used 10-20 normally).
The problem is, that when adding newer genomes, we have to run the entire dataset from scratch.

Do I understand correctly that this part does not (yet) involve augur?

If I understand it correctly, you should be able to individually align each sequence against the same reference, outputting an aligned fasta (with insertions relative to reference stripped) for each (query) sequence. Then you append all the individual fastas together into one big fasta. This fasta can then be incrementally added to with new sequences without having to realign the rest.

You can easily create a VCF from this big fasta with a tool like Ucsc Fatovcf :: Anaconda.org

This should in theory solve the incremental alignment problem independent of the tool used for (reference) alignment.

If you do a proper multiple sequence alignment with MAFFT, this won’t work as that way you cannot strip insertions. I don’t have experience beyond simple reference alignments (yet) so can’t help here unfortunately.

Part 2 - Running out of memory

augur align should be able to add genomes to an existing alignment (which in turn, would suite for cumulative data analysis).
When testing augur align with 4 genomes (and planning to add a fifth to the alignment) we ran out of memory in the initial alignment of the 4 genomes (or so we think).

When adding a genome to an existing alignment from parsnp we had to stop the process as soon as we hit our RAM limit of 200 GB.

I can’t speak about parnsp, but augur align is just a wrapper around mafft. So it does a “proper” multiple sequence alignment which is much more complex than a simple one-by-one query vs reference alignment. MAFFT just can’t handle such big genomes I presume - but I haven’t used MAFFT much so I don’t know what’s going on here.

As a reminder, I don’t know much about bacteria and there might be other tools that work for that situation. Augur is primarily built with for smaller virus genomes, so the kind of things you’re trying to do aren’t as well tested and augur might not be able to do everything you’d want it to do.

Part 3 - Using Nextclade

In contrast to bacteria, this is my core area of expertise We’ve overhauled seed matching in version 3 of Nextclade which is currently in alpha release (so you can already use it, we’ll properly release it within a month or so) and already used by us in production for various workflows. You can download it here: Release 3.0.0-alpha.1 · nextstrain/nextclade · GitHub

So the sentence ““seed matching is currently not very robust and might fail if there is repeated sequence or if the sequences are too diverged.”” will probably no longer apply, i.e. it should be possible to use Nextclade to align your sequences against a reference. One might need to tune alignment/seed matching parameters however. I’m happy to help if you send me a few example sequences and a reference to align to.

Beware though that what Nextclade does is a “simple” one-by-one 1 query to 1 reference alignment which is means that insertions are stripped and information between query sequences is ignored. This makes the problem computationally much simpler, faster, and require less memory, but it comes at a loss of information that one needs to be aware of.

Documentation for v3 can already be looked at here: Nextclade: analysis of viral genetic sequences — Nextclade documentation

You might need to tweak the following parameters to get a fast and robust alignment that doesn’t take up too much memory:

--kmer-length <KMER_LENGTH> — Length of exactly matching k-mers used in the seed alignment of the query to the reference
--kmer-distance <KMER_DISTANCE> — Interval of successive k-mers on the query sequence. Should be small compared to the query length
--allowed-mismatches <ALLOWED_MISMATCHES> — Exactly matching k-mers are extended to the left and right until more than allowed_mismatches are observed in a sliding window (window_size)
--window-size <WINDOW_SIZE> — Size of the window within which mismatches are accumulated during seed extension
--min-match-length <MIN_MATCH_LENGTH> — Minimum length of extended k-mers
--min-seed-cover <MIN_SEED_COVER> — Fraction of the query sequence that has to be covered by extended seeds to proceed with the banded alignment
--max-alignment-attempts <MAX_ALIGNMENT_ATTEMPTS> — Number of times Nextclade will retry alignment with more relaxed results if alignment band boundaries are hit
--excess-bandwidth <EXCESS_BANDWIDTH> — Excess bandwidth for internal stripes
--terminal-bandwidth <TERMINAL_BANDWIDTH> — Excess bandwidth for terminal stripes
--max-band-area <MAX_BAND_AREA> — Maximum area of the band in the alignment matrix. Alignments with large bands are slow to compute and require substantial memory. Alignment of sequences requiring bands with area larger than this value, will not be attempted and a warning will be emitted

(from Reference — Nextclade documentation)

The alignment algorithm is explained a little here: 1. Sequence alignment — Nextclade documentation

If you send me a sequence and query I can have a look at parameters that work for you as the documentation for seed matching is still a bit sparse and you might find it hard to how to tune which parameters.

Do we have a problem with our code?
Is there a possibility to limit the ram usage of the augur align command?

In summary I don’t think you have a problem with your code but a problem with using tools in ways they don’t support (unless you have excessive memory and time).

I don’t think it’s possible to limit ram usage of augur align, as I’ve said, it’s just a wrapper around mafft, more convenient to use but with fewer options. Mafft might have an option, but I don’t know. Folks at biostar or a similar forum might know.

I hope that helps. Let me know if you have other questions downstream from alignment, i.e. making a Nextstrain tree with augur from the vcf.

Best,

Cornelius

NyxMoiren · January 12, 2024, 6:15am

Hi Cornelius,

thanks for the detailed reply and sorry for taking so long to answer you back.

We will have a look into the matter and see where your tips can be helpful for our dataset. If we need help again, we will for sure use the forum

Thanks again and all the best!

Topic		Replies	Views
Building a Custom Nextstrain - Alignment Error Help and Getting Started	6	443	April 17, 2023
Avoiding large memory requirements by switching to `nextalign` General	0	883	March 22, 2021
Using existing alignment Help and Getting Started	5	525	January 29, 2022
Error with a flu reference sequence for alignment Help and Getting Started	8	302	March 27, 2024
Augur alignment failing - problem with mafft Help and Getting Started	6	2679	March 22, 2021

Alignments in Nextstrain

Part 1 - How to add to an existing alignment without having to realign previous already aligned genomes

Part 2 - Running out of memory

Part 3 - Using Nextclade

Related topics