Hi! Thanks for reaching out with such a well thought through question.
As a caveat, I don’t know much about bacterial genomics, so my answer will come from the perspective of someone treating the genome essentially as a huge virus that’s 20 times longer than mpox virus.
Part 1 - How to add to an existing alignment without having to realign previous already aligned genomes
Currently, we use a vcf file generated by parsnp from harvesttools for the alignment ( for 800 genomes, its usually finished in 4 hrs on a 200 GB RAM machine with 80 cores available, but we used 10-20 normally).
The problem is, that when adding newer genomes, we have to run the entire dataset from scratch.
Do I understand correctly that this part does not (yet) involve augur?
If I understand it correctly, you should be able to individually align each sequence against the same reference, outputting an aligned fasta (with insertions relative to reference stripped) for each (query) sequence. Then you append all the individual fastas together into one big fasta. This fasta can then be incrementally added to with new sequences without having to realign the rest.
You can easily create a VCF from this big fasta with a tool like Ucsc Fatovcf :: Anaconda.org
This should in theory solve the incremental alignment problem independent of the tool used for (reference) alignment.
If you do a proper multiple sequence alignment with MAFFT, this won’t work as that way you cannot strip insertions. I don’t have experience beyond simple reference alignments (yet) so can’t help here unfortunately.
Part 2 - Running out of memory
augur align should be able to add genomes to an existing alignment (which in turn, would suite for cumulative data analysis).
When testing augur align with 4 genomes (and planning to add a fifth to the alignment) we ran out of memory in the initial alignment of the 4 genomes (or so we think).
When adding a genome to an existing alignment from parsnp we had to stop the process as soon as we hit our RAM limit of 200 GB.
I can’t speak about parnsp, but augur align is just a wrapper around mafft. So it does a “proper” multiple sequence alignment which is much more complex than a simple one-by-one query vs reference alignment. MAFFT just can’t handle such big genomes I presume - but I haven’t used MAFFT much so I don’t know what’s going on here.
As a reminder, I don’t know much about bacteria and there might be other tools that work for that situation. Augur is primarily built with for smaller virus genomes, so the kind of things you’re trying to do aren’t as well tested and augur might not be able to do everything you’d want it to do.
Part 3 - Using Nextclade
In contrast to bacteria, this is my core area of expertise We’ve overhauled seed matching in version 3 of Nextclade which is currently in alpha release (so you can already use it, we’ll properly release it within a month or so) and already used by us in production for various workflows. You can download it here: Release 3.0.0-alpha.1 · nextstrain/nextclade · GitHub
So the sentence ““seed matching is currently not very robust and might fail if there is repeated sequence or if the sequences are too diverged.”” will probably no longer apply, i.e. it should be possible to use Nextclade to align your sequences against a reference. One might need to tune alignment/seed matching parameters however. I’m happy to help if you send me a few example sequences and a reference to align to.
Beware though that what Nextclade does is a “simple” one-by-one 1 query to 1 reference alignment which is means that insertions are stripped and information between query sequences is ignored. This makes the problem computationally much simpler, faster, and require less memory, but it comes at a loss of information that one needs to be aware of.
Documentation for v3 can already be looked at here: Nextclade: analysis of viral genetic sequences — Nextclade documentation
You might need to tweak the following parameters to get a fast and robust alignment that doesn’t take up too much memory:
--kmer-length <KMER_LENGTH> — Length of exactly matching k-mers used in the seed alignment of the query to the reference
--kmer-distance <KMER_DISTANCE> — Interval of successive k-mers on the query sequence. Should be small compared to the query length
--allowed-mismatches <ALLOWED_MISMATCHES> — Exactly matching k-mers are extended to the left and right until more than allowed_mismatches are observed in a sliding window (window_size)
--window-size <WINDOW_SIZE> — Size of the window within which mismatches are accumulated during seed extension
--min-match-length <MIN_MATCH_LENGTH> — Minimum length of extended k-mers
--min-seed-cover <MIN_SEED_COVER> — Fraction of the query sequence that has to be covered by extended seeds to proceed with the banded alignment
--max-alignment-attempts <MAX_ALIGNMENT_ATTEMPTS> — Number of times Nextclade will retry alignment with more relaxed results if alignment band boundaries are hit
--excess-bandwidth <EXCESS_BANDWIDTH> — Excess bandwidth for internal stripes
--terminal-bandwidth <TERMINAL_BANDWIDTH> — Excess bandwidth for terminal stripes
--max-band-area <MAX_BAND_AREA> — Maximum area of the band in the alignment matrix. Alignments with large bands are slow to compute and require substantial memory. Alignment of sequences requiring bands with area larger than this value, will not be attempted and a warning will be emitted
(from Reference — Nextclade documentation)
The alignment algorithm is explained a little here: 1. Sequence alignment — Nextclade documentation
If you send me a sequence and query I can have a look at parameters that work for you as the documentation for seed matching is still a bit sparse and you might find it hard to how to tune which parameters.
Do we have a problem with our code?
Is there a possibility to limit the ram usage of the augur align command?
In summary I don’t think you have a problem with your code but a problem with using tools in ways they don’t support (unless you have excessive memory and time).
I don’t think it’s possible to limit ram usage of augur align, as I’ve said, it’s just a wrapper around mafft, more convenient to use but with fewer options. Mafft might have an option, but I don’t know. Folks at biostar or a similar forum might know.
I hope that helps. Let me know if you have other questions downstream from alignment, i.e. making a Nextstrain tree with augur from the vcf.