Complex multi-nucleotide and indel mutation events

As we have collected more data, we have begun to accumulate a number of complex mutation events, for example three consecutive substitutions, or a substitution and a deletion in close proximity. We know from other mutational studies that these complex mutation events (some are calling these MNVs, multi-nucleotide variants, in the human world) are a single mutation event, not a series of consecutive events.

First off, props to Nextstrain for annotating these correctly at the amino acid level (most tools fail to)!

However, looking at the temporal branches, I think these are being counted as more than one event, based on the long branch lengths. For example, we are investigating a local outbreak where a triple substitution happens to be a hallmark (20263-20265, ORF1b, Glu2266->Ile).

see
https://nextstrain.org/ncov/north-america?c=division&f_division=Oregon&f_region=North%20America&r=division

Look at USA/OR-OHSU-0176/2020 (and cluster).

So my questions are:

  1. Am I correct that augur is treating these all as separate mutations? In my example, that would be 4 mutations (the triplet and another single mutation) on this branch and not 2. If so, are there any workarounds for these apart from hacking the sequences? While these are not the most common mutation event, they do occur in every species I know of (~5% of the de novo mutations in human) and we have a handful of these in the first 100 genomes we sequenced.

  2. This tree also has another good example re indels. USA/OR-OHSU-0177/2020 has a 6 nt deletion that separates it from the rest of the outbreak. From what I could tell, indels stopped being considered in the main build parameters several weeks ago. I was curious why, and whether altering this setting in our local builds would cause a lot of issues.

Thanks for posting Brian!

  1. Like BEAST and most other methods for estimating a molecular clock, augur (through its use of TreeTime) assumes that sites are independent. Observing 3 mutations along a branch therefore counts as three mutations for the molecular clock, and the parent node of this branch will be inferred to be older than it really is. I don’t know of an easy fix here, though perhaps @rneher might. If this really was a single mutation event as proposed, we’d expect never to see this branch split as more samples come in. The base of clade 20B has a similar 3-mer event. It’s really interesting, and unfortunate, that methods don’t deal with this well.

  2. I believe we should be handling deletions appropriately at this point. Although note that we treat N in an input sequence as ambiguous and infer its most likely nucleotide state. For sequences with missing bases that end up as - in the alignment we preserve this state. You can see an example here: https://nextstrain.org/ncov/asia?c=gt-ORF8_57&f_country=Singapore. This is the circulating deletion variant in Singapore. You can see the deletion in USA/OR-OHSU-0177/2020 here: https://nextstrain.org/ncov/north-america?c=gt-ORF1a_84&f_division=Oregon&m=div. I guess your surprise is that the deletion doesn’t contribute to branch length? This is pretty par for the course in terms of phylogenetics, but I admit it’s non-ideal.

Re: 1. That is what I assumed would be the case. We’ve used the divergence view to compensate for now. This has been a pet peeve of mine ever since we started finding the first de novo multi-nucleotide mutations in humans and then found out that they are seen across many species. USA/OR-OHSU-0126/2020 and USA/OR-OHSU-0135/2020 are another good example, with related samples closer in time that lack the mutation event.

As you point out, most models have long assumed that all substitutions are independent. Maybe an opportunity to innovate? These events tend to be fairly local (w/in a few nucleotides).

Re: 2. My understanding was that the alignments switched to using the --fill-gaps option by default and that this was filling in any deletions in the alignment (at least in the builds I was running in May). Sorry, I assumed this was still the case and that this was why the deletions were not contributing to the branch length/divergence in the latest build. However, I just checked the latest and I don’t see this option listed anymore.

So if deletions are flagged correctly in the alignments/annotations, then they are just ignored for the phylogenetics?

Hi Brian,

Yes, we assume sites are independent. There are attempts to fix this locally, for example by using codon models. But the computational cost increases rapidly with the size of the units you consider.
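To make the size of the effect concrete, here is a back-of-the-envelope sketch of how the mutation count on a branch translates into time under a strict clock with independent sites. The clock rate (8e-4 substitutions per site per year) and genome length (29,903 nt) are assumed illustrative values, and this is not how TreeTime actually computes node dates (it uses a likelihood that also incorporates sampling dates); it only shows that counting the triplet as three events roughly doubles the naive time estimate for the Oregon branch.

```python
def naive_branch_days(n_mutations: int,
                      rate_per_site_per_year: float = 8e-4,  # assumed clock rate
                      genome_length: int = 29903) -> float:  # assumed genome length
    """Expected waiting time (in days) to accumulate n_mutations under a strict clock."""
    mutations_per_year = rate_per_site_per_year * genome_length
    return 365.0 * n_mutations / mutations_per_year


# Triplet counted as three events plus one single mutation: 4 mutations.
days_as_four = naive_branch_days(4)
# Triplet counted as one event plus one single mutation: 2 events.
days_as_two = naive_branch_days(2)

print(f"4 mutations: ~{days_as_four:.0f} days; 2 events: ~{days_as_two:.0f} days")
```

Under these assumptions the branch looks about two months long instead of about one.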

And you are right, deletions are ignored in the branch length calculation in TreeTime (and insertions are stripped from the alignment before the tree building even starts).
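As a minimal illustration of what "ignored" means here, the sketch below counts differences between two aligned sequences while skipping any column where either sequence has a gap or an N. The function name and the toy sequences are made up for this example; it is not TreeTime's code, just the same idea in a few lines.

```python
def count_substitutions(parent: str, child: str, ignore: str = "-N") -> int:
    """Count differing sites, skipping columns where either sequence is a gap or N."""
    if len(parent) != len(child):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(parent.upper(), child.upper())
        if a != b and a not in ignore and b not in ignore
    )


# Toy alignment: one substitution (G->C at column 4) plus a 6 nt deletion.
parent = "ATGGAGCTTAAACCC"
child = "ATGCAG------CCC"

# Only the substitution contributes; the deleted columns are skipped entirely.
n_subs = count_substitutions(parent, child)
print(n_subs)
```

A sample that differs from its relatives only by a deletion, like USA/OR-OHSU-0177/2020, would therefore add no branch length under this kind of counting.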

Thanks @rneher @trvrb. @rneher Just curious about how you are adjusting in the “attempts to fix locally”?

I have a few rough ideas on how to evaluate these events and their impact on the trees, but not really the bandwidth to do it myself. If anyone is interested, we could set up a chat. For example, one could look at branches with multiple mutations and evaluate the distribution of distances between them. My prediction is that these events will be obvious outliers (likely w/in 10 nts). Then one could look at these events on branches in more densely sampled regions (e.g., UK) and compare the estimated timing against the actual timing, as well as how often an intermediate stage is seen (w/ 1 or 2 mutations, not the final set).

I think based on these data you could flag likely multinucleotide events just based on their distance from each other (not computationally intensive) and then apply a fudge factor to the time on that branch. This could be adjusted based on the likelihood estimates that it was a single versus multiple events. That said, I don’t know a lot about what is happening under the hood.
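A rough sketch of the distance-based flagging idea: group the mutations on a single branch into candidate events whenever consecutive positions are within some maximum gap. The function name, the 10 nt threshold (taken from the guess above), and the position of the fourth mutation are all assumptions for illustration; nothing like this exists in augur as far as this thread establishes.

```python
from typing import List


def group_nearby_mutations(positions: List[int], max_gap: int = 10) -> List[List[int]]:
    """Group mutation positions on one branch into candidate single events.

    Consecutive positions no more than max_gap nucleotides apart are
    placed in the same group.
    """
    groups: List[List[int]] = []
    for pos in sorted(set(positions)):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


# The triplet from the Oregon outbreak plus one other mutation on the same
# branch. Position 25000 is a placeholder; the real position is not given above.
branch_mutations = [20263, 20264, 20265, 25000]

groups = group_nearby_mutations(branch_mutations)
n_events = len(groups)
print(groups, n_events)
```

The number of groups could then stand in for the raw mutation count when adjusting the time on that branch.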

By “locally” I meant for adjacent nucleotides, like defining transition matrices for triplets (codons).

One hack to account for some of this would be to mask all but one mutation of such multinucleotide events for branch length calculations.
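A minimal sketch of that masking hack, assuming the mutations have already been grouped into candidate events (each group a list of 1-based positions). The helper names are hypothetical, and replacing bases with N is just one way to mask. In a real build the mask would probably need to be applied to the same columns in every sequence of the alignment, since an N in a single sequence is simply inferred from its neighbours.

```python
from typing import Iterable, List


def positions_to_mask(groups: Iterable[List[int]]) -> List[int]:
    """Keep the first position of each group; return the rest for masking."""
    return sorted(p for group in groups for p in sorted(group)[1:])


def mask_sequence(seq: str, positions: Iterable[int], mask_char: str = "N") -> str:
    """Replace the given 1-based positions in seq with mask_char."""
    chars = list(seq)
    for pos in positions:
        chars[pos - 1] = mask_char
    return "".join(chars)


# One triplet event and one isolated mutation (25000 is a placeholder position).
to_mask = positions_to_mask([[20263, 20264, 20265], [25000]])
print(to_mask)

# Toy sequence: mask columns 2 and 3, leaving column 1 to carry the event.
masked = mask_sequence("ACGTACGT", [2, 3])
print(masked)
```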

We currently scan for clusters of changes and exclude sequences with more than 6 changes in 100 bases. Most of these are alignment or bioinformatic issues. But there is a chance that we throw out bona fide variation.
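For anyone wanting to reproduce that kind of filter locally, here is a sliding-window sketch of the rule as stated (more than 6 changes within 100 bases). It is an illustration under that reading of the rule, not the actual Nextstrain implementation; the function name and the exact window semantics are assumptions.

```python
from typing import Iterable


def has_mutation_cluster(positions: Iterable[int],
                         window: int = 100,
                         max_changes: int = 6) -> bool:
    """Return True if any stretch of `window` bases holds more than max_changes changes."""
    pos = sorted(set(positions))
    start = 0
    for end in range(len(pos)):
        # Shrink the window from the left until it spans fewer than `window` bases.
        while pos[end] - pos[start] >= window:
            start += 1
        if end - start + 1 > max_changes:
            return True
    return False


# Seven changes within 60 bases: flagged.
flagged = has_mutation_cluster([100, 110, 120, 130, 140, 150, 160])
# A triplet plus one distant mutation: not flagged.
kept = has_mutation_cluster([20263, 20264, 20265, 25000])
print(flagged, kept)
```

With these thresholds, a bona fide triplet event on its own would pass the filter; only denser clusters would be excluded.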