Using logistic_growth to identify a more transmissible/fit lineage?

Hi!
I have just finalized a build on ~1750 genomes that we have collected so far in Hawaii.
Lately, the B.1.429 lineage has been the dominating variant detected here.
For this particular build, I looked at the logistic_growth value, calculated as described in this
fascinating discussion, mostly between Trevor @trvrb and John @jlhudd: https://github.com/nextstrain/ncov/pull/595
I need more time to understand the subtleties of how this value does or does not reflect a lineage that is more transmissible or fit to expand in a given context (with the limitations caused by non-uniform or insufficient sampling), but I would be very grateful if you could look at the results below and tell me whether you think this is a red herring or a true sub-lineage of interest.

The B.1.429 lineage seems to be split into a few sub-clades that are either below or above 0 in terms of logistic_growth:


What seems to differentiate the clades that are above from the ones that are below 0 is a mutation in the nucleocapsid protein - N:M234I (M is blue in the screenshot above and I is yellow)

If you look at the time-resolved tree of these samples colored by logistic_growth, the clades below the N:M234I mutation are all orange-red, showing the same idea:

Now, if I look at the latest North America Nextstrain build (which has only 108 B.1.429 sequences, though) in the same way:
https://nextstrain.org/ncov/north-america?branchLabel=aa&c=gt-N_234&d=tree,map,frequencies&f_pango_lineage=B.1.429&l=scatter&m=div&p=grid&scatterY=logistic_growth
it looks somewhat similar, since there are two subclades of B.1.429, one with a slightly positive logistic_growth value (~0.38) and one with a slightly negative value (~ -0.05).

Unfortunately, both these values fall into the same color bin if you color by logistic_growth, but you can still see the positive or negative value if you hover over the tips:
https://nextstrain.org/ncov/north-america?branchLabel=aa&c=logistic_growth&d=tree,map,frequencies&f_pango_lineage=B.1.429&m=div&p=grid

What is different is that there are some subclades that are not under N:M234I that still have a positive value for logistic_growth. In fact, everything under C12100T and C8947T seems to have a positive value (whereas in our tree, all the samples have both C12100T and C8947T, yet they are still separated into subclades with positive and negative logistic_growth values).

The Andersen lab has a community Nextstrain build that has 517 B.1.429 sequences, but unfortunately their builds don’t calculate logistic_growth, at least not yet:
https://nextstrain.org/community/andersen-lab/HCoV-19-Genomics-Nextstrain/hCoV-19/usa/sandiego?branchLabel=aa&c=gt-N_234&d=tree,frequencies&f_pangolin_lineage=B.1.429&p=full

So my questions are:

  • can we assert that we see some subclades that have increased transmissibility, or is it just sampling noise?
  • can we link the higher logistic_growth value to a particular mutation or not (it seems not, by looking at the global build data)?

Your insight would be highly appreciated.
Thank you,
Razvan

Your link to the original discussion and code is very useful by the way.

My understanding is that a logistic growth of L=0.38 means your lineage was exp(L/12)=1.032 times smaller one month earlier (with everything else kept the same), and a significant growth would be more like L > 5.
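To put numbers on that reading, here is a small sketch that converts a logistic growth value into a one-month fold change. It assumes, as stated above, that L is a rate per year. That is this post's interpretation and has not been checked against the augur code, and the function name is made up for illustration.

```python
import math

def monthly_fold_change(logistic_growth: float) -> float:
    """Fold change over one month, assuming logistic_growth is a rate per year."""
    return math.exp(logistic_growth / 12)

# 0.38 is the value from the North America build; 5 and 8 are the
# values quoted in this thread for the local build.
for value in (0.38, 5.0, 8.0):
    print(f"L = {value}: x{monthly_fold_change(value):.2f} per month")
```

Under that assumption, L = 0.38 is about a 3% change per month, while L = 5 is roughly a 50% increase and L = 8 close to a doubling.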

If there are only two lineages under exponential growth, a exp(b t) and c exp(d t), with t the time in years, then their frequencies follow the logistic functions f(t) = (a/c) exp((b-d)t) / ((a/c) exp((b-d)t) + 1) and 1-f(t),

and augur aims to find the L=b-d that matches the frequency data (over one month?) for this lineage.
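The two-lineage model above can be checked numerically. In this sketch the parameters a, b, c, d and the six-week window are arbitrary illustrative choices, and the least-squares fit on logit-transformed frequencies is only one plausible way to recover L = b-d. It is not claimed to be what augur or the ncov workflow actually does.

```python
import math

def frequency(t, a, b, c, d):
    """Frequency of lineage 1 when the lineages grow as a*exp(b*t) and c*exp(d*t)."""
    ratio = (a / c) * math.exp((b - d) * t)
    return ratio / (ratio + 1)

def logit(f):
    """Log-odds of a frequency strictly between 0 and 1."""
    return math.log(f / (1 - f))

def fit_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den

# Hypothetical parameters: lineage 1 starts rare but grows faster (b - d = 5).
a, b, c, d = 1.0, 9.0, 20.0, 4.0
times = [week / 52 for week in range(7)]  # weekly time points, in years
freqs = [frequency(t, a, b, c, d) for t in times]

# logit(f(t)) = log(a/c) + (b - d) * t, so the slope estimates L.
estimated_L = fit_slope(times, [logit(f) for f in freqs])
print(estimated_L)  # close to b - d
```

With noise-free frequencies the slope matches b-d to floating-point precision. With real data, the estimate inherits whatever noise or bias is in the sampled frequencies.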

Indeed. But in fact the logistic growth value in my dataset for the subclade of interest is around 5 (it is 0.38 only in the global build, where there are far fewer sequences of this clade). It is in fact very similar to the value for B.1.1.7 and P.1, which is around 8 (data not shown, I might post a screenshot later). My question was more about whether this value could simply be caused by sampling bias. Thanks for chiming in, though!

Our sampling strategy, by the way, is to take a random selection of the positive community samples, distributed as evenly as possible across different geographical locations. We do occasionally sequence additional samples from outbreaks, though. In this case, I think that the tree I’m showing includes one small outbreak at the beginning of the timeline. I guess I would have to make a build that downsamples that outbreak to mitigate the effect it might have on the logistic growth calculation.
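As a rough illustration of that downsampling idea, the sketch below caps the number of outbreak samples before a build. The record layout, the `outbreak` flag and the function name are all hypothetical. Real metadata would need its own way of marking outbreak samples, and in a Nextstrain workflow the subsampling would normally be configured in the build itself.

```python
import random

def downsample_outbreak(records, max_outbreak_samples, seed=0):
    """Keep all community samples and at most max_outbreak_samples outbreak samples."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    outbreak = [r for r in records if r["outbreak"]]
    community = [r for r in records if not r["outbreak"]]
    kept = rng.sample(outbreak, min(max_outbreak_samples, len(outbreak)))
    return community + kept

# Made-up example: 10 community samples and 30 samples from one outbreak.
records = (
    [{"strain": f"community_{i}", "outbreak": False} for i in range(10)]
    + [{"strain": f"outbreak_{i}", "outbreak": True} for i in range(30)]
)
subset = downsample_outbreak(records, max_outbreak_samples=3)
print(len(subset))  # 13
```

Comparing the logistic growth values from builds with and without the cap would show how much the outbreak contributes to the signal.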

I just wanted to add a screenshot of the logistic growth scatter plot for all the variants of concern in our dataset. B.1.1.7 and P.1 are at about 8, which makes perfect sense.
There are a few other VOCs (B.1.351, B.1.427 and B.1.526) which are below zero for logistic growth. The tips of these lineages end in the past, though (there are no recent samples within these lineages).
B.1.429, however, has different sub-lineages at either positive or negative logistic growth values (the positive ones being around 5).
This is what prompted my initial question: is this a good reason to label a sub-clade as more transmissible or fit, or is it likely sampling noise?
Looking at the screenshot attached to this post, it seems that all the lineages that have tips in the past are below 0, which makes sense, but there is one sub-lineage in B.1.429 that is still sampled in the present and yet has a logistic growth value below zero. Does that mean it may be a less transmissible/fit sub-lineage?