A user just noted that for our HIV-1 M group genome, gag, pol, and env trees, the predicted ancestral nodes at the very root (chimpanzee to human transfer) and various nodes (such as the subtype B/D split) the dates are very different between the trees.
Where can I find more information on how those dates are inferred/calculated? Is the distance/time molecular clock rather linear in these data sets, or is there a curve to account for saturation of silent sites etc?
Hi @bfoley, assuming that these trees are produced using Nextstrain’s augur tools, we infer time trees with the augur refine command which then runs TreeTime’s time tree inference. You can learn more about how TreeTime works at the official documentation, but the paper by Sagulenko et al. is a great starting place to learn about the theory and implementation details for this tool.
Looking at the augur refine interface, you’ll find options to specify a clock rate and std dev of the clock rate, but by default augur refine will try to infer the clock rate from the given sample collection dates in the metadata, the genome sequences in given alignment, and the input phylogeny. You can find the inferred clock rate parameters in the “clock” key of the JSON file associated with the --output-node-data.
You’ll also find an augur refine argument called --coalescent that allows to control the coalescent prior used during the time tree inference. You can specify a coalescent rate as a number or ask augur to infer a scalar rate with the “opt” argument value or ask to infer a variable rate through time with a “skyline” argument value. The paper linked above provides more details about these options.
If the rate is underestimated, the date of the common ancestor will be overestimated. The date of the HIV-1 envelope gene M group common ancestor is 1866 in our tree. Envelope is the fastest evolving gene because of positive selection pressure to evade host antibody and T-cell immune selection pressures. The date of the HIV-1 polymerase gene M group common ancestor is 1904 in our tree. Pol gene is the most conserved, slowest evolving gene because it is critical to function well for virus replication and not under any neutralizing antibody selection pressure.
I can think if a few good reasons why all of the methods will underestimate the rate, and few reasons why any method would overestimate the rate (in protein coding nucleotide sequences), so given that, the overestimate of the time to ancestor, given all the other data we have from epidemiology etc., makes sense.
I’m glad that helped a bit, @bfoley! If you already have already calculated more reliable clock rates (e.g., with BEAST), it would be interesting to know whether your time trees from augur refine look more reasonable when you provide those clock rates via the --clock-rate and --clock-std-dev arguments. When you provide those arguments, the underlying TreeTime algorithm won’t try to infer the clock rate and will just use what you provide.
But anyway, the HIV-1 M group date/time vs distance issue is not influenced much by silent site saturation, it is due to selection pressures, non-homologous sites (repeated insertion/deletion events in a few key regions) and some other issues. Ste-specific rate corrections are not the whole solution. I am pretty sure that eliminateing the non-homologous columns/regions and using the rates calculation provided by methods/tools such as IQ-tree (-wsr write site rates option; or Gary Olsen’s DNArates program) to do that in a more unbiased way than just eye-balling the alignment and stripping columns that look bad to me,
The trade-off on doing things like this are that reviewers of the work will accuse us of “cherry picking” the data etc. and it is difficult to describe the data set in the Methods section of a publication. But indeed it would be very interesting here, where we have many other lines of evidence about dates and rates in this data set, to explore several methods, such as using the same data but fiddling with the rate parameter of fiddling with the input data set and letting TreeTime or BEAST calculate the rate from the “improved” data set.
The attached screenshot is from a paper by Vadim Puller, Pavel Sagulenko and Richard Neher. The plots here show that in theory the difference between true and inferred distances does not become a large problem until the true/inferred branch lengths are >0.1 or the true/inferred root to tip distances are >0.5 or so.
In my HIV-1 M group data sets, the IQ-tree inferred distances back to the M group root are 0.03 to 0.04 for pol gene and 0.2 - 0.25 for env gene, https://nextstrain.org/groups/LANL-HIV-DB/HIV/pol?m=div
ttps://nextstrain.org/groups/LANL-HIV-DB/HIV/env?m=div
So, it seems we should be well within the range where the true vs inferred distances and dates would be close to the same. But just the fact that the inferred distances for pol gene are so different than the inferred distances for env gene, should be a clue that we are off in estimating the distances. I am guessing the this difference is mainly due to the difference between evolution selection pressures acting on the viruses, and the site-specific rates of evolution being modeled in “synthetic” data sets.
In the screenshots of Pol and Env trees I have highlighted the South Korean subclade within subtype B which has been extensively studied in large part due to 20 South Korean hemophiliacs becoming infected with HIV-1 within months of the first HIV-1 cases being diagnosed in South Korea. We know the date of origin of this “outbreak” to be around 1990 - 1991. The nextstrain Env tree estimates it at 1945 and the nextstrain pol tree estimates it at 1974.
@XingguangLi The short answer is surely selection pressure on the coding regions, making the actual evolution of the virus differ a lot from expected evolution being modeled by phylogenetic inference programs. The issue isn’t so much that these inferred dates are wrong, we know a lot about the epidemiology of HIV-1 M group and have solid dates for many internal nodes in the phylogeny. The bigger issues are what are the best ways to correct for the selection pressures problem, and can we learn from the problem with HIV dates reconstruction, anything that will help us make corrections in other phylogenetic data sets.
HI @bfoley
the South Korean cluster is interesting. The typical divergence of the pol sequences from the base of the cluster is between 0.005 and 0.01. With an estimated rate of 2.8e-4 per year, it is not unsurprising that we get this too old Tmrca estimate. My suspicion would be that we are seeing this a strong time “dependent rate” were short term evolution is dominated by rapid T-cell escape that isn’t seen on larger scales due to rapid reversion.