I’m playing around with augur frequencies, and two questions came up:
I was wondering if and how internal nodes and branches are taken into account to calculate frequencies deeper in the tree, where actual tips are not available for frequency estimations. Are both internal nodes and branches taken into account? Is each of these elements counted as one occurrence of that trait being measured?
Is there any other parameter one can change to obtain more accurate frequency curves? By playing with narrow_bandwidth (0.005 - 0.5, picture below), I got distinct smoothed curves. What other parameters could be tuned to get more accurate plots?
I was wondering if and how internal nodes and branches are taken into account to calculate frequencies deeper in the tree, where actual tips are not available for frequency estimations. Are both taken into account? Is each of these elements counted as one occurrence of that trait being measured?
Only individual tips contribute to the frequency calculations. The frequency shown for a clade at any given point in time is the sum of the frequencies of its tips at that time. This means the frequencies panel can never show frequency estimates prior to the earliest tip in the tree. If you are interested in more details about how these frequencies are calculated, check out this draft how-to guide on how to calculate change in clade frequencies.
If you use the --include-internal-nodes flag, Augur will calculate the frequency of each internal node in the tree as the sum of the frequencies of its descendant tips at each timepoint. We don’t use this flag for any Nextstrain builds, but it can be helpful for custom downstream analyses.
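To make the "sum of tips" idea concrete, here is a minimal Python sketch of how an internal node's frequency could be derived from tip frequencies. It is not Augur's implementation; the toy tree, the tip names, and the frequency values are all made up for illustration.

```python
# Hypothetical toy tree: internal node name -> list of child names.
children = {
    "root": ["A", "tip3"],
    "A": ["tip1", "tip2"],
}

# Hypothetical tip frequencies at three timepoints.
# At each timepoint the tip frequencies sum to 1.
tip_frequencies = {
    "tip1": [0.5, 0.3, 0.1],
    "tip2": [0.25, 0.3, 0.4],
    "tip3": [0.25, 0.4, 0.5],
}


def node_frequencies(node):
    """Return per-timepoint frequencies for a node as the sum over its descendant tips."""
    if node not in children:
        # A tip contributes its own frequencies directly.
        return tip_frequencies[node]

    # An internal node is the element-wise sum of its children's frequencies.
    child_freqs = [node_frequencies(child) for child in children[node]]
    return [sum(values) for values in zip(*child_freqs)]


if __name__ == "__main__":
    for name in ("A", "root"):
        print(name, node_frequencies(name))
```

Internal nodes and branches never count as extra observations here; they only aggregate what the tips below them already contribute.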
Is there any other parameter one can change to obtain more accurate frequency curves? By playing with narrow_bandwidth (0.005 - 0.5, picture below), I got distinct smoothed curves. What other parameters could be tuned to get more accurate plots?
Any of the following parameters will allow you to tune the smoothness of your frequency estimates:
--pivot-interval: the number of time units between consecutive timepoints (pivots) at which frequencies are estimated. Increasing this value can smooth out the frequency curves by sampling at fewer timepoints.
--pivot-interval-units (added in Augur 10.3.0): the units of the --pivot-interval value, either “months” (the default) or “weeks”. When used with the --pivot-interval argument, this argument allows you to increase or decrease the number of timepoints when you estimate frequencies.
--narrow-bandwidth: controls the bandwidth of the initial KDE normal distribution calculated for each tip. Each tip gets a normal distribution centered on its date, and the frequency of each tip at a given time is the value of that tip’s normal distribution at that time, normalized so that the values for all tips sum to 1. (See the draft how-to guide mentioned above for more details.)
--wide-bandwidth: controls the bandwidth of an optional second KDE normal distribution that is added to the initial distribution by a proportion specified by the --proportion-wide argument. We typically use this second KDE distribution with a wider bandwidth to provide slightly longer tails to the initial normal distribution. We don’t use this argument often, in practice, but it could be helpful for tuning.
--proportion-wide: controls the proportion of the wide (second) KDE distribution that gets added to the narrow (initial) distribution for each tip at each timepoint.
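As a rough illustration of how these parameters interact, here is a minimal Python sketch of a KDE-style tip frequency estimate. It is a sketch under assumptions, not Augur's code: the function names are hypothetical, dates and bandwidths are assumed to be in decimal years, and the wide distribution is assumed to be mixed in as (1 - proportion_wide) * narrow + proportion_wide * wide.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def make_pivots(start, end, interval_months):
    """Evenly spaced timepoints (decimal years) from start to end, inclusive."""
    step = interval_months / 12
    count = int((end - start) / step + 1e-9)
    return [start + i * step for i in range(count + 1)]


def estimate_tip_frequencies(tip_dates, pivots, narrow_bandwidth,
                             wide_bandwidth, proportion_wide):
    """Return {tip: [frequency at each pivot]}, normalized to sum to 1 per pivot."""
    # Each tip gets a mixture of a narrow and a wide normal centered on its date.
    raw = {
        tip: [
            (1 - proportion_wide) * normal_pdf(pivot, date, narrow_bandwidth)
            + proportion_wide * normal_pdf(pivot, date, wide_bandwidth)
            for pivot in pivots
        ]
        for tip, date in tip_dates.items()
    }

    # Normalize across tips at each pivot so the frequencies sum to 1.
    totals = [sum(values[i] for values in raw.values()) for i in range(len(pivots))]
    return {
        tip: [v / total if total > 0 else 0.0 for v, total in zip(values, totals)]
        for tip, values in raw.items()
    }


if __name__ == "__main__":
    # Hypothetical tip dates in decimal years.
    tip_dates = {"tip1": 2020.0, "tip2": 2020.5, "tip3": 2021.0}
    pivots = make_pivots(2020.0, 2021.0, interval_months=3)
    frequencies = estimate_tip_frequencies(
        tip_dates, pivots,
        narrow_bandwidth=0.1, wide_bandwidth=0.25, proportion_wide=0.2,
    )
    for tip, values in frequencies.items():
        print(tip, [round(v, 3) for v in values])
```

In this sketch, a smaller narrow_bandwidth makes each tip's contribution more sharply peaked around its date, a larger proportion_wide gives each tip longer tails, and a larger pivot interval simply evaluates the same curves at fewer timepoints.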
Whether any specific parameters provide “the most accurate” estimate of frequencies is an open question. For influenza A/H3N2 analyses, we’ve tuned these parameters to optimize the accuracy of our forecasting models. For SARS-CoV-2, I think you would need to establish an independent baseline of truth to tune your parameters against.
When I tried looking at the frequencies.json files I didn’t understand anything, but now with your guide it became very clear, and I could easily compute some tip frequencies for the sequences I am adding dynamically (in javascript) to the tree and add the frequency panel to my app.
Thanks to the “normalize frequencies” button in auspice I don’t really care that my tip frequencies are just some Gaussians with fixed std and mean at the tip date, and that they don’t sum to 1.