Hello
I have problems interpreting the entropy data. When I first saw it I assumed it ranges from 0 to 1, but in my analyses I find higher values (1.1, 1.2, 1.3). I have read the posts on Nextstrain and some papers, but in general they all limit themselves to saying that entropy is a measure of diversity.
Entropy is Shannon entropy computed from the normalized counts of the possible nucleotides or codons at a given position, measuring the “uncertainty” inherent in that position.
Events represent a count of changes in the nucleotide or codon at that position across the (displayed) (sub-)tree. They rely on the ancestral state reconstruction to infer where these changes occurred within the tree.
(The docs are here and the code which computes it is here.)
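If it helps to make the "events" idea concrete, here is a minimal sketch, not the actual Auspice code: the tree structure and the `count_events` name are made up for illustration. Once ancestral states have been reconstructed, events are just the branches along which the state at the position changes:

```python
def count_events(node, parent_state=None):
    """Count state changes at one position across a (sub-)tree.

    `node` is assumed to be a dict like {"state": "R", "children": [...]},
    where "state" is the reconstructed nucleotide/codon at this position.
    """
    events = 0
    state = node["state"]
    if parent_state is not None and state != parent_state:
        events += 1  # a change occurred on the branch leading to this node
    for child in node.get("children", []):
        events += count_events(child, state)
    return events

# Toy tree: root is R; one subtree changes to H, and one of its tips changes to L.
tree = {
    "state": "R",
    "children": [
        {"state": "R", "children": []},
        {"state": "H", "children": [
            {"state": "H", "children": []},
            {"state": "L", "children": []},
        ]},
    ],
}
print(count_events(tree))  # 2 events: R->H and H->L
```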
As an example, looking at a recent nCoV build at spike position 371 we have an entropy of 1.056, which is the sum of the per-residue terms for each of the 4 observed residues:

- R (1177 of 3199 total tips): 0.368
- H (1378 / 3199): 0.363
- L (1 / 3199): 0.00252
- P (643 / 3199): 0.322
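For anyone who wants to check the arithmetic, here is a quick sketch reproducing that sum. The counts are from the example above; the use of the natural log is an assumption, but it matches the reported per-residue values:

```python
import math

# Counts of each observed residue at spike 371 in the example above.
counts = {"R": 1177, "H": 1378, "L": 1, "P": 643}
total = sum(counts.values())  # 3199 tips

# Each term is -p * ln(p), where p is the residue's frequency among tips.
terms = {aa: -(n / total) * math.log(n / total) for aa, n in counts.items()}
for aa, term in terms.items():
    print(f"{aa}: {term:.3g}")  # R: 0.368, H: 0.363, L: 0.00252, P: 0.322
print(f"entropy = {sum(terms.values()):.3f}")  # entropy = 1.056
```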
Hi James. Is each of these entropy values normalized and then added together (meaning the sum can be >1)? Or how are these entropy values normalized? Also, are they normalized to the whole genome / the protein-coding regions of the genome, or to the ORF they come from? Thanks!
Hi @mathissweet - each position (AA or nt) is computed independently. For a given position, the count of each observed residue/nuc is normalized by the number of (visible) tips, the entropy term for that residue/nuc is calculated, and we report the sum of these terms for that position. Since the natural log is used, the maximum possible value is ln(k) for k observed states (e.g. ln(4) ≈ 1.386 for four residues), which is why sums above 1 can occur. Code here if that helps.
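Put as a sketch of that description (again not the actual Auspice code; `position_entropy` and the visibility mask are made up for illustration):

```python
import math

def position_entropy(column, visible=None):
    """Entropy of one alignment column.

    `column` is a sequence of residues/nucs, one per tip; `visible`
    optionally masks which tips are currently shown in the tree.
    """
    if visible is not None:
        column = [state for state, shown in zip(column, visible) if shown]
    total = len(column)  # number of (visible) tips
    counts = {}
    for state in column:
        counts[state] = counts.get(state, 0) + 1
    # Normalize each count by the number of visible tips, then sum -p*ln(p).
    return sum(-(n / total) * math.log(n / total) for n in counts.values())

print(position_entropy(["R"] * 1177 + ["H"] * 1378 + ["L"] + ["P"] * 643))
# ~1.056, matching the spike 371 example
```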