How is Shannon entropy calculated per codon?

I would like to know how Shannon entropy is calculated for each codon in Nextstrain: which formula is used, and which characteristics of the codon it is based on. How is the uncertainty for each codon calculated, and from what?
Is it, for example, based on the number of mutations in each codon?
Please help me.

It’s normalized Shannon entropy of the valid nucleotides / codons at each site. Code is here if you’d like to explore more.

Thanks.
Could you download ncov_global.json from https://github.com/Developercovid/auspise.git and view it at https://auspice.us/?
For example, for codon 220 in protein N, the number of mutations (events) is 1 and the entropy of this codon is 0.690. But when I calculate the entropy for this codon with the formula you sent, the result is 0.055. Why?

@Mahan.iz I also want to know the formula of this. Thanks for sharing wonderful info with me. This link was very helpful for me.

@Mahan.iz Hi, you have 47 sequences with the mutation; the remaining 102 - 47 = 55 sequences don't have it. When picking a sequence uniformly at random, the probability that it has the mutation is p = 47/102,

and the entropy is -p*ln(p) - (1-p)*ln(1-p) = 0.69006827928
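
To make the arithmetic above easy to check, here is a minimal Python sketch. It is not the Auspice code; `shannon_entropy` is a hypothetical helper name, and the sketch assumes the natural logarithm and that each state's frequency is its count divided by the total number of sequences at the site.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) from per-state sequence counts.

    `counts` holds one entry per observed state at a site, e.g. the
    number of sequences with and without a given mutation.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:  # states with zero count contribute nothing
            p = count / total
            entropy -= p * math.log(p)
    return entropy


# Numbers from the reply above: 47 of 102 sequences carry the mutation.
h = shannon_entropy([47, 102 - 47])
print(f"{h:.3f}")  # 0.690
```

For comparison, `shannon_entropy([1, 101])` comes out at roughly 0.055. That matches the value in the question, which would fit with the single mutation event having been used as the count instead of the 47 sequences that carry the mutation, although that is only a guess at what happened.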

Hello,

I am following up here to receive some clarification on the formula Nextstrain uses for calculating normalized Shannon entropy. Would you be willing to confirm whether the below formula is the one coded in the linked code?

where i indexes the amino acid identities observed at the position, ranging from 1 to k (the number of distinct amino acids observed), and p is the frequency of the amino acid: p = (count of sequences containing that amino acid at the position) / (total tips in the tree at that position).

Would you also be willing to provide an appropriate citation for this specific approach so I can learn more? Thanks in advance for your help.