Comparable mutation metrics?

Greetings,

To the experts this should be a simple question.
When I see the mutation rate estimate for influenza A stated as 4.56e-3 subs per site per year, while for SARS-CoV-2 it is stated as 35.145 subs per year, are these the same measurement units?

If not, how can I compare them?
Thank you

Hi @D600

Sorry for the late reply.

This is a historical artefact - some Nextstrain datasets used one metric and some another metric, so now it’s a bit confusing.

They differ in that “subs per site per year” is “subs per year” divided by the length of the sequence (number of sites). They are not the same, but you can convert from one to the other: divide subs/year by the sequence length to get subs/site/year, or multiply subs/site/year by the sequence length to get subs/year. Once you have converted both to the same units, they can be compared.

The sequence length is of course different for different organisms, and the main inconvenience is finding it. Typically, for Nextstrain datasets there should be a workflow repo on GitHub containing either a reference.fasta or a reference.gb file with the reference genome, whose length can then be measured (e.g. using seqkit).

For example, the length of the reference sequence for the H3N2 HA dataset (can be found here) is 1737:

$ curl -so reference.fasta https://raw.githubusercontent.com/nextstrain/seasonal-flu/master/config/h3n2/ha/reference.fasta 
$ seqkit stats reference.fasta 
file             format  type  num_seqs  sum_len  min_len  avg_len  max_len
reference.fasta  FASTA   DNA          1    1,737    1,737    1,737    1,737
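
If seqkit is not at hand, the same length can be counted with a few lines of Python. This is only a minimal sketch: the function name `fasta_lengths` and the toy record below are made up for illustration, and it assumes a plain FASTA file (header lines starting with `>`, sequence possibly wrapped over several lines).

```python
def fasta_lengths(text):
    """Return {record_id: sequence_length} for FASTA-formatted text."""
    lengths = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record id is the first word after '>'
            header = line[1:].split()
            current = header[0] if header else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence may be wrapped; add up every line of the record
            lengths[current] += len(line)
    return lengths


# Toy record (not a real genome), just to show the call
example = ">toy_reference some description\nACGTACGT\nACGT\n"
print(fasta_lengths(example))
```

For a downloaded file, something like `fasta_lengths(open("reference.fasta").read())` should give the same number that seqkit reports in the sum_len column.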

So if we take as an example the rate of 4.56e-3 subs/site/year, then it’s

4.56e-3 subs/site/year * 1737 sites = 7.92072 subs/year

(I don’t actually remember what the rate for H3N2 is; I’m just taking the number from your example.)

For the inverse, the SARS-CoV-2 Wuhan reference (here) is 29903 sites long:

$ curl -so reference.fasta https://raw.githubusercontent.com/nextstrain/ncov/master/defaults/reference_seq.fasta
$ seqkit stats reference.fasta 
file             format  type  num_seqs  sum_len  min_len  avg_len  max_len
reference.fasta  FASTA   DNA          1   29,903   29,903   29,903   29,903

So

35.145 subs/year / 29903 sites = 1.175300137e-3 subs/site/year
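
The two conversions above can also be written as a pair of small helper functions. This is just a sketch of the arithmetic; the function names are made up for illustration, and the rates and lengths are the ones used in this thread.

```python
def per_site_to_per_genome(rate_per_site, seq_length):
    """subs/site/year -> subs/year for a sequence of seq_length sites."""
    return rate_per_site * seq_length


def per_genome_to_per_site(rate_per_genome, seq_length):
    """subs/year -> subs/site/year for a sequence of seq_length sites."""
    if seq_length <= 0:
        raise ValueError("sequence length must be positive")
    return rate_per_genome / seq_length


# Numbers from this thread (the flu rate is the asker's example value)
flu_per_year = per_site_to_per_genome(4.56e-3, 1737)
sc2_per_site = per_genome_to_per_site(35.145, 29903)

print(f"H3N2 HA:    {flu_per_year:.3f} subs/year")
print(f"SARS-CoV-2: {sc2_per_site:.3e} subs/site/year")
```

The length passed in should be the length of the sequence the rate was estimated on, i.e. the reference used by that particular dataset.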

We can see that flu (here, the HA segment) mutates “faster” by the per-site metric, but SC2 has a much longer genome, so it accumulates more mutations overall per unit of time. Both metrics are useful because they give different perspectives.