Hi @D600
Sorry for the late reply.
This is a historical artefact - some Nextstrain datasets used one metric and some used the other, so now it’s a bit confusing.
They differ in that “subs per site per year” is “subs per year” divided by the length of the sequence (number of sites). They are not the same, but you can convert from one to the other: multiply subs/site/year by the sequence length to get subs/year, or divide subs/year by the sequence length to get subs/site/year. Once you have converted them to the same units, they can be compared.
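In case it helps, the conversion can be sketched as two small shell functions. The function names are made up for illustration (they are not part of any Nextstrain tool), and the sequence length has to be supplied by you:

```shell
# Hypothetical helpers, not part of Nextstrain or seqkit.
# Usage: <function> RATE SEQUENCE_LENGTH

# subs/site/year -> subs/year: multiply by the number of sites
per_site_to_per_year() {
  awk -v rate="$1" -v len="$2" 'BEGIN { printf "%.6g\n", rate * len }'
}

# subs/year -> subs/site/year: divide by the number of sites
per_year_to_per_site() {
  awk -v rate="$1" -v len="$2" 'BEGIN { printf "%.6g\n", rate / len }'
}

per_site_to_per_year 4.56e-3 1737    # prints 7.92072
per_year_to_per_site 35.145 29903    # prints 0.0011753
```

The output is rounded to 6 significant digits, which is probably more precision than the rate estimates themselves have.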
The sequence length is of course different for different organisms. And the main inconvenience is how to find it. Typically for Nextstrain datasets there should be a workflow repo on GitHub and there should be either reference.fasta or reference.gb file containing the reference genome, which can then be measured (e.g. using seqkit).
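If seqkit is not installed, a rough alternative for a FASTA file is to count the sequence characters with awk. This is only a sketch: `fasta_length` is a made-up name, it sums over all records in the file (so it matches the genome length only when the file holds a single reference sequence), and it does not handle GenBank files:

```shell
# Hypothetical helper: total number of sequence characters in a FASTA file.
# Header lines (starting with ">") are skipped; spaces, tabs and carriage
# returns are stripped so Windows line endings do not inflate the count.
fasta_length() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Example call (assumes the file exists in the current directory):
#   fasta_length reference.fasta
```

For a single-record file the result should agree with the sum_len column reported by seqkit stats.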
For example, the length of the reference sequence for the H3N2 HA dataset (can be found here) is 1737:
$ curl -so reference.fasta https://raw.githubusercontent.com/nextstrain/seasonal-flu/master/config/h3n2/ha/reference.fasta
$ seqkit stats reference.fasta
file format type num_seqs sum_len min_len avg_len max_len
reference.fasta FASTA DNA 1 1,737 1,737 1,737 1,737
So if we take as an example the rate of 4.56e-3 subs/site/year, then it’s
4.56e-3 subs/site/year * 1737 sites = 7.92072 subs/year
(I don’t actually remember what the rate for H3N2 is, I’m just taking the number from your example)
For the inverse, the SARS-CoV-2 Wuhan reference (here) is 29903 sites long:
$ curl -so reference.fasta https://raw.githubusercontent.com/nextstrain/ncov/master/defaults/reference_seq.fasta
$ seqkit stats reference.fasta
file format type num_seqs sum_len min_len avg_len max_len
reference.fasta FASTA DNA 1 29,903 29,903 29,903 29,903
So
35.145 subs/year / 29903 sites = 1.175300137e-3 subs/site/year
We can see that flu mutates “faster” by the per-site metric, but SC2 has a much longer genome, so it accumulates more mutations per unit of time overall. Both metrics are useful, because they give different perspectives.