How do we establish a private mutation threshold for new datasets?

Hi everyone,

I’m currently testing the creation of Nextclade datasets for some viruses, and I’m trying to understand how to choose appropriate values for the typical and cutoff fields in the privateMutations QC section of pathogens.json.
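For concreteness, the fragment I'm referring to looks roughly like this (the values here are placeholders, and the exact field names should be checked against the pathogens.json of an existing dataset for your Nextclade version):

```json
{
  "qc": {
    "privateMutations": {
      "enabled": true,
      "typical": 8,
      "cutoff": 24
    }
  }
}
```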

From the documentation, I understand the purpose and logic behind this QC metric and how the thresholds work. However, what I’m struggling with is how these values should actually be defined in practice.

Is there any established convention or recommended approach for determining them?

As a first attempt, I ran Nextclade on a set of publicly available genomes for the virus I’m working on (which presumably includes a mixture of high- and low-quality sequences) and inspected the “total private mutations” metrics (reversion, labeled, unlabeled, and all types) using density plots. This produced some interesting patterns, but I’m unsure whether this is considered a good or standard method.
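To make that concrete, here is a minimal sketch of one way to turn those per-sequence counts into candidate threshold values using percentiles. The percentile choices are my own heuristic, not an established convention, and the Nextclade TSV column names (e.g. something like "privateNucMutations.totalPrivateSubstitutions") should be checked against the output of your Nextclade version:

```python
# Percentile heuristic for deriving candidate "typical" and "cutoff" values
# from per-sequence private-mutation counts. The percentile choices (median
# for "typical", 95th for "cutoff") are assumptions, not a Nextclade rule.
from statistics import quantiles

def suggest_thresholds(counts, typical_pct=50, cutoff_pct=95):
    """Return (typical, cutoff) as rounded percentiles of the counts."""
    qs = quantiles(counts, n=100, method="inclusive")
    return round(qs[typical_pct - 1]), round(qs[cutoff_pct - 1])

# Hypothetical per-sequence totals, as one might read them from the
# private-mutation columns of Nextclade's TSV output.
counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 25]
typical, cutoff = suggest_thresholds(counts)
print(typical, cutoff)
```

In practice one would read the counts from nextclade.tsv and eyeball the density plot alongside the suggested numbers rather than trusting the percentiles blindly.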

Does anyone have suggestions or best practices for deriving these threshold values?

Best,

Hi Filipe @fdezordi

The QC in Nextclade is a set of empirical measures. They are more or less a convenience feature: instead of looking at all mutations in all sequences and analyzing patterns by hand, they provide a numeric summary for quick assessment. One downside is that a large amount of information and nuance is lost.

How QC metrics are configured and interpreted is left to dataset authors and their users. There are no established conventions or best practices for finding these settings, and I don't think there can be, because biology is so diverse and the goals of users vary a lot.

Your intuition was right: looking into the statistics of existing sequences is a good way to go. At least this is what we try to do as well. For us, it is mostly a mixture of prior statistics, a healthy portion of expert judgment from people knowledgeable about a particular virus, and some experimenting and looking at the results over time. Most of the intuition was built during the SARS-CoV-2 pandemic as we observed the rapid evolution of the virus.

In general, feel free to experiment and find settings that make sense for your particular virus and your use case (or the use case you are tailoring for your users). The idea is that the resulting scores provide useful hints, so that people can focus their attention on problematic sequences and study them further. You could document your findings in the dataset’s readme file to help people understand your setup (optionally including links to relevant git repos, scripts, images with charts, etc.). It does not have to be spot on from the first try: the settings can be changed over time (e.g. as the virus evolves, or as new evidence comes in), but we recommend clearly communicating all changes to users in the changelog document.

Don’t hesitate to open a more concrete discussion thread, with numbers, for a specific virus. We don’t have experts on all viruses, but there’s a chance someone in the broader Nextstrain community is.

If you are considering contributing your dataset(s) to the nextclade_data repo, feel free to open a draft PR early, so that people can discuss things there as well. Datasets don’t have to be perfect - they can be improved over time and collectively.


Hi Ivan,

Sorry for my late reply.

Thank you very much for the feedback — it’s great to know that my intuition was correct. I’m currently working with a team to create datasets for several viruses, and we submitted a PR to nextclade_data at the end of last week with the first dataset (for Zika virus).

Thank you again for the feedback and for providing this tool. It has been extremely helpful for us in Brazil in analysing multiple viruses.

Best,

Filipe