How to deal with samples with way too many mutations

Sometimes we get contextual samples (pulled in because they are closely related to my own samples) that carry an unusually large number of mutations. All samples here with > 100 mutations were collected in 2020, which is not biologically plausible and must be due to sequencing/assembly errors.

I know the Nextstrain team collects these and puts them in exclude.txt, and I could do that manually too, but we have too many automatically built trees for me to keep track of. I also tried setting clock_filter_iqd down to 3, which led to the tree above. Should I keep tuning clock_filter_iqd down to a smaller number, or should I try something else?

I’d be intrigued to hear how others would solve this issue. Personally, I think there are a few possible solutions, applied either before or after tree construction.

  1. Integrate Nextclade (or similar) into the workflow to call mutations against a common reference, then exclude sequences whose mutation count exceeds a threshold during filtering.
  2. Identify long branches after tree construction, then prune them (e.g. the R package ape has the drop.tip function; also check out gotree).
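For option #2, here is a minimal sketch using Biopython’s Bio.Phylo (ape’s drop.tip in R or gotree would do the same job): drop any terminal branch longer than a divergence cutoff. The sample names, the toy tree, and the 0.5 cutoff are made up for illustration.

```python
from io import StringIO

from Bio import Phylo  # Biopython; assumed available in the environment


def prune_long_tips(tree, max_branch_length):
    """Remove terminal branches longer than max_branch_length, in place."""
    long_tips = [
        tip
        for tip in tree.get_terminals()
        if tip.branch_length is not None and tip.branch_length > max_branch_length
    ]
    for tip in long_tips:
        tree.prune(tip)
    return tree


# Toy tree in which sampleC sits on an implausibly long branch:
tree = Phylo.read(
    StringIO("((sampleA:0.01,sampleB:0.02):0.01,sampleC:0.9);"), "newick"
)
prune_long_tips(tree, max_branch_length=0.5)
print([tip.name for tip in tree.get_terminals()])  # ['sampleA', 'sampleB']
```

A caveat on this approach: a fixed divergence cutoff is cruder than treetime’s clock filter, which judges branches relative to the expected divergence for their sampling date, so the cutoff needs to be chosen with the dataset in mind.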

I personally think this is a data quality issue and not related to tree construction, so option #1. In other words you need to exclude these sequences before constructing your tree. The augur filter sub command takes in a tsv file, it’s possible to include a variable that contains number of mutations. So it’s possible to dynamically filter these out using an expression such as: –query “mutations > {threshold}”.