How to deal with samples with way too many mutations

Sometimes we get contextual samples (pulled in because they are closely related to my own samples) that carry an unusually large number of mutations. All samples here with > 100 mutations were collected in 2020, which is not biologically plausible and must be due to sequencing/assembly errors.

I know the Nextstrain team collects these and puts them in exclude.txt, and I could do that manually too, but we have too many automatically built trees for me to keep track of. I also tried setting clock_filter_iqd down to 3, which led to the tree above. Should I keep tuning clock_filter_iqd down to a smaller number, or should I try something else?

I’d be intrigued to hear how others would solve this issue. Personally, I think there are a few possible solutions, applied either before or after tree construction.

  1. Integrate Nextclade (or similar) into the workflow to call mutations against a common reference, then exclude sequences whose mutation count exceeds a threshold during filtering.
  2. Identify long branches after tree construction, then prune them (e.g. the R package ape has the drop.tip function; also check out gotree).
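For option #2, here is a minimal sketch using Biopython’s Bio.Phylo (ape’s drop.tip in R or gotree would do the same job): drop any terminal branch longer than a divergence cutoff. The sample names, the toy tree, and the 0.5 cutoff are made up for illustration.

```python
from io import StringIO

from Bio import Phylo  # Biopython; assumed available in the environment


def prune_long_tips(tree, max_branch_length):
    """Remove terminal branches longer than max_branch_length, in place."""
    long_tips = [
        tip
        for tip in tree.get_terminals()
        if tip.branch_length is not None and tip.branch_length > max_branch_length
    ]
    for tip in long_tips:
        tree.prune(tip)
    return tree


# Toy tree in which sampleC sits on an implausibly long branch:
tree = Phylo.read(
    StringIO("((sampleA:0.01,sampleB:0.02):0.01,sampleC:0.9);"), "newick"
)
prune_long_tips(tree, max_branch_length=0.5)
print([tip.name for tip in tree.get_terminals()])  # ['sampleA', 'sampleB']
```

A caveat on this approach: a fixed divergence cutoff is cruder than treetime’s clock filter, which judges branches relative to the expected divergence for their sampling date, so the cutoff needs to be chosen with the dataset in mind.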

I personally think this is a data quality issue and not related to tree construction, so option #1. In other words you need to exclude these sequences before constructing your tree. The augur filter sub command takes in a tsv file, it’s possible to include a variable that contains number of mutations. So it’s possible to dynamically filter these out using an expression such as: –query “mutations > {threshold}”.