Exclusion of forced sequences after augur filter step - seasonalflu build

jessagius · January 10, 2023, 10:04pm

Hey Nextstrain team! I was hoping to get some help with the flu build.

On visualising the tree, I realised that several of my H1N1 and H3N2 sequences were getting removed.
I then force-included the excluded sequences in the build, only to find that they were still being excluded. Looking at the log files, it seemed that they were being forced into the build but then removed after the filter step, at alignment. I inspected the majority of the excluded samples and saw that they had a string of roughly 20 or more Ns at the start of the sequence. As I do not have much experience with coding and Nextstrain, I tried to disable polytomies, resolutions and other flags I thought might be causing this, still with no luck. I would love to get some help on how to include these samples. They have been aligned, and a tree built with ggtree, with no issues. I know they are considered “problematic”, but they have passed our QC criteria, so I would still like to have the option to include them in the tree.

Thank you!

Kind regards,
Jess Agius

quietjen · January 11, 2023, 6:31pm

Sorry for the confusion! I think the skip-diagnostics described here might help

Why do my sequences end up in excluded_by_diagnostics.txt? - #2 by joverlee

This should skip the QC length check (after dropping prefixed N regions), but do let us know if it works or not

jessagius · January 19, 2023, 11:06pm

Hi Jen,

Thank you so much for your quick reply and help.

Unfortunately, that did not work. Though, re-reading my post, I meant to say that the sequences got dropped at the alignment step, so there is no issue with the filter; they all pass the augur filter phase in the flu build.

Would love any insight.

Thank you again for your help.
Jess

jessagius · January 30, 2023, 7:32am

Hi,

Just revisiting this post. I’ve gone through each step, checking what files make it through. It seems that the files are being dropped after scripts/codon_align.py is run. Is there something in this Python script that may cause my sequences to be filtered out?

Cheers,
Jess

victorlin · January 30, 2023, 5:36pm

Hi @jessagius,

I’m not familiar with the seasonal flu build, but took a quick glance at codon_align.py. It looks like a sequence can get dropped for any of these reasons:

If it does not align (src)
If the amino acid alignment (1) fails, (2) contains more than 5 of * or X characters, or (3) the reference AA sequence contains more than 5 - characters (src)
If the aligned sequence is not the same length as the reference sequence (src)

Furthermore, since you’ve noted that this is not an issue with augur filter, also note that the force-include options to augur filter (--include, --include-where) do not apply to the codon_align.py script.
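
To make the drop conditions listed above easier to check by hand, here is a rough sketch of them as a standalone function. This is not the code in codon_align.py: the function name `drop_reason`, its arguments, and the choice to count * and X together are all my assumptions, and it expects you to pass in strings that have already been aligned (or None where alignment failed).

```python
# Hypothetical sketch of the drop conditions described above.
# It does NOT reproduce codon_align.py; names and inputs are assumptions.

def drop_reason(aligned_seq, aligned_ref, aa_seq, aa_ref, max_bad=5):
    """Return a short reason if the sequence would be dropped, else None.

    aligned_seq / aligned_ref: nucleotide alignment strings (None if the
        alignment failed).
    aa_seq / aa_ref: amino acid alignment strings (None if that alignment
        failed).
    max_bad: threshold of 5 taken from the description above.
    """
    if aligned_seq is None:
        return "did not align"
    if aa_seq is None or aa_ref is None:
        return "amino acid alignment failed"
    # Assumption: * and X are counted together against one threshold.
    if aa_seq.count("*") + aa_seq.count("X") > max_bad:
        return "too many * or X characters"
    if aa_ref.count("-") > max_bad:
        return "too many gaps in reference AA sequence"
    if len(aligned_seq) != len(aligned_ref):
        return "length differs from reference"
    return None


if __name__ == "__main__":
    # A run of Ns can translate to X, which may trip the second check.
    print(drop_reason("NNNNNNNNNNNNNNNNNNATG", "ATGAAACCCGGGTTTAAAATG",
                      "XXXXXXM", "MKPGFKM"))
```

Running each excluded sequence through something like this, with thresholds matched to whatever the real script uses, might show which of the conditions is responsible.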

– Victor
