Exclusion of forced sequences after augur filter step - seasonalflu build

Hey Nextstrain team! I was hoping to get some help with the flu build.

On visualising the tree, I realised that several of my H1N1 and H3N2 sequences were being removed.
I then force-included the excluded sequences in the build, but they were still being excluded. Looking at the log files, it seemed that they were being forced into the build but then removed after the filter step, at alignment. I inspected the majority of the excluded samples and saw that they had a run of approximately 20 or more Ns at the start of the sequence. As I do not have much experience with coding or Nextstrain, I tried to disable polytomies, resolutions and other flags I thought might be causing this, still with no luck. I would love to get some help on how to include these samples. They have been aligned and a tree built using ggtree with no issues. I know they are considered “problematic”, but they have passed our QC criteria, so I would still like to have the option to include them in the tree.

Thank you!

Kind regards,
Jess Agius

Sorry for the confusion! I think the skip-diagnostics described here might help

This should skip the QC length check (after dropping prefixed N regions), but do let us know if it works or not

Hi Jen,

Thank you so much for your quick reply and help.

Unfortunately, that did not work. However, re-reading my post, I meant to say that the sequences were dropped at the alignment step, so there is no issue with the filter; they all pass the augur filter phase in the flu build.

Would love any insight.

Thank you again for your help.
Jess

Hi,

Just revisiting this post. I’ve gone through each step, checking what files make it through. It seems that the files are being dropped after scripts/codon_align.py is run. Is there something in this Python script that may cause my sequences to be filtered out?

Cheers,
Jess

Hi @jessagius,

I’m not familiar with the seasonal flu build, but took a quick glance at codon_align.py. It looks like a sequence can get dropped for any of these reasons:

  1. If it does not align (src)
  2. If the amino acid alignment (1) fails, (2) contains more than 5 * or X characters, or (3) has a reference AA sequence with more than 5 - characters (src)
  3. If the aligned sequence is not the same length as the reference sequence (src)
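
To make those checks concrete, here is a rough sketch of the same logic in plain Python. This is not the code from codon_align.py: the function name `drop_reason`, its four arguments, and the constant `MAX_BAD` are all made up for illustration, and it assumes you already have the aligned nucleotide and amino acid strings for one sample and for the reference.

```python
MAX_BAD = 5  # threshold taken from the list above; the name is hypothetical


def drop_reason(aligned_nt, ref_nt, aligned_aa, ref_aa):
    """Return a short reason if a record would be dropped, else None.

    Illustrative only: the inputs are assumed to be already-aligned strings
    (or None when an alignment step failed). The real script may differ.
    """
    # Reason 1: the nucleotide sequence did not align at all.
    if aligned_nt is None:
        return "sequence did not align"
    # Reason 2, part 1: the amino acid alignment failed.
    if aligned_aa is None or ref_aa is None:
        return "amino acid alignment failed"
    # Reason 2, part 2: too many stop (*) or unknown (X) residues in the sample.
    if sum(aligned_aa.count(c) for c in "*X") > MAX_BAD:
        return "more than 5 * or X characters in the sample AA sequence"
    # Reason 2, part 3: too many gap characters in the aligned reference.
    if ref_aa.count("-") > MAX_BAD:
        return "more than 5 - characters in the reference AA sequence"
    # Reason 3: aligned sample length differs from the reference length.
    if len(aligned_nt) != len(ref_nt):
        return "aligned length differs from reference length"
    return None
```

Calling `drop_reason(...)` on one of your excluded samples should point at the check that fails. One guess worth testing: a leading run of more than 20 Ns covers roughly 7 codons, so if those codons come out as X in the translation (I have not verified how the script handles them), check 2 would be the one rejecting the sequence.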

Furthermore, since you’ve noted that this is not an issue with augur filter, also note that the force-include options to augur filter (--include, --include-where) do not apply to the codon_align.py script.
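
If it helps to confirm exactly which records disappear at that step, a small script like the one below can compare the sequence IDs in the FASTA going into codon_align.py with the FASTA coming out. It is only a sketch: the helper names `fasta_ids` and `dropped_between` are hypothetical, the file paths are up to you (I don't know the exact file names in your build), and it assumes the ID is the first whitespace-delimited token on each header line.

```python
def fasta_ids(path):
    """Collect record IDs from a FASTA file.

    Assumes the ID is the first whitespace-delimited token after '>'.
    """
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def dropped_between(before_path, after_path):
    """Return a sorted list of IDs present in before_path but not in after_path."""
    return sorted(fasta_ids(before_path) - fasta_ids(after_path))
```

For example, `dropped_between("input.fasta", "output.fasta")` (placeholder paths) would list the strains lost between those two files, which you can then check against the reasons above.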

– Victor