IQTREE error: Some sequences (see above) are problematic, please check your alignment again

My build stops at the iqtree stage with the message: “ERROR: Some sequences (see above) are problematic, please check your alignment again”

Job 3: Building tree
Reason: Missing output files: nextstrain_results/tree_raw.nwk; Input files updated by another job: nextstrain_results/aligned.fasta

        augur tree             --alignment nextstrain_results/aligned.fasta             --output nextstrain_results/tree_raw.nwk             --method iqtree             --override-default-args             --substitution-model auto             --nthreads 10             --tree-builder-args "-B 1000"
Building a tree via:
	iqtree -ntmax 10 -s nextstrain_results/aligned-delim.fasta -B 1000 > nextstrain_results/aligned-delim.iqtree.log
	Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
	Mol. Biol. Evol., 32:268-274.

Conducting a model test... see 'nextstrain_results/aligned-delim.iqtree.log' for the result. You can specify this with --substitution-model in future runs.

ERROR: Shell exited 2 when running: iqtree -ntmax 10 -s nextstrain_results/aligned-delim.fasta -B 1000 > nextstrain_results/aligned-delim.iqtree.log
Command output was:
  ERROR: Some sequences (see above) are problematic, please check your alignment again

ERROR: Command '['/bin/bash', '-c', 'set -euo pipefail; iqtree -ntmax 10 -s nextstrain_results/aligned-delim.fasta -B 1000 > nextstrain_results/aligned-delim.iqtree.log']' returned non-zero exit status 2.
Please see the log file for more details: nextstrain_results/aligned-delim.iqtree.log

Building original tree took 0.20879316329956055 seconds

There are many warnings from the alignment step. For example:
WARNING: this insertion was caused due to 'N's or '?'s in provided sequences

But the sequences passed all earlier filtering and processing steps. Do I need to inspect the sequences manually, or is there a way to avoid this error?


Hi @jonr, one quick way to check your alignment for sequences with problematic characters is to run:

augur index --sequences nextstrain_results/aligned.fasta --output nextstrain_results/alignment_index.tsv

The augur index command produces a table with one row per sequence, giving counts of standard nucleotide characters, other valid IUPAC characters, ambiguous characters (“-”), and other invalid characters. You can filter this table by those counts to find potentially problematic sequences. IQ-TREE will not accept sequences containing characters that are not valid IUPAC codes, but it should handle the other ambiguous characters.
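
In case it helps, here is a minimal sketch of filtering that table with plain Python. The column names (strain, A, C, G, T, invalid_nucleotides) are my assumption about the augur index header, so check them against the first line of your own TSV. The function name find_problematic and the min_acgt threshold are hypothetical, made up for this example.

```python
import csv

# Assumed column names from the augur index header; adjust to match your file.
ACGT_COLUMNS = ("A", "C", "G", "T")
INVALID_COLUMN = "invalid_nucleotides"
NAME_COLUMN = "strain"


def find_problematic(index_path, min_acgt=1):
    """Return (name, reason) pairs for sequences worth a closer look.

    A sequence is flagged if it has any invalid characters, or if it has
    fewer than min_acgt standard A/C/G/T characters (e.g. all Ns).
    """
    flagged = []
    with open(index_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            acgt = sum(int(row[col]) for col in ACGT_COLUMNS)
            invalid = int(row[INVALID_COLUMN])
            if invalid > 0:
                flagged.append((row[NAME_COLUMN], f"{invalid} invalid characters"))
            elif acgt < min_acgt:
                flagged.append((row[NAME_COLUMN], f"only {acgt} A/C/G/T characters"))
    return flagged
```

Calling find_problematic("nextstrain_results/alignment_index.tsv") and printing the result should give a short list of names to inspect before the tree step.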

You can tell augur filter to exclude sequences with invalid characters with the --non-nucleotide flag. Using this flag requires you to provide your sequences as an input along with the metadata.

If you don’t see any issues with the number of invalid characters in your alignment, it would be helpful to visualize your alignment with a tool like AliView.

Thanks @jlhudd !
I included a bunch of sequences with either only N’s or mostly N’s. I thought these would be filtered out during augur filter and align, but they were still part of the aligned.fasta. Removing them fixed the problem.

Sometimes we get these sequences with only N’s because we create reference-based consensus sequences. But I can include some additional sanity checks before we start the Nextstrain build.

@jonr I’m glad you found the issue! When you run augur filter, you can set a minimum length per sequence with the --min-length argument, which counts only the A, C, G, and T characters in each sequence. For example, you could run the filter command with --min-length 1 to ensure that sequences consisting entirely of Ns get dropped from your analysis.
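
If you want the extra sanity check before the build even starts, something like the following could work. It is only a sketch in plain Python that applies the same idea (count A, C, G, and T, drop anything below a threshold) to a FASTA file. It is not how augur filter is implemented, and the function names and the min_acgt parameter are hypothetical.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_acgt(sequence):
    """Count standard nucleotide characters, ignoring case."""
    upper = sequence.upper()
    return sum(upper.count(base) for base in "ACGT")


def drop_uninformative(in_path, out_path, min_acgt=1):
    """Write sequences with at least min_acgt A/C/G/T characters to out_path.

    Returns the headers of the sequences that were dropped.
    """
    dropped = []
    with open(out_path, "w") as out:
        for header, sequence in read_fasta(in_path):
            if count_acgt(sequence) < min_acgt:
                dropped.append(header)
            else:
                out.write(f">{header}\n{sequence}\n")
    return dropped
```

With min_acgt=1 this only removes sequences that contain no A, C, G, or T at all. For the "mostly N" consensus sequences you would probably want a higher threshold that suits your genome length.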
