It’s been a while since I’ve run the local build of Nextstrain on my computer, so I went ahead and updated everything: first from GitHub, then from the CLI using `conda update --all`.
Ever since the update, I have not been able to run the analysis successfully. It looks like a lot has changed, and I was wondering whether someone could help me nail down where this issue is coming from.
I am not sure why it is asking for 3 sequences in the alignment. Is it saying that none of my sequences passed the quality checks, and now only the two reference sequences (Wuhan/Hu-1/2019 and Wuhan/WHO1/2019) remain?
If it helps, those two reference sequences are the only sequences in my aligned-delim.fasta file in the results folder. In the past, almost all of the sequences I’ve given Nextstrain have passed the quality checks, so I’m not sure where to look next.
This is one of the places where sequences may disappear.
The path to the logs looks like this: logs/subsample_{build_name}_{subsample}.txt
In the Snakemake workflow that I linked to above you can see the output files from that rule. It’s worth checking them and the rules that consume these output files.
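One way to see where sequences disappear is to count the records in each intermediate FASTA the workflow writes. Here is a minimal sketch (the `results/` glob pattern is an assumption; adjust it to your build directory):

```python
import glob
import lzma

def count_fasta_records(path):
    """Count sequence records (lines starting with '>') in a FASTA file,
    transparently handling the .xz-compressed files the workflow produces."""
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Report a record count for every FASTA under results/, so you can see
# at which step the sequences drop out.
for path in sorted(glob.glob("results/**/*.fasta*", recursive=True)):
    print(f"{count_fasta_records(path):>6}  {path}")
```

Comparing these counts against the rule order in the Snakefile usually narrows the problem to a single step.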
It’s hard to debug something like this from afar, but if you share the logs, input files, and output files, we can try to get there and maybe help others who hit a similar issue.
I’d also be curious whether @jbarnell has figured it out in the meantime, although it’s been quite a while; sorry for that.
Same issue here: I cloned nextstrain and ncov yesterday, running natively.
Not sure why nearly all the example data is being dropped.
Job 10:
Combine and deduplicate FASTAs
Reason: Input files updated by another job: results/default-build/sample-all.txt
augur filter --sequences results/aligned_reference_data.fasta.xz --metadata results/sanitized_metadata_reference_data.tsv.xz --exclude-all --include results/default-build/sample-all.txt --output-sequences results/default-build/default-build_subsampled_sequences.fasta.xz --output-metadata results/default-build/default-build_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_default-build.txt
488 strains were dropped during filtering
240 had no metadata
250 of these were dropped by `--exclude-all`
250 strains were added back because they were in results/default-build/sample-all.txt
2 strains passed all filters
Note: You did not provide a sequence index, so Augur will generate one. You can generate your own index ahead of time with `augur index` and pass it with `augur filter --sequence-index`.
0 strains were dropped during filtering
2 strains passed all filters
[Wed Aug 31 10:21:38 2022]
Finished job 7.
10 of 30 steps (33%) done
Select jobs to execute...
[Wed Aug 31 10:21:38 2022]
Job 6: Building tree
Reason: Missing output files: results/default-build/tree_raw.nwk; Input files updated by another job: results/default-build/filtered.fasta
Same error
Error in rule tree:
jobid: 6
output: results/default-build/tree_raw.nwk
log: logs/tree_default-build.txt (check log file(s) for error message)
conda-env: path-to/ncov/.snakemake/conda/606fba2748c6c88ce497ee03a13af39a_
shell:
augur tree --alignment results/default-build/filtered.fasta --tree-builder-args '-ninit 10 -n 4' --exclude-sites defaults/sites_ignored_for_tree_topology.txt --output results/default-build/tree_raw.nwk --nthreads 8 2>&1 | tee logs/tree_default-build.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Hi everyone, I’m currently encountering the same error and was wondering if anyone here has managed to resolve it. If so, would you be kind enough to share how you addressed the issue? Your insights would be greatly appreciated. Thank you in advance!
Hi @limxr01, the problem that @henrykmer described may be different from the problem @jbarnell originally described, but I would guess that there is a mismatch between the strain names in the input sequences and metadata files. For example, in @henrykmer’s log output above, the following lines indicate that augur filter saw 490 strains in total (the 488 it dropped plus the 2 it kept), but 240 of them had no metadata and only 250 had both sequences and metadata:
488 strains were dropped during filtering
240 had no metadata
250 of these were dropped by `--exclude-all`
250 strains were added back because they were in results/default-build/sample-all.txt
2 strains passed all filters
This suggests that the names in the sequences and metadata didn’t match, so augur filter couldn’t link the records and output them during the filtering process. The filter command would report the sequences that don’t have matching metadata records as having no metadata. The solution is to recreate the sequences and metadata files with matching strain names before running the workflow.
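A quick way to spot such mismatches is to compare the strain names in the FASTA headers against the `strain` column of the metadata TSV. This is just a sketch; the input paths are placeholders for whatever files you pass to the workflow:

```python
import csv
import lzma

def fasta_names(path):
    """Strain names from FASTA headers: text after '>' up to the first whitespace."""
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt") as handle:
        return {line[1:].split()[0] for line in handle if line.startswith(">")}

def metadata_names(path, column="strain"):
    """Strain names from the metadata TSV's strain column."""
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt") as handle:
        return {row[column] for row in csv.DictReader(handle, delimiter="\t")}

# Hypothetical input paths; substitute your own files.
try:
    seqs = fasta_names("data/sequences.fasta")
    meta = metadata_names("data/metadata.tsv")
    print("in sequences but not metadata:", sorted(seqs - meta)[:10])
    print("in metadata but not sequences:", sorted(meta - seqs)[:10])
except FileNotFoundError as err:
    print("input not found:", err)
```

Any name that shows up in only one of the two sets will be reported by augur filter as having no metadata.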
@limxr01 Can you share the complete command you are using to run your ncov workflow and the complete error output you’re getting, so we have a better sense of the problem?
I forgot to mention, regarding @jbarnell’s original question, that you need at least 3 sequences to build a phylogenetic tree, so IQ-TREE throws an error when it finds only the 2 reference sequences. This error is an internal sanity check in IQ-TREE. In this context, it tells you that something went wrong in the workflow upstream of the tree-building step.
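As a quick pre-flight check before rerunning the tree rule, you can confirm that the filtered alignment holds at least 3 records. A minimal sketch (the path is an assumption based on the failing `augur tree` command above):

```python
import lzma

def has_enough_sequences(path, minimum=3):
    """True if the FASTA at `path` holds at least `minimum` records;
    False if the file is missing or has too few sequences for tree building."""
    opener = lzma.open if path.endswith(".xz") else open
    try:
        with opener(path, "rt") as handle:
            return sum(1 for line in handle if line.startswith(">")) >= minimum
    except FileNotFoundError:
        return False

# Hypothetical path: the alignment handed to `augur tree` in the failing rule.
print("enough sequences for IQ-TREE:",
      has_enough_sequences("results/default-build/filtered.fasta"))
```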
Hi @jlhudd, thank you for your response! It looks like I’m encountering both of the issues you mentioned: some strains are being dropped during filtering due to missing metadata, and IQ-TREE is throwing an error because it only has the two reference sequences to work with. I’ve attached my full command and error log in the .txt file for your reference. failed_complete.txt (19.2 KB)
I’ll double-check my sequence and metadata files for any strain name mismatches as you suggested. If you have any further tips on resolving this, I’d appreciate it! Thank you so much for your help!
Thank you for the log file, @limxr01; that helps a lot! It looks like by the time the workflow reaches the filtering rule (Job 6 in your log file), only one record is left in the sequences and metadata. This suggests something went wrong upstream, between the initial subsampling (where 4334 strains passed the filters) and that filter step.
To get a better idea of when those records got dropped, can you share the full output of the following command run from the ncov directory?