The pipeline takes a very long time to complete

Good afternoon!
I am running the classic ncov pipeline on my own data (300 genomes), with https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz as reference data. Please tell me, is it normal for the pipeline to run for about two hours? It seems like it’s taking too long.
Nextstrain CLI was installed according to the standard instructions for Docker. Launch command:
mamba activate nextstrain-cli
nextstrain build . --configfile ncov-tutorial/custom-data.yaml --cores 144

System specifications:
centos-release-7-9.2009.1.el7.centos.x86_64, 144 threads, 2 TB RAM.
However, in htop, the load is very low. For example, only a few threads are involved during the execution of the augur refine command.

Hi @magletdinov,

2 hours might be expected. For reference, we run 29 builds in parallel on all 9 million samples with 72 CPUs and that takes ~2 hours.

There are two ways to inspect workflow run times:

  1. If you use a version of the workflow earlier than v17, you can run with Snakemake’s --stats and visualize the timing using Snakemake run stats.

  2. You can inspect the files in benchmark/ which contain per-job details including run time. This is the only option on workflow version ≥v17.
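For a quick overall number, you can sum the wall-clock column across the benchmark files. This is a sketch assuming Snakemake’s standard benchmark format, where column 1 is seconds and each file has one header line:

```shell
# Sum the wall-clock seconds (column 1) across all benchmark files,
# skipping each file's header line, and report the total in hours too.
awk 'FNR > 1 { sum += $1 } END { printf "total: %.1f s (%.2f h)\n", sum, sum/3600 }' \
    benchmark/*.txt
```

Note this sums per-rule times, so for a workflow with parallel jobs it is an upper bound on wall-clock time.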

Can you double check that you are indeed starting with the ncov/open/global dataset of just a few thousand sequences, not the full ncov/open dataset of 9 million samples? It would take some time to filter/subsample the full dataset.

TreeTime (the underlying tool used in augur refine) cannot use multiple threads, so the low load is expected.

– Victor

Thank you very much for your answer! I double checked, I do use a shortened small dataset:

inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz
    sequences: https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz
  - name: custom_data
    metadata: data/custom.metadata.tsv
    sequences: data/custom.sequences.fasta

And my custom dataset contains 347 sequences.

I checked the logs in the benchmark folder:
awk 'FNR>1 {print FILENAME "\t" $1 "\t" $3 "\t" $9}' *_custom-build.txt | sort -k2 -nr | column -t

File Time (s) Memory (MB) Disk (MB)
tree_custom-build.txt 4001.3575 4929.90 175.83
refine_custom-build.txt 3554.2668 2203.08 99.43
ancestral_custom-build.txt 305.3237 2597.22 93.65
aamuts_custom-build.txt 227.6779 1139.29 98.48
traits_custom-build.txt 211.3163 229.99 92.55
align_custom-build.txt 35.0608 398.37 357.78
export_custom-build.txt 26.7970 649.24 57.06
subsample_regions_custom-build.txt 16.3503 844.79 203.65
filter_custom-build.txt 12.3746 348.49 54.57
fix_colorings_custom-build.txt 10.4294 80.62 75.94
mask_custom-build.txt 9.4255 129.39 56.66
diagnostics_custom-build.txt 9.3398 134.20 86.94
emerging_lineages_custom-build.txt 8.6692 535.98 82.70
clades_custom-build.txt 7.9151 535.64 85.52
index_sequences_custom-build.txt 6.0348 156.64 85.84
tip_frequencies_custom-build.txt 5.3848 183.86 87.28
mlr_lineage_fitness_custom-build.txt 4.7147 581.71 66.00
adjust_metadata_regions_custom-build.txt 4.0147 208.94 76.97
annotate_metadata_with_index_custom-build.txt 3.4737 170.10 56.13
join_metadata_and_nextclade_qc_custom-build.txt 2.8419 129.72 57.72
calculate_epiweeks_custom-build.txt 2.3988 176.81 65.91
recency_custom-build.txt 2.2634 172.15 70.34
colors_custom-build.txt 1.5058 129.52 33.85
clade_files_custom-build.txt 0.3138 5.99 0.00

Thus, the total pipeline execution time is more than 2 hours for less than 5500 samples.

It seems like this shouldn’t happen.

Thanks for confirming. I used your command to grab the timings for global_6m in our latest run, which should be roughly comparable after the subsample step since it uses the same sequences as your reference_data input. Here is a comparison table sorted by % change in timing.

rule global_6m (s) custom-build (s) % change
traits 35.4976 211.3163 495.30
tree 1668.1000 4001.3575 139.88
mask 3.9651 9.4255 137.71
clade_files 0.1420 0.3138 120.99
join_metadata_and_nextclade_qc 1.3403 2.8419 112.03
tip_frequencies 2.9118 5.3848 84.93
annotate_metadata_with_index 1.9053 3.4737 82.32
diagnostics 5.7969 9.3398 61.12
filter 8.1634 12.3746 51.59
fix_colorings 7.6169 10.4294 36.92
clades 6.0768 7.9151 30.25
refine 2787.3943 3554.2668 27.51
export 21.0506 26.7970 27.30
index_sequences 4.9844 6.0348 21.07
align 29.7773 35.0608 17.74
emerging_lineages 7.9342 8.6692 9.26
aamuts 237.7126 227.6779 -4.22
mlr_lineage_fitness 5.0553 4.7147 -6.74
calculate_epiweeks 2.6967 2.3988 -11.05
recency 2.7168 2.2634 -16.69
adjust_metadata_regions 4.8729 4.0147 -17.61
ancestral 397.3776 305.3237 -23.17
colors 3.1951 1.5058 -52.87
subsample_regions 1228.6251 16.3503 -98.67
compress_build_align 22.4404 - -
deploy_single 28.7834 - -
make_auspice_config 0.0230 - -
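A comparison like the one above can also be generated with standard tools, assuming each run’s timings have first been reduced to two-column rule<TAB>seconds files sorted by rule name (the file names ref.tsv and custom.tsv here are hypothetical):

```shell
# Join two rule<TAB>seconds summaries on the rule name, compute % change,
# and sort by % change descending. Both inputs must be sorted on column 1.
join -t "$(printf '\t')" ref.tsv custom.tsv \
  | awk -F '\t' '{ printf "%s\t%.4f\t%.4f\t%.2f\n", $1, $2, $3, ($3 - $2) / $2 * 100 }' \
  | sort -t "$(printf '\t')" -k4 -nr
```

Rules present in only one run (like compress_build_align above) are dropped by the join, so check those separately.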

Some increased time is expected since you are adding 347 sequences, but indeed there are a few differences that seem to be outliers. The biggest factor is the tree building step (rule tree/augur tree/IQ-TREE) taking significantly more time in your run.

It might be helpful to run without the additional 347 sequences as a test. If the timings are similar to our global_6m, that means the increased time is due to the additional 347 sequences. Otherwise, if it takes just as long as what you’re seeing now, I would guess that the slowness is caused by some difference in installation or hardware. Can you provide installation details by pasting the output of nextstrain version --verbose?

Yes, sure.
I installed Nextstrain CLI following Installation — Nextstrain CLI 10.2.1.post1 documentation:
mamba install nextstrain-cli \
  -c conda-forge -c bioconda \
  --strict-channel-priority \
  --override-channels

And then according to: Installing Nextstrain — Nextstrain documentation
nextstrain setup --set-default docker

The pipeline was launched via:
nextstrain shell .
nextstrain build . --configfile ncov-tutorial/custom-data.yaml --cores 144

When typing nextstrain version --verbose outside an interactive session (immediately after mamba activate nextstrain-cli):
nextstrain version --verbose
Nextstrain CLI 10.2.1.post1

Python
/export/home/user/mambaforge/envs/nextstrain-cli/bin/python3.11
3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]

Runtimes
docker (default)
nextstrain/base:build-20250721T201347Z (ee93a065f54a, 2025-07-21 23:52:58)
augur 31.3.0
auspice v2.63.1
fauna 1e0c4e2

conda
nextstrain-base unknown

singularity
docker://nextstrain/base (not present)

ambient
unknown

aws-batch
unknown

Pathogens
(none)

And if I run a command inside an interactive session (after nextstrain shell .):
nextstrain version --verbose
Nextstrain CLI 10.2.1.post1

Python
/usr/local/bin/python3.11
3.11.13 (main, Jul 1 2025, 02:42:16) [GCC 12.2.0]

Runtimes
docker
unknown

conda
nextstrain-base unknown

singularity
docker://nextstrain/base (not present)

ambient (default)
augur 31.3.0
auspice 2.63.1

aws-batch
unknown

Pathogens
(none)

Thanks for your help! I’ve now launched the pipeline with the reference data only, without the custom sequences.

Thanks for the info. Those outputs indicate that you’re using the latest version of Nextstrain CLI and the Docker image, which is good. Hopefully the test run will provide more insight.

File Time (s) Memory (MB) Disk (MB)
tree_default-build.txt 4761.3668 4449.90 191.98
refine_default-build.txt 3409.1616 2084.93 99.22
ancestral_default-build.txt 288.4192 2366.36 95.84
aamuts_default-build.txt 228.8086 1067.48 97.88
traits_default-build.txt 224.4134 221.28 87.11
align_default-build.txt 32.0781 408.66 386.56
export_default-build.txt 27.1918 576.07 55.39
subsample_regions_default-build.txt 16.4839 850.14 194.43
filter_default-build.txt 10.5544 348.47 54.36
fix_colorings_default-build.txt 10.3670 93.59 79.39
mask_default-build.txt 9.1379 129.18 56.46
emerging_lineages_default-build.txt 8.8716 526.22 84.54
diagnostics_default-build.txt 8.7821 112.10 87.34
clades_default-build.txt 7.9482 526.15 83.67
index_sequences_default-build.txt 5.8465 156.56 89.29
mlr_lineage_fitness_default-build.txt 4.8670 560.36 73.56
tip_frequencies_default-build.txt 4.5644 183.50 80.38
adjust_metadata_regions_default-build.txt 3.1294 197.70 66.46
annotate_metadata_with_index_default-build.txt 2.7356 148.29 56.29
join_metadata_and_nextclade_qc_default-build.txt 2.3916 114.09 44.28
calculate_epiweeks_default-build.txt 2.1851 176.95 70.50
recency_default-build.txt 2.0782 170.48 50.44
colors_default-build.txt 1.4548 128.21 35.04
clade_files_default-build.txt 0.3357 0.98 0.00

The run has finished. I’m also providing my config:

inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz
    sequences: https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz

refine:
  root: "Wuhan-Hu-1/2019"

Great, thank you for the update. Since these timings are similar to your previous run with additional sequences, I suspect this is due to installation or hardware differences.

Let’s focus on the tree building slowness. IQ-TREE in the Docker image is a direct download of the official pre-built binary. There are two alternatives. First, you can try using the Conda runtime, which uses an independently built IQ-TREE binary (via Bioconda). You can do this by running nextstrain setup --set-default conda and then retrying the workflow. If that doesn’t solve the issue, I can update IQ-TREE in our Docker image to a more recent version to see if that helps.

I’ve created a Docker image with IQ-TREE v3.0.1 that you can use to test:

nextstrain build --image nextstrain/base:branch-victorlin-update-iqtree . [snakemake options]

Note: for this you must use the nextstrain command on your host, not the one within nextstrain shell.

Thank you very much! I tried the Conda runtime instead of Docker, but got an error related to my system’s __glibc version:

The following package could not be installed

└─ nextstrain-base 20250729T235918Z is not installable because it requires
   └─ tsv-utils 2.2.3 hd68e0f1_8, which requires
      └─ __glibc >=2.28,<3.0.a0, which is missing on the system.
critical libmamba Could not solve for environment specs.

I asked our server admin about this issue, and he told me it is impossible to update glibc.
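For reference, the system glibc version can be checked directly (CentOS 7 ships glibc 2.17, below the >=2.28 requirement above):

```shell
# Print the system libc release; the first line of `ldd --version`
# reports it (e.g. 2.17 on CentOS 7).
ldd --version | head -n 1
```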
So I tried the second approach with the updated IQ-TREE image.
Results:

File Time (s) Memory (MB) Disk (MB)
tree_default-build.txt 3463.9374 5749.74 162.79
refine_default-build.txt 3455.6601 2121.14 99.60
ancestral_default-build.txt 285.0744 2345.63 89.72
aamuts_default-build.txt 240.2425 1045.83 93.19
traits_default-build.txt 198.4825 220.01 98.53
align_default-build.txt 33.2948 397.04 373.93
export_default-build.txt 29.2422 610.95 51.98
subsample_regions_default-build.txt 16.3662 873.46 202.67
filter_default-build.txt 12.7170 346.88 48.06
fix_colorings_default-build.txt 12.2880 80.05 78.70
mask_default-build.txt 9.8085 131.58 56.68
emerging_lineages_default-build.txt 8.8931 529.06 87.70
diagnostics_default-build.txt 8.6254 106.68 88.23
clades_default-build.txt 7.9315 529.62 82.95
index_sequences_default-build.txt 6.0268 156.98 84.81
mlr_lineage_fitness_default-build.txt 4.8771 499.26 67.10
tip_frequencies_default-build.txt 4.7400 185.51 77.41
adjust_metadata_regions_default-build.txt 3.0180 198.81 67.94
annotate_metadata_with_index_default-build.txt 2.8048 146.61 54.19
calculate_epiweeks_default-build.txt 2.3805 174.00 65.09
recency_default-build.txt 2.3386 171.00 64.16
join_metadata_and_nextclade_qc_default-build.txt 2.1682 116.27 49.37
colors_default-build.txt 1.4145 111.43 36.04
clade_files_default-build.txt 0.2747 1.23 0.00

The total improvement in run time was about 16 percent, but it’s probably still too long.

Thanks for the update. It’s unfortunate that Docker is still slow even with the newer version of IQ-TREE. There is another alternative to the Nextstrain CLI Conda runtime that should work with older versions of glibc: install the packages in your own conda environment and use the ambient runtime:

mamba activate nextstrain-cli

mamba install -c conda-forge -c bioconda -c nextstrain nextstrain-base
# or
mamba install -c conda-forge -c bioconda --yes \
      augur auspice nextclade \
      snakemake git epiweeks \
      ncbi-datasets-cli csvtk seqkit tsv-utils

nextstrain build --ambient . [snakemake options]

If you use the second mamba install command, which lists all packages individually instead of using the nextstrain-base meta-package, you should be able to get things installed. I’ll look into why the glibc requirement was bumped for tsv-utils (I maintain the feedstock).

The latest version of the Conda runtime supports glibc 2.17, so you can try this again:

nextstrain setup --set-default conda