Good afternoon!
I am running the classic ncov pipeline on my own data (300 genomes) together with https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz as reference data. Is it normal for the pipeline to run for about two hours? It seems like it’s taking too long.
Nextstrain CLI was installed according to the standard instructions for the Docker runtime. Launch command:
mamba activate nextstrain-cli
nextstrain build . --configfile ncov-tutorial/custom-data.yaml --cores 144
System specifications:
centos-release-7-9.2009.1.el7.centos.x86_64, 144 threads, 2 TB RAM.
However, the load in htop is very low; for example, only a few threads are used while `augur refine` is running.
Hi @magletdinov,
2 hours might be expected. For reference, we run 29 builds in parallel on all 9 million samples with 72 CPUs and that takes ~2 hours.
There are two ways to inspect workflow run times:
- If you use a version of the workflow earlier than v17, you can run with Snakemake’s `--stats` and visualize the timing using Snakemake run stats.
- You can inspect the files in `benchmark/`, which contain per-job details including run time. This is the only option on workflow version ≥ v17.
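As a sketch of the second option, the per-rule benchmark files are plain TSVs (header line, then one data row whose first column is the runtime in seconds and third column is max RSS in MB), so they can be summarized with standard shell tools. The file names and values below are illustrative, not from a real run:

```shell
# Sketch: summarizing Snakemake benchmark TSVs with standard shell tools.
workdir=$(mktemp -d) && cd "$workdir"
mkdir -p benchmark && cd benchmark

# Two example files standing in for real per-rule benchmarks (hypothetical values).
printf 's\th:m:s\tmax_rss\n4001.35\t1:06:41\t4929.90\n' > tree_custom-build.txt
printf 's\th:m:s\tmax_rss\n12.37\t0:00:12\t348.49\n'    > filter_custom-build.txt

# Skip each file's header (FNR > 1), print file name, runtime, and memory,
# then sort by runtime, longest first.
awk 'FNR > 1 { print FILENAME "\t" $1 "\t" $3 }' *_custom-build.txt \
  | sort -k2,2 -nr \
  | column -t
```

With these sample files, the tree step sorts to the top as the slowest rule.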
Can you double check that you are indeed starting with the ncov/open/global dataset of just a few thousand sequences, not the full ncov/open dataset of 9 million samples? It would take some time to filter/subsample the full dataset.
TreeTime (the underlying tool used in `augur refine`) cannot use multiple threads, so the low load is expected.
– Victor
Thank you very much for your answer! I double checked, I do use a shortened small dataset:
```yaml
inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz
    sequences: https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz
  - name: custom_data
    metadata: data/custom.metadata.tsv
    sequences: data/custom.sequences.fasta
```
And my custom dataset contains 347 sequences.
I checked the logs in the benchmark folder:
```shell
awk 'FNR>1 {print FILENAME "\t" $1 "\t" $3 "\t" $9}' *_custom-build.txt | sort -k2 -nr | column -t
```
| File | Time (s) | Memory (MB) | Disk (MB) |
|---|---|---|---|
| tree_custom-build.txt | 4001.3575 | 4929.90 | 175.83 |
| refine_custom-build.txt | 3554.2668 | 2203.08 | 99.43 |
| ancestral_custom-build.txt | 305.3237 | 2597.22 | 93.65 |
| aamuts_custom-build.txt | 227.6779 | 1139.29 | 98.48 |
| traits_custom-build.txt | 211.3163 | 229.99 | 92.55 |
| align_custom-build.txt | 35.0608 | 398.37 | 357.78 |
| export_custom-build.txt | 26.7970 | 649.24 | 57.06 |
| subsample_regions_custom-build.txt | 16.3503 | 844.79 | 203.65 |
| filter_custom-build.txt | 12.3746 | 348.49 | 54.57 |
| fix_colorings_custom-build.txt | 10.4294 | 80.62 | 75.94 |
| mask_custom-build.txt | 9.4255 | 129.39 | 56.66 |
| diagnostics_custom-build.txt | 9.3398 | 134.20 | 86.94 |
| emerging_lineages_custom-build.txt | 8.6692 | 535.98 | 82.70 |
| clades_custom-build.txt | 7.9151 | 535.64 | 85.52 |
| index_sequences_custom-build.txt | 6.0348 | 156.64 | 85.84 |
| tip_frequencies_custom-build.txt | 5.3848 | 183.86 | 87.28 |
| mlr_lineage_fitness_custom-build.txt | 4.7147 | 581.71 | 66.00 |
| adjust_metadata_regions_custom-build.txt | 4.0147 | 208.94 | 76.97 |
| annotate_metadata_with_index_custom-build.txt | 3.4737 | 170.10 | 56.13 |
| join_metadata_and_nextclade_qc_custom-build.txt | 2.8419 | 129.72 | 57.72 |
| calculate_epiweeks_custom-build.txt | 2.3988 | 176.81 | 65.91 |
| recency_custom-build.txt | 2.2634 | 172.15 | 70.34 |
| colors_custom-build.txt | 1.5058 | 129.52 | 33.85 |
| clade_files_custom-build.txt | 0.3138 | 5.99 | 0.00 |
Thus, the total pipeline execution time is more than 2 hours for less than 5500 samples.
It seems like this shouldn’t happen.
Thanks for confirming. I used your command to grab the timings for global_6m in our latest run, which should be roughly comparable after the subsample step since it is the same sequences as in your reference_data input. Here is a comparison table sorted by % change in timing.
| rule | global_6m | custom-build | % change |
|---|---|---|---|
| traits | 35.4976 | 211.3163 | 495.30 |
| tree | 1668.1000 | 4001.3575 | 139.88 |
| mask | 3.9651 | 9.4255 | 137.71 |
| clade_files | 0.1420 | 0.3138 | 120.99 |
| join_metadata_and_nextclade_qc | 1.3403 | 2.8419 | 112.03 |
| tip_frequencies | 2.9118 | 5.3848 | 84.93 |
| annotate_metadata_with_index | 1.9053 | 3.4737 | 82.32 |
| diagnostics | 5.7969 | 9.3398 | 61.12 |
| filter | 8.1634 | 12.3746 | 51.59 |
| fix_colorings | 7.6169 | 10.4294 | 36.92 |
| clades | 6.0768 | 7.9151 | 30.25 |
| refine | 2787.3943 | 3554.2668 | 27.51 |
| export | 21.0506 | 26.7970 | 27.30 |
| index_sequences | 4.9844 | 6.0348 | 21.07 |
| align | 29.7773 | 35.0608 | 17.74 |
| emerging_lineages | 7.9342 | 8.6692 | 9.26 |
| aamuts | 237.7126 | 227.6779 | -4.22 |
| mlr_lineage_fitness | 5.0553 | 4.7147 | -6.74 |
| calculate_epiweeks | 2.6967 | 2.3988 | -11.05 |
| recency | 2.7168 | 2.2634 | -16.69 |
| adjust_metadata_regions | 4.8729 | 4.0147 | -17.61 |
| ancestral | 397.3776 | 305.3237 | -23.17 |
| colors | 3.1951 | 1.5058 | -52.87 |
| subsample_regions | 1228.6251 | 16.3503 | -98.67 |
| compress_build_align | 22.4404 | - | - |
| deploy_single | 28.7834 | - | - |
| make_auspice_config | 0.0230 | - | - |
Some increased time is expected since you are adding 347 sequences, but indeed there are a few differences that seem to be outliers. The biggest factor is the tree building step (rule `tree` / `augur tree` / IQ-TREE) taking significantly more time in your run.
It might be helpful to run without the additional 347 sequences as a test. If the timings are similar to our global_6m, that means the increased time is due to the additional 347 sequences. Otherwise, if it takes just as long as what you’re seeing now, I would guess that the slowness is caused by some difference in installation or hardware. Can you provide installation details by pasting the output of `nextstrain version --verbose`?
Yes, sure.
I installed nextstrain like this: Installation — Nextstrain CLI 10.2.1.post1 documentation
mamba install nextstrain-cli \
  -c conda-forge -c bioconda \
  --strict-channel-priority \
  --override-channels
And then according to: Installing Nextstrain — Nextstrain documentation
nextstrain setup --set-default docker
The pipeline was launched via:
nextstrain shell .
nextstrain build . --configfile ncov-tutorial/custom-data.yaml --cores 144
When running nextstrain version --verbose outside an interactive session (immediately after mamba activate nextstrain-cli):
nextstrain version --verbose
Nextstrain CLI 10.2.1.post1
Python
/export/home/user/mambaforge/envs/nextstrain-cli/bin/python3.11
3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]
Runtimes
docker (default)
nextstrain/base:build-20250721T201347Z (ee93a065f54a, 2025-07-21 23:52:58)
augur 31.3.0
auspice v2.63.1
fauna 1e0c4e2
conda
nextstrain-base unknown
singularity
docker://nextstrain/base (not present)
ambient
unknown
aws-batch
unknown
Pathogens
(none)
And if I run a command inside an interactive session (after nextstrain shell .):
nextstrain version --verbose
Nextstrain CLI 10.2.1.post1
Python
/usr/local/bin/python3.11
3.11.13 (main, Jul 1 2025, 02:42:16) [GCC 12.2.0]
Runtimes
docker
unknown
conda
nextstrain-base unknown
singularity
docker://nextstrain/base (not present)
ambient (default)
augur 31.3.0
auspice 2.63.1
aws-batch
unknown
Pathogens
(none)
Thanks for your help! I’ve now launched the pipeline with the reference data only, without the custom sequences.
Thanks for the info. Those outputs indicate that you’re using the latest version of Nextstrain CLI and the Docker image, which is good. Hopefully the test run will provide more insight.
| File | Time (s) | Memory (MB) | Disk (MB) |
|---|---|---|---|
| tree_default-build.txt | 4761.3668 | 4449.90 | 191.98 |
| refine_default-build.txt | 3409.1616 | 2084.93 | 99.22 |
| ancestral_default-build.txt | 288.4192 | 2366.36 | 95.84 |
| aamuts_default-build.txt | 228.8086 | 1067.48 | 97.88 |
| traits_default-build.txt | 224.4134 | 221.28 | 87.11 |
| align_default-build.txt | 32.0781 | 408.66 | 386.56 |
| export_default-build.txt | 27.1918 | 576.07 | 55.39 |
| subsample_regions_default-build.txt | 16.4839 | 850.14 | 194.43 |
| filter_default-build.txt | 10.5544 | 348.47 | 54.36 |
| fix_colorings_default-build.txt | 10.3670 | 93.59 | 79.39 |
| mask_default-build.txt | 9.1379 | 129.18 | 56.46 |
| emerging_lineages_default-build.txt | 8.8716 | 526.22 | 84.54 |
| diagnostics_default-build.txt | 8.7821 | 112.10 | 87.34 |
| clades_default-build.txt | 7.9482 | 526.15 | 83.67 |
| index_sequences_default-build.txt | 5.8465 | 156.56 | 89.29 |
| mlr_lineage_fitness_default-build.txt | 4.8670 | 560.36 | 73.56 |
| tip_frequencies_default-build.txt | 4.5644 | 183.50 | 80.38 |
| adjust_metadata_regions_default-build.txt | 3.1294 | 197.70 | 66.46 |
| annotate_metadata_with_index_default-build.txt | 2.7356 | 148.29 | 56.29 |
| join_metadata_and_nextclade_qc_default-build.txt | 2.3916 | 114.09 | 44.28 |
| calculate_epiweeks_default-build.txt | 2.1851 | 176.95 | 70.50 |
| recency_default-build.txt | 2.0782 | 170.48 | 50.44 |
| colors_default-build.txt | 1.4548 | 128.21 | 35.04 |
| clade_files_default-build.txt | 0.3357 | 0.98 | 0.00 |
The calculation is finished, I also provide my config:
```yaml
inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz
    sequences: https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz
refine:
  root: "Wuhan-Hu-1/2019"
```
Great, thank you for the update. Since these timings are similar to your previous run with additional sequences, I suspect this is due to installation or hardware differences.
Let’s focus on the tree building slowness. IQ-TREE in the Docker image is a direct download of the official pre-built binary. I have 2 alternatives. First, you can try using the Conda runtime, which uses an independently built IQ-TREE binary (via Bioconda). You can do this by running nextstrain setup --set-default conda
then retrying the workflow. If that doesn’t solve the issue, I can update IQ-TREE in our Docker image to a more recent version to see if that helps.
I’ve created a Docker image with IQ-TREE v3.0.1 that you can use to test:
nextstrain build --image nextstrain/base:branch-victorlin-update-iqtree . [snakemake options]
Note: for this you must use the `nextstrain` command on your host, not the one within `nextstrain shell`.
Thank you very much! I tried to use conda instead of docker, but I was getting an error related to my version of __glibc:
```
The following package could not be installed
└─ nextstrain-base 20250729T235918Z is not installable because it requires
   └─ tsv-utils 2.2.3 hd68e0f1_8, which requires
      └─ __glibc >=2.28,<3.0.a0, which is missing on the system.

critical libmamba Could not solve for environment specs.
```
I asked the server admin about this issue, and he told me it is impossible to update __glibc.
Thus I tried the second approach, with the updated IQ-TREE.
Results:
| File | Time (s) | Memory (MB) | Disk (MB) |
|---|---|---|---|
| tree_default-build.txt | 3463.9374 | 5749.74 | 162.79 |
| refine_default-build.txt | 3455.6601 | 2121.14 | 99.60 |
| ancestral_default-build.txt | 285.0744 | 2345.63 | 89.72 |
| aamuts_default-build.txt | 240.2425 | 1045.83 | 93.19 |
| traits_default-build.txt | 198.4825 | 220.01 | 98.53 |
| align_default-build.txt | 33.2948 | 397.04 | 373.93 |
| export_default-build.txt | 29.2422 | 610.95 | 51.98 |
| subsample_regions_default-build.txt | 16.3662 | 873.46 | 202.67 |
| filter_default-build.txt | 12.7170 | 346.88 | 48.06 |
| fix_colorings_default-build.txt | 12.2880 | 80.05 | 78.70 |
| mask_default-build.txt | 9.8085 | 131.58 | 56.68 |
| emerging_lineages_default-build.txt | 8.8931 | 529.06 | 87.70 |
| diagnostics_default-build.txt | 8.6254 | 106.68 | 88.23 |
| clades_default-build.txt | 7.9315 | 529.62 | 82.95 |
| index_sequences_default-build.txt | 6.0268 | 156.98 | 84.81 |
| mlr_lineage_fitness_default-build.txt | 4.8771 | 499.26 | 67.10 |
| tip_frequencies_default-build.txt | 4.7400 | 185.51 | 77.41 |
| adjust_metadata_regions_default-build.txt | 3.0180 | 198.81 | 67.94 |
| annotate_metadata_with_index_default-build.txt | 2.8048 | 146.61 | 54.19 |
| calculate_epiweeks_default-build.txt | 2.3805 | 174.00 | 65.09 |
| recency_default-build.txt | 2.3386 | 171.00 | 64.16 |
| join_metadata_and_nextclade_qc_default-build.txt | 2.1682 | 116.27 | 49.37 |
| colors_default-build.txt | 1.4145 | 111.43 | 36.04 |
| clade_files_default-build.txt | 0.2747 | 1.23 | 0.00 |
The total improvement in run time was about 16 percent, but it’s probably still too long.
Thanks for the update. It’s unfortunate that the Docker runtime is still slow even with the newer version of IQ-TREE. There is another alternative to the Nextstrain CLI conda runtime that should work with older versions of glibc. You can install the packages in your own conda environment and use the ambient runtime:
mamba activate nextstrain-cli
mamba install -c conda-forge -c bioconda -c nextstrain nextstrain-base
# or
mamba install -c conda-forge -c bioconda --yes \
augur auspice nextclade \
snakemake git epiweeks \
ncbi-datasets-cli csvtk seqkit tsv-utils
nextstrain build --ambient . [snakemake options]
If you use the second mamba install command, which lists all packages instead of the nextstrain-base meta-package, you should be able to get things installed. I’ll see why I bumped the glibc requirement for tsv-utils (I maintain the feedstock).
The latest version of the Conda runtime supports glibc 2.17, so you can try this again: nextstrain setup --set-default conda
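Before retrying, it may help to confirm the host meets that glibc 2.17 floor. A quick check on a GNU/Linux system (this assumes glibc tooling is present; on musl-based systems the output differs):

```shell
# On GNU/Linux, the first line of `ldd --version` reports the glibc version,
# e.g. "ldd (GNU libc) 2.17".
ldd --version 2>&1 | head -n 1

# glibc systems also expose the version via getconf (ignored if unsupported):
getconf GNU_LIBC_VERSION 2>/dev/null || true
```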