The pipeline takes a very long time to complete

Good afternoon!
I am running the classic ncov pipeline on my own data (300 genomes), with https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz as reference data. Please tell me, is it normal for the pipeline to run for about two hours? It seems like it’s taking too long.
Nextstrain CLI was installed according to the standard instructions for Docker. Launch command:
mamba activate nextstrain-cli
nextstrain build . --configfile ncov-tutorial/custom-data.yaml --cores 144

System specifications:
centos-release-7-9.2009.1.el7.centos.x86_64, 144 threads, 2 TB RAM.
However, in htop, the load is very low. For example, only a few threads are involved during the execution of the augur refine command.

Hi @magletdinov,

2 hours might be expected. For reference, we run 29 builds in parallel on all 9 million samples with 72 CPUs and that takes ~2 hours.

There are two ways to inspect workflow run times:

  1. If you use a version of the workflow earlier than v17, you can run with Snakemake’s --stats and visualize the timing using Snakemake run stats.

  2. You can inspect the files in benchmark/ which contain per-job details including run time. This is the only option on workflow version ≥v17.
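For a quick overall number, you can sum the wall-clock column across the benchmark files. This is a sketch assuming Snakemake’s standard benchmark format, where column 1 is seconds and each file has one header line:

```shell
# Sum the wall-clock seconds (column 1) across all benchmark files,
# skipping each file's header line, and report the total in hours too.
awk 'FNR > 1 { sum += $1 } END { printf "total: %.1f s (%.2f h)\n", sum, sum/3600 }' \
    benchmark/*.txt
```

Note this sums per-rule times, so for a workflow with parallel jobs it is an upper bound on wall-clock time.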

Can you double check that you are indeed starting with the ncov/open/global dataset of just a few thousand sequences, not the full ncov/open dataset of 9 million samples? It would take some time to filter/subsample the full dataset.

TreeTime (the underlying tool used in augur refine) cannot use multiple threads, so the low load is expected.

– Victor

Thank you very much for your answer! I double checked, I do use a shortened small dataset:

inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz
    sequences: https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz
  - name: custom_data
    metadata: data/custom.metadata.tsv
    sequences: data/custom.sequences.fasta

And my custom dataset contains 347 sequences.

I checked the logs in the benchmark folder:
awk 'FNR>1 {print FILENAME "\t" $1 "\t" $3 "\t" $9}' *_custom-build.txt | sort -k2 -nr | column -t

File Time (s) Memory (MB) Disk (MB)
tree_custom-build.txt 4001.3575 4929.90 175.83
refine_custom-build.txt 3554.2668 2203.08 99.43
ancestral_custom-build.txt 305.3237 2597.22 93.65
aamuts_custom-build.txt 227.6779 1139.29 98.48
traits_custom-build.txt 211.3163 229.99 92.55
align_custom-build.txt 35.0608 398.37 357.78
export_custom-build.txt 26.7970 649.24 57.06
subsample_regions_custom-build.txt 16.3503 844.79 203.65
filter_custom-build.txt 12.3746 348.49 54.57
fix_colorings_custom-build.txt 10.4294 80.62 75.94
mask_custom-build.txt 9.4255 129.39 56.66
diagnostics_custom-build.txt 9.3398 134.20 86.94
emerging_lineages_custom-build.txt 8.6692 535.98 82.70
clades_custom-build.txt 7.9151 535.64 85.52
index_sequences_custom-build.txt 6.0348 156.64 85.84
tip_frequencies_custom-build.txt 5.3848 183.86 87.28
mlr_lineage_fitness_custom-build.txt 4.7147 581.71 66.00
adjust_metadata_regions_custom-build.txt 4.0147 208.94 76.97
annotate_metadata_with_index_custom-build.txt 3.4737 170.10 56.13
join_metadata_and_nextclade_qc_custom-build.txt 2.8419 129.72 57.72
calculate_epiweeks_custom-build.txt 2.3988 176.81 65.91
recency_custom-build.txt 2.2634 172.15 70.34
colors_custom-build.txt 1.5058 129.52 33.85
clade_files_custom-build.txt 0.3138 5.99 0.00

Thus, the total pipeline execution time is more than 2 hours for less than 5500 samples.

It seems like this shouldn’t happen.

Thanks for confirming. I used your command to grab the timings for global_6m in our latest run, which should be roughly comparable after the subsample step since it uses the same sequences as your reference_data input. Here is a comparison table sorted by % change in timing.

rule global_6m (s) custom-build (s) % change
traits 35.4976 211.3163 495.30
tree 1668.1000 4001.3575 139.88
mask 3.9651 9.4255 137.71
clade_files 0.1420 0.3138 120.99
join_metadata_and_nextclade_qc 1.3403 2.8419 112.03
tip_frequencies 2.9118 5.3848 84.93
annotate_metadata_with_index 1.9053 3.4737 82.32
diagnostics 5.7969 9.3398 61.12
filter 8.1634 12.3746 51.59
fix_colorings 7.6169 10.4294 36.92
clades 6.0768 7.9151 30.25
refine 2787.3943 3554.2668 27.51
export 21.0506 26.7970 27.30
index_sequences 4.9844 6.0348 21.07
align 29.7773 35.0608 17.74
emerging_lineages 7.9342 8.6692 9.26
aamuts 237.7126 227.6779 -4.22
mlr_lineage_fitness 5.0553 4.7147 -6.74
calculate_epiweeks 2.6967 2.3988 -11.05
recency 2.7168 2.2634 -16.69
adjust_metadata_regions 4.8729 4.0147 -17.61
ancestral 397.3776 305.3237 -23.17
colors 3.1951 1.5058 -52.87
subsample_regions 1228.6251 16.3503 -98.67
compress_build_align 22.4404 - -
deploy_single 28.7834 - -
make_auspice_config 0.0230 - -
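A comparison like the one above can also be generated with standard tools, assuming each run’s timings have first been reduced to two-column rule<TAB>seconds files sorted by rule name (the file names ref.tsv and custom.tsv here are hypothetical):

```shell
# Join two rule<TAB>seconds summaries on the rule name, compute % change,
# and sort by % change descending. Both inputs must be sorted on column 1.
join -t "$(printf '\t')" ref.tsv custom.tsv \
  | awk -F '\t' '{ printf "%s\t%.4f\t%.4f\t%.2f\n", $1, $2, $3, ($3 - $2) / $2 * 100 }' \
  | sort -t "$(printf '\t')" -k4 -nr
```

Rules present in only one run (like compress_build_align above) are dropped by the join, so check those separately.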

Some increased time is expected since you are adding 347 sequences, but indeed there are a few differences that seem to be outliers. The biggest factor is the tree building step (rule tree/augur tree/IQ-TREE) taking significantly more time in your run.

It might be helpful to run without the additional 347 sequences as a test. If the timings are similar to our global_6m, that means the increased time is due to the additional 347 sequences. Otherwise, if it takes just as long as what you’re seeing now, I would guess that the slowness is caused by some difference in installation or hardware. Can you provide installation details by pasting the output of nextstrain version --verbose?

Yes, sure.
I installed Nextstrain CLI following Installation — Nextstrain CLI 10.2.1.post1 documentation:
mamba install nextstrain-cli \
  -c conda-forge -c bioconda \
  --strict-channel-priority \
  --override-channels

And then according to: Installing Nextstrain — Nextstrain documentation
nextstrain setup --set-default docker

The pipeline was launched via:
nextstrain shell .
nextstrain build . --configfile ncov-tutorial/custom-data.yaml --cores 144

When typing nextstrain version --verbose outside an interactive session (immediately after mamba activate nextstrain-cli):
nextstrain version --verbose
Nextstrain CLI 10.2.1.post1

Python
/export/home/user/mambaforge/envs/nextstrain-cli/bin/python3.11
3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]

Runtimes
docker (default)
nextstrain/base:build-20250721T201347Z (ee93a065f54a, 2025-07-21 23:52:58)
augur 31.3.0
auspice v2.63.1
fauna 1e0c4e2

conda
nextstrain-base unknown

singularity
docker://nextstrain/base (not present)

ambient
unknown

aws-batch
unknown

Pathogens
(none)

And if I run a command inside an interactive session (after nextstrain shell .):
nextstrain version --verbose
Nextstrain CLI 10.2.1.post1

Python
/usr/local/bin/python3.11
3.11.13 (main, Jul 1 2025, 02:42:16) [GCC 12.2.0]

Runtimes
docker
unknown

conda
nextstrain-base unknown

singularity
docker://nextstrain/base (not present)

ambient (default)
augur 31.3.0
auspice 2.63.1

aws-batch
unknown

Pathogens
(none)

Thanks for your help! I’ve now launched the pipeline with the reference data only, without the custom sequences.

Thanks for the info. Those outputs indicate that you’re using the latest version of Nextstrain CLI and the Docker image, which is good. Hopefully the test run will provide more insight.

File Time (s) Memory (MB) Disk (MB)
tree_default-build.txt 4761.3668 4449.90 191.98
refine_default-build.txt 3409.1616 2084.93 99.22
ancestral_default-build.txt 288.4192 2366.36 95.84
aamuts_default-build.txt 228.8086 1067.48 97.88
traits_default-build.txt 224.4134 221.28 87.11
align_default-build.txt 32.0781 408.66 386.56
export_default-build.txt 27.1918 576.07 55.39
subsample_regions_default-build.txt 16.4839 850.14 194.43
filter_default-build.txt 10.5544 348.47 54.36
fix_colorings_default-build.txt 10.3670 93.59 79.39
mask_default-build.txt 9.1379 129.18 56.46
emerging_lineages_default-build.txt 8.8716 526.22 84.54
diagnostics_default-build.txt 8.7821 112.10 87.34
clades_default-build.txt 7.9482 526.15 83.67
index_sequences_default-build.txt 5.8465 156.56 89.29
mlr_lineage_fitness_default-build.txt 4.8670 560.36 73.56
tip_frequencies_default-build.txt 4.5644 183.50 80.38
adjust_metadata_regions_default-build.txt 3.1294 197.70 66.46
annotate_metadata_with_index_default-build.txt 2.7356 148.29 56.29
join_metadata_and_nextclade_qc_default-build.txt 2.3916 114.09 44.28
calculate_epiweeks_default-build.txt 2.1851 176.95 70.50
recency_default-build.txt 2.0782 170.48 50.44
colors_default-build.txt 1.4548 128.21 35.04
clade_files_default-build.txt 0.3357 0.98 0.00

The run has finished. I’m also providing my config:

inputs:
  - name: reference_data
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz
    sequences: https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz

refine:
  root: "Wuhan-Hu-1/2019"

Great, thank you for the update. Since these timings are similar to your previous run with additional sequences, I suspect this is due to installation or hardware differences.

Let’s focus on the tree building slowness. IQ-TREE in the Docker image is a direct download of the official pre-built binary. There are two alternatives. First, you can try using the Conda runtime, which uses an independently built IQ-TREE binary (via Bioconda). You can do this by running nextstrain setup --set-default conda and then retrying the workflow. If that doesn’t solve the issue, I can update IQ-TREE in our Docker image to a more recent version to see if that helps.

I’ve created a Docker image with IQ-TREE v3.0.1 that you can use to test:

nextstrain build --image nextstrain/base:branch-victorlin-update-iqtree . [snakemake options]

Note: for this you must use the nextstrain command on your host, not the one within nextstrain shell.

Thank you very much! I tried the Conda runtime instead of Docker, but got an error related to my system’s __glibc version:

The following package could not be installed

└─ nextstrain-base 20250729T235918Z is not installable because it requires
   └─ tsv-utils 2.2.3 hd68e0f1_8, which requires
      └─ __glibc >=2.28,<3.0.a0, which is missing on the system.
critical libmamba Could not solve for environment specs.

I asked our server admin about this issue, and he told me it is impossible to update glibc.
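For reference, the system glibc version can be checked directly (CentOS 7 ships glibc 2.17, below the >=2.28 requirement above):

```shell
# Print the system libc release; the first line of `ldd --version`
# reports it (e.g. 2.17 on CentOS 7).
ldd --version | head -n 1
```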
So I tried the second approach with the updated IQ-TREE image.
Results:

File Time (s) Memory (MB) Disk (MB)
tree_default-build.txt 3463.9374 5749.74 162.79
refine_default-build.txt 3455.6601 2121.14 99.60
ancestral_default-build.txt 285.0744 2345.63 89.72
aamuts_default-build.txt 240.2425 1045.83 93.19
traits_default-build.txt 198.4825 220.01 98.53
align_default-build.txt 33.2948 397.04 373.93
export_default-build.txt 29.2422 610.95 51.98
subsample_regions_default-build.txt 16.3662 873.46 202.67
filter_default-build.txt 12.7170 346.88 48.06
fix_colorings_default-build.txt 12.2880 80.05 78.70
mask_default-build.txt 9.8085 131.58 56.68
emerging_lineages_default-build.txt 8.8931 529.06 87.70
diagnostics_default-build.txt 8.6254 106.68 88.23
clades_default-build.txt 7.9315 529.62 82.95
index_sequences_default-build.txt 6.0268 156.98 84.81
mlr_lineage_fitness_default-build.txt 4.8771 499.26 67.10
tip_frequencies_default-build.txt 4.7400 185.51 77.41
adjust_metadata_regions_default-build.txt 3.0180 198.81 67.94
annotate_metadata_with_index_default-build.txt 2.8048 146.61 54.19
calculate_epiweeks_default-build.txt 2.3805 174.00 65.09
recency_default-build.txt 2.3386 171.00 64.16
join_metadata_and_nextclade_qc_default-build.txt 2.1682 116.27 49.37
colors_default-build.txt 1.4145 111.43 36.04
clade_files_default-build.txt 0.2747 1.23 0.00

The total improvement in run time was about 16 percent, but it’s probably still too long.

Thanks for the update. It’s unfortunate that Docker is still slow even with the newer version of IQ-TREE. There is another alternative to the Nextstrain CLI Conda runtime that should work with older versions of glibc: install the packages in your own conda environment and use the ambient runtime:

mamba activate nextstrain-cli

mamba install -c conda-forge -c bioconda -c nextstrain nextstrain-base
# or
mamba install -c conda-forge -c bioconda --yes \
      augur auspice nextclade \
      snakemake git epiweeks \
      ncbi-datasets-cli csvtk seqkit tsv-utils

nextstrain build --ambient . [snakemake options]

If you use the second mamba install command, which lists all packages individually instead of using the nextstrain-base meta-package, you should be able to get things installed. I’ll look into why the glibc requirement was bumped for tsv-utils (I maintain the feedstock).

The latest version of the Conda runtime supports glibc 2.17, so you can try this again:

nextstrain setup --set-default conda