Error in rule sanitize_metadata: ncov workflow

Hello everyone,

I’m running a Nextstrain job on local data. The metadata.tsv, sequences.fasta, and builds.yaml files were constructed based on the ncov tutorial.

First, I ran the pipeline using this command:

  $ nextstrain build . --configfile builds.yaml --cores 4 -p
  I got this error:
Error in rule sanitize_metadata:
Finished job 11.
    jobid: 18
4 of 39 steps (10%) done
    output: results/sanitized_metadata_refrences.tsv.xz
    log: logs/sanitize_metadata_refrences.txt (check log file(s) for error message)
    shell:

        python3 scripts/sanitize_metadata.py             --metadata data/references_metadata.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_refrences.tsv.xz 2>&1 | tee logs/sanitize_metadata_refrences.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Sun Oct 31 13:11:09 2021]
Error in rule sanitize_metadata:
    jobid: 13
    output: results/sanitized_metadata.tsv.xz
    log: logs/sanitize_metadata.txt (check log file(s) for error message)
    shell:

        python3 scripts/sanitize_metadata.py             --metadata data/metadata.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata.tsv.xz 2>&1 | tee logs/sanitize_metadata.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Sun Oct 31 13:11:09 2021]
Finished job 29.
5 of 39 steps (13%) done
Traceback (most recent call last):
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 653, in touch
    lutime(self.file, times)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 67, in lutime
    os.utime(f, times, follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 667, in touch_or_create
    self.touch()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 656, in touch
    raise MissingOutputException(
snakemake.exceptions.MissingOutputException: Job Output file logs/sanitize_metadata_refrences.txt of rule sanitize_metadata shall be touched but does not exist. completed successfully, but some output files are missing.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/__init__.py", line 699, in snakemake
    success = workflow.execute(
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/workflow.py", line 1069, in execute
    success = self.scheduler.schedule()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/scheduler.py", line 441, in schedule
    self._error_jobs()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/scheduler.py", line 557, in _error_jobs
    self._handle_error(job)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/scheduler.py", line 614, in _handle_error
    self.get_executor(job).handle_job_error(job)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 611, in handle_job_error
    super().handle_job_error(job)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 277, in handle_job_error
    job.postprocess(
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/jobs.py", line 1009, in postprocess
    self.dag.handle_log(self)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/dag.py", line 638, in handle_log
    f.touch_or_create()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 679, in touch_or_create
    with open(file, "w") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'logs/sanitize_metadata_refrences.txt'

I think there is an issue with the sanitize_metadata rule and with generating its log file.

Second, I tried this command:

 $ snakemake --profile .  -p
 I got this error: 
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with

    snakemake --cleanup-metadata <filenames>

To re-generate the files rerun your command with the --rerun-incomplete flag.
Incomplete files:
results/sanitized_metadata.tsv.xz
results/sanitized_metadata_refrences.tsv.xz

Then I ran:
$ snakemake --profile . -p --rerun-incomplete
and got this error:

Error in rule sanitize_metadata:
    jobid: 13
    output: results/sanitized_metadata.tsv.xz
[Sun Oct 31 13:29:09 2021]
    log: logs/sanitize_metadata.txt (check log file(s) for error message)
Error in rule sanitize_metadata:
    shell:

        python3 scripts/sanitize_metadata.py             --metadata data/metadata.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata.tsv.xz 2>&1 | tee logs/sanitize_metadata.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    jobid: 18
    output: results/sanitized_metadata_refrences.tsv.xz
    log: logs/sanitize_metadata_refrences.txt (check log file(s) for error message)
Logfile logs/sanitize_metadata.txt not found.
    shell:

        python3 scripts/sanitize_metadata.py             --metadata data/references_metadata.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_refrences.tsv.xz 2>&1 | tee logs/sanitize_metadata_refrences.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/sanitize_metadata_refrences.txt not found.

Traceback (most recent call last):
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 653, in touch
    lutime(self.file, times)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 67, in lutime
    os.utime(f, times, follow_symlinks=False)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 667, in touch_or_create
    self.touch()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 656, in touch
    raise MissingOutputException(
snakemake.exceptions.MissingOutputException: Job Output file logs/sanitize_metadata.txt of rule sanitize_metadata shall be touched but does not exist. completed successfully, but some output files are missing.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/__init__.py", line 699, in snakemake
    success = workflow.execute(
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/workflow.py", line 1069, in execute
    success = self.scheduler.schedule()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/scheduler.py", line 441, in schedule
    self._error_jobs()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/scheduler.py", line 557, in _error_jobs
    self._handle_error(job)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/scheduler.py", line 614, in _handle_error
    self.get_executor(job).handle_job_error(job)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 611, in handle_job_error
    super().handle_job_error(job)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 277, in handle_job_error
    job.postprocess(
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/jobs.py", line 1009, in postprocess
    self.dag.handle_log(self)
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/dag.py", line 638, in handle_log
    f.touch_or_create()
  File "/home/bioinformatics/miniconda3/envs/nextstrain/lib/python3.8/site-packages/snakemake/io.py", line 679, in touch_or_create
    with open(file, "w") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'logs/sanitize_metadata.txt'

So, if anyone has had the same error or knows how to solve it, kindly share with us.

Thank you,

Hi @AroobAlhumaidy, I’m not sure if this is the same issue described in this other post, since the log file itself isn’t getting created.

What happens when you run the sanitize metadata command by itself like so from your nextstrain Conda environment?

python3 scripts/sanitize_metadata.py \
  --metadata data/references_metadata.tsv \
  --metadata-id-columns strain name 'Virus name' \
  --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession \
  --parse-location-field Location \
  --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content \
  --strip-prefixes hCoV-19/ SARS-CoV-2/ \
  --output results/sanitized_metadata_refrences.tsv.xz 2>&1 | tee logs/sanitize_metadata_refrences.txt

My first thought is that the tee command does not exist or is not behaving like we’d expect.
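To test that hypothesis outside of Snakemake, here is a minimal sketch. It assumes bash is installed and uses a hypothetical helper name, `check_tee_pipeline`, and a hypothetical scratch file, `logs/tee_check.txt`, neither of which is part of the ncov workflow. It runs a trivial command through `tee` under the `set -euo pipefail` options that Snakemake's "bash strict mode" message refers to, so a failure in `tee` alone is enough to make the whole pipeline exit non-zero.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ncov workflow): pipe a test message
# through tee into the given log path under bash strict mode and return the
# pipeline's exit status. A child bash is used so that strict mode does not
# leak into the calling shell.
check_tee_pipeline() {
    bash -c 'set -euo pipefail; echo "test message" 2>&1 | tee "$1" > /dev/null' _ "$1"
}

# 1. Is tee on the PATH at all?
if command -v tee > /dev/null; then
    echo "tee found at: $(command -v tee)"
else
    echo "tee is not on the PATH"
fi

# 2. Can tee write a log file where the workflow expects to write one?
#    logs/tee_check.txt is a scratch file name chosen for this sketch.
if check_tee_pipeline logs/tee_check.txt; then
    echo "tee can write to logs/ (remove logs/tee_check.txt afterwards)"
else
    echo "the pipeline failed: tee could not write to logs/tee_check.txt"
fi
```

If the second check fails while the first succeeds, `tee` itself is fine and the problem is more likely the target path than the sanitize script.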

Hello Jihudd,

Thank you for your response,

I tried the suggested command and got this:

# Command: 
$ python3 scripts/sanitize_metadata.py   --metadata data/references_metadata.tsv   --metadata-id-columns strain name 'ncov'   --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession   --parse-location-field Location   --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content   --strip-prefixes hCoV-19/ SARS-CoV-2/   --output results/sanitized_metadata_refrences.tsv.xz 2>&1 | tee logs/sanitize_metadata_refrences.txt

#Output
tee: logs/sanitize_metadata_refrences.txt: No such file or directory

We are using local metadata, not metadata from GISAID.

Also, the tee command exists in the Nextstrain environment.

We appreciate your help,

Thank you for checking, @AroobAlhumaidy. The error from tee suggests that the logs directory doesn’t exist. Can you confirm that logs/ exists and then create it with mkdir logs if it doesn’t? Then you should be able to run the sanitize command again without issues.

Snakemake tries to create directories that don’t exist for all output files including logs, so it’s strange that it fails to do so in your original example. Would you mind also confirming which version of Snakemake you see when you run snakemake --version?
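The directory check described above can be sketched as follows. `ensure_log_dir` is a hypothetical helper name introduced only for illustration. Note that `logs/` is resolved relative to the directory the command is run from, so the sketch prints the working directory first; whether that is relevant to this particular failure is an assumption, not something the thread confirms.

```shell
# Hypothetical helper: make sure the directory that will hold a log file
# exists, creating it if necessary, and succeed only if it is writable.
ensure_log_dir() {
    log_dir=$(dirname "$1")
    if [ ! -d "$log_dir" ]; then
        echo "creating missing directory: $log_dir"
        mkdir -p "$log_dir"
    fi
    [ -w "$log_dir" ]
}

# Relative paths such as logs/... are resolved against this directory.
echo "working directory: $(pwd)"

# Log path taken from the failing rule in this thread.
if ensure_log_dir logs/sanitize_metadata_refrences.txt; then
    echo "logs/ exists and is writable"
else
    echo "logs/ is not writable"
fi
```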

Hello jlhudd,

The logs directory exists and contains other logs, but none for sanitize_metadata:

(nextstrain) PATH/ncov$ ls PATH/ncov/logs/
align_refrences.txt  align_S.txt   mask_refrences.txt  mask_S.txt  sanitize_sequences_refrences.txt  sanitize_sequences_S.txt

As for the snakemake version:

(nextstrain) PATH/ncov$ snakemake --version
6.10.0

Thank you,

I pulled the Nextstrain repository again. The good news is that the sanitize_metadata error has been resolved.

However, I got this one:

Error in rule tree:
    jobid: 6
    output: results/default-build/tree_raw.nwk
    log: logs/tree_default-build.txt (check log file(s) for error message)
    shell:

        augur tree             --alignment results/default-build/aligned.fasta             --tree-builder-args '-ninit 10 -n 4'             --output results/default-build/tree_raw.nwk             --nthreads 1 2>&1 | tee logs/tree_default-build.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/tree_default-build.txt:

ERROR: Shell exited 2 when running: iqtree2 -ninit 2 -n 2 -me 0.05 -nt 1 -s results/default-build/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/default-build/aligned-delim.iqtree.log
Command output was:
  ERROR: Alignment must have at least 3 sequences

Building a tree via:
        iqtree2 -ninit 2 -n 2 -me 0.05 -nt 1 -s results/default-build/aligned-delim.fasta -m GTR -ninit 10 -n 4 > results/default-build/aligned-delim.iqtree.log
        Nguyen et al: IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.
        Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300

ERROR: TREE BUILDING FAILED
Please see the log file for more details: results/default-build/aligned-delim.iqtree.log

Building original tree took 0.03627824783325195 seconds

I suspect that something went wrong with the alignment, because when I checked aligned-delim.fasta and aligned.fasta, each contained only one sequence (the reference).
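A quick way to confirm that count is sketched below. IQ-TREE reports that it needs at least 3 sequences, and each FASTA record starts with a `>` header line, so counting those lines gives the number of sequences. `count_fasta_records` is a hypothetical helper name; the two paths are the ones from the log above.

```shell
# Hypothetical helper: print the number of records in a FASTA file by
# counting header lines. grep -c exits with status 1 when the count is 0,
# so "|| true" keeps that case from being treated as an error.
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Paths taken from the tree rule's log in this thread.
for fasta in results/default-build/aligned.fasta \
             results/default-build/aligned-delim.fasta; do
    if [ -f "$fasta" ]; then
        echo "$fasta: $(count_fasta_records "$fasta") sequence(s)"
    else
        echo "$fasta: not found"
    fi
done
```

Running the same function on the input sequences and on the outputs of earlier steps would show at which step the sequences are dropped.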

Thank you