Error when running sanitize_sequences.py on whole GISAID dump

Hello,

I am following the data preparation protocol described on the "Preparing your data" page of the SARS-CoV-2 Workflow documentation, but when I run sanitize_sequences.py I get this error:

Traceback (most recent call last):
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/augur/io.py", line 33, in open_file
    with xopen(path_or_buffer, mode, **kwargs) as handle:
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/xopen/__init__.py", line 803, in xopen
    filename = os.fspath(filename)
TypeError: expected str, bytes or os.PathLike object, not PipedIGzipWriter

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/augur/io.py", line 34, in open_file
    yield handle
  File "/lustrehome/dsimone/nextstrain/ncov/scripts/sanitize_sequences.py", line 130, in <module>
    write_sequences(sequence, output_handle)
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/augur/io.py", line 187, in write_sequences
    sequences_written = Bio.SeqIO.write(
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/Bio/SeqIO/__init__.py", line 518, in write
    fp.write(format_function(record))
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/xopen/__init__.py", line 239, in write
    self._file.write(arg)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lustrehome/dsimone/nextstrain/ncov/scripts/sanitize_sequences.py", line 136, in <module>
    sys.exit(1)
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/augur/io.py", line 34, in open_file
    yield handle
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/xopen/__init__.py", line 135, in __exit__
    self.close()
  File "/lustrehome/dsimone/miniconda3/envs/nextstrain/lib/python3.9/site-packages/xopen/__init__.py", line 249, in close
    raise OSError(
OSError: Output igzip process terminated with exit code 252

This happens after about 20 minutes of running. I don't know if this is useful, but when the script fails it leaves behind a huge fasta.gz file: 11 GB with 1.4M sequences, compared to the 2.2 GB input FASTA file, which has >3M sequences.

Thanks for your help!

Domenico

Hello Domenico,

Thank you for sharing this! I was able to reproduce this once without the TypeError on my machine, but it ran successfully the second time. Have you also found this to be transient, or does it always happen?

– Victor

@dsimone Thanks for your useful report! I have a hunch after comparing the tracebacks with the source code for augur.io and sanitize_sequences.py. Do you see any additional error messages in the log lines before the errors you posted? In particular, I wonder if you see ERROR: The following strains have duplicate sequences: …?
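
If the log has already scrolled away, one way to check for duplicates independently of the script is a quick scan of the FASTA headers. The sketch below is only an illustration, not part of the ncov workflow: `open_text` and `find_duplicate_strains` are hypothetical helper names, it uses only the Python standard library, and it assumes the strain name is the first whitespace-delimited token after `>` (which is how Biopython derives a record's id). It reports repeated names, not identical sequences.

```python
import gzip
import lzma
from collections import Counter
from pathlib import Path


def open_text(path):
    """Open a plain, gzip-compressed, or xz-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    if path.suffix == ".xz":
        return lzma.open(path, "rt")
    return open(path, "rt")


def find_duplicate_strains(fasta_path):
    """Return {strain_name: count} for every name that appears more than once."""
    counts = Counter()
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the strain name is the first token of the header line.
                fields = line[1:].split()
                counts[fields[0] if fields else ""] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `find_duplicate_strains` on the input file (for example a hypothetical `sequences.fasta.gz`) returns an empty dict when every name is unique. A non-empty result would be consistent with the duplicate-strain hunch above, though it would not by itself confirm that this is what triggers the broken pipe.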