UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1252: invalid start byte

Hello,
I was running Nextstrain normally today, but now I'm receiving error messages and I can't figure out where the error is coming from.
Here is the entire output:

(nextstrain) [lcc88@cbsuahdcvir ncov]$ nextstrain build . --cores 16 --configfile my_profiles/builds.yaml
Your config specifies 'skip_travel_history_adjustment=True'. This is now always the case, and thus this parameter can be removed.
Building DAG of jobs...
Using shell: /home/lcc88/.nextstrain/runtimes/conda/env/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
        count   jobs
        1       add_branch_labels
        1       adjust_metadata_regions
        1       all
        1       ancestral
        1       annotate_metadata_with_index
        1       assign_rbd_levels
        1       build_align
        1       build_description
        1       calculate_epiweeks
        1       clade_files
        1       clades
        1       combine_input_metadata
        1       combine_samples
        1       combine_sequences_for_subsampling
        1       diagnostic
        1       distances
        1       emerging_lineages
        1       export
        1       filter
        1       finalize
        1       include_hcov19_prefix
        1       index
        1       join_metadata_and_nextclade_qc
        1       logistic_growth
        1       mask
        1       mutational_fitness
        1       recency
        1       refine
        1       rename_emerging_lineages
        1       sanitize_metadata
        1       subsample
        1       tip_frequencies
        1       traits
        1       translate
        1       tree
        35

[Thu Mar  9 15:31:09 2023]
rule sanitize_metadata:
    input: data/CCTL_sequencing/metadata_03-09-23.tsv
    output: results/sanitized_metadata_custom_data.tsv.xz
    log: logs/sanitize_metadata_custom_data.txt
    jobid: 37
    benchmark: benchmarks/sanitize_metadata_custom_data.txt
    wildcards: origin=custom_data
    resources: mem_mb=2000


        python3 scripts/sanitize_metadata.py             --metadata data/CCTL_sequencing/metadata_03-09-23.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_custom_data.tsv.xz 2>&1 | tee logs/sanitize_metadata_custom_data.txt


[Thu Mar  9 15:31:09 2023]
rule clade_files:
    input: defaults/clades.tsv
    output: results/All_CCTL_sequences_03-09-23/clades.tsv
    jobid: 25
    benchmark: benchmarks/clade_files_All_CCTL_sequences_03-09-23.txt
    wildcards: build_name=All_CCTL_sequences_03-09-23


        cat defaults/clades.tsv > results/All_CCTL_sequences_03-09-23/clades.tsv


[Thu Mar  9 15:31:09 2023]
Job 32:
        Combine and deduplicate aligned FASTAs from multiple origins in preparation for subsampling.



        python3 scripts/sanitize_sequences.py                 --sequences results/aligned_custom_data.fasta.xz results/aligned_references.fasta.xz                 --strip-prefixes hCoV-19/ SARS-CoV-2/                                  --output /dev/stdout                 | xz -c -2 > results/combined_sequences_for_subsampling.fasta.xz


[Thu Mar  9 15:31:09 2023]
Job 19: Templating build description for Auspice

[Thu Mar  9 15:31:09 2023]
Finished job 25.
1 of 35 steps (3%) done
Your config specifies 'skip_travel_history_adjustment=True'. This is now always the case, and thus this parameter can be removed.
Job counts:
        count   jobs
        1       build_description
        1
[Thu Mar  9 15:31:10 2023]
Finished job 19.
2 of 35 steps (6%) done
Traceback (most recent call last):
  File "/local/workdir/lcc88/Nextstrain/ncov/scripts/sanitize_metadata.py", line 405, in <module>
    database_ids_by_strain = get_database_ids_by_strain(
  File "/local/workdir/lcc88/Nextstrain/ncov/scripts/sanitize_metadata.py", line 211, in get_database_ids_by_strain
    for metadata in metadata_reader:
  File "/home/lcc88/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
    return self.get_chunk()
  File "/home/lcc88/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
    return self.read(nrows=size)
  File "/home/lcc88/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/lcc88/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 250, in read
    content = self._get_lines(rows)
  File "/home/lcc88/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 1114, in _get_lines
    new_rows.append(next(self.data))
  File "/home/lcc88/.nextstrain/runtimes/conda/env/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1252: invalid start byte
[Thu Mar  9 15:31:10 2023]
Error in rule sanitize_metadata:
    jobid: 37
    output: results/sanitized_metadata_custom_data.tsv.xz
    log: logs/sanitize_metadata_custom_data.txt (check log file(s) for error message)
    shell:

        python3 scripts/sanitize_metadata.py             --metadata data/CCTL_sequencing/metadata_03-09-23.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_custom_data.tsv.xz 2>&1 | tee logs/sanitize_metadata_custom_data.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Thu Mar  9 15:31:11 2023]
Finished job 32.
3 of 35 steps (9%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /local/workdir/lcc88/Nextstrain/ncov/.snakemake/log/2023-03-09T153109.250600.snakemake.log

Thank you

Leonardo

Hey @leocaserta - this appears to be caused by the data/CCTL_sequencing/metadata_03-09-23.tsv file using Windows (windows-1252) encoding - does that seem right to you? There are lots of ways to convert a file to UTF-8 (e.g. on Windows, on Unix, or using VS Code), so I would suggest trying one of them and seeing if that fixes things. Good luck - and let us know how you get on!
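
If you'd rather do the conversion from the command line, here is a minimal Python sketch (standard library only) that re-reads the file as windows-1252 and writes it back out as UTF-8. The paths are just examples taken from your build and should be adjusted to your own files:

    # Example only: re-encode a windows-1252 (cp1252) TSV as UTF-8.
    # Adjust src/dst to point at your own metadata file.
    src = "data/CCTL_sequencing/metadata_03-09-23.tsv"
    dst = "data/CCTL_sequencing/metadata_03-09-23.utf8.tsv"

    # newline="" on both handles keeps the original line endings;
    # only the character encoding changes.
    with open(src, encoding="windows-1252", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)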


Hi @james, thank you for your response - it is working now!
I opened the file in Notepad, clicked "Save As", and set the encoding to UTF-8.
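
In case it's useful to anyone else hitting this, a quick way to confirm the re-saved file really is UTF-8 (the path is just an example) is to read it back with Python:

    # Raises UnicodeDecodeError if any non-UTF-8 bytes remain in the file.
    with open("data/CCTL_sequencing/metadata_03-09-23.tsv", encoding="utf-8") as f:
        f.read()
    print("File decodes cleanly as UTF-8")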

Thank you
