Sanitize_metadata.py error: ERROR: ' ' expected after '"'

Hi everyone,

I am running a Nextstrain job on a server. I first ran into an issue with the sanitize_metadata.py script:

Error in rule sanitize_metadata:
    jobid: 12
    output: results/sanitized_metadata_October-data.tsv.xz
    log: logs/sanitize_metadata_October-data.txt (check log file(s) for error message)
    shell:
        
        python3 scripts/sanitize_metadata.py             --metadata /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_October-data.tsv.xz 2>&1 | tee logs/sanitize_metadata_October-data.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/sanitize_metadata_October-data.txt:
ERROR: field larger than field limit (131072)

But I easily fixed it by adding the following line to sanitize_metadata.py:

csv.field_size_limit(sys.maxsize)

According to this discussion on Stack Overflow, this line should fix the problem. It did, but then I got another error:

[Sun Oct 31 22:31:38 2021]
rule sanitize_metadata:
    input: /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv
    output: results/sanitized_metadata_October-data.tsv.xz
    log: logs/sanitize_metadata_October-data.txt
    jobid: 12
    benchmark: benchmarks/sanitize_metadata_October-data.txt
    wildcards: origin=October-data
    resources: tmpdir=/tmp, mem_mb=8000


        python3 scripts/sanitize_metadata.py             --metadata /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_October-data.tsv.xz 2>&1 | tee logs/sanitize_metadata_October-data.txt
        
[Sun Oct 31 22:31:48 2021]
Error in rule sanitize_metadata:
    jobid: 12
    output: results/sanitized_metadata_October-data.tsv.xz
    log: logs/sanitize_metadata_October-data.txt (check log file(s) for error message)
    shell:
        
        python3 scripts/sanitize_metadata.py             --metadata /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_October-data.tsv.xz 2>&1 | tee logs/sanitize_metadata_October-data.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/sanitize_metadata_October-data.txt:
ERROR: '	' expected after '"'


Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

I believe the problem might be with my metadata file, but I’m using the metadata.tsv downloaded from GISAID, and so far I haven’t had any issues with it.

Honestly, so far I don’t even have a clue what went wrong…

Thank you in advance!

Best,
Dmitrii

I have the same issue

Hi @qwerty123, it looks like the CSV field size limit is a general Python limitation that we should do a better job of handling. I created an issue to handle this better in Augur (which is the tool that gets used behind the scenes from the sanitize scripts).
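
For readers hitting the same first error, here is a minimal sketch of the limit itself, using only Python's standard csv module. The helper name raise_field_size_limit and the sample row are hypothetical; this is not the code used by Augur or sanitize_metadata.py. The fallback loop is there because csv.field_size_limit(sys.maxsize) can raise OverflowError on platforms where a C long is 32 bits.

```python
import csv
import io
import sys


def raise_field_size_limit():
    """Raise the csv field size limit as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # sys.maxsize does not fit in a C long here; try a smaller value.
            limit = limit // 10


# Hypothetical row with one field longer than the default limit of 131072.
big_row = "strain\t" + "x" * 200_000 + "\n"

csv.field_size_limit(131072)  # make sure the default limit is in effect
try:
    next(csv.reader(io.StringIO(big_row), delimiter="\t"))
    failed = False
    message = ""
except csv.Error as error:
    failed = True
    message = str(error)

# After raising the limit, the same row parses.
raise_field_size_limit()
row = next(csv.reader(io.StringIO(big_row), delimiter="\t"))
```

The limit is a process-wide setting of the csv module, so it has to be raised before the file is read.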

Regarding the next error you ran into, we need to inspect the logs to figure out what’s happening. Can you share the contents of the log file logs/sanitize_metadata_October-data.txt?

Edit: I forgot to note that I also just ran the same command you shared on the latest GISAID metadata, so hopefully the logs output will help us figure out the root cause here.

Hi @jlhudd, I checked the log file you mentioned, logs/sanitize_metadata_October-data.txt:

ERROR: '	' expected after '"'

Unfortunately, that’s all that I have in the log file.

There is also an encoding issue on the server I’m using to run my job:

Traceback (most recent call last):
  File "/lustre/scratch2/ws/1/dmse952c-p_cov2muta/ncov_covid/scripts/sanitize_metadata.py", line 415, in <module>
    metadata.to_csv(
  File "/scratch/ws/1/dmse952c-p_cov2muta/ncov/lib/python3.9/site-packages/pandas/core/generic.py", line 3466, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/scratch/ws/1/dmse952c-p_cov2muta/ncov/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
    csv_formatter.save()
  File "/scratch/ws/1/dmse952c-p_cov2muta/ncov/lib/python3.9/site-packages/pandas/io/formats/csvs.py", line 257, in save
    self._save()
  File "/scratch/ws/1/dmse952c-p_cov2muta/ncov/lib/python3.9/site-packages/pandas/io/formats/csvs.py", line 262, in _save
    self._save_body()
  File "/scratch/ws/1/dmse952c-p_cov2muta/ncov/lib/python3.9/site-packages/pandas/io/formats/csvs.py", line 300, in _save_body
    self._save_chunk(start_i, end_i)
  File "/scratch/ws/1/dmse952c-p_cov2muta/ncov/lib/python3.9/site-packages/pandas/io/formats/csvs.py", line 311, in _save_chunk
    libwriters.write_csv_rows(
  File "pandas/_libs/writers.pyx", line 72, in pandas._libs.writers.write_csv_rows
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 238: ordinal not in range(256)
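
A minimal sketch of what this traceback indicates, assuming the output file was opened with a latin-1 default encoding taken from the server's locale (that cause is a guess, not something the traceback confirms). The sample text is hypothetical. The left double quotation mark U+201C has no latin-1 representation, but it encodes fine as UTF-8 when the encoding is given explicitly.

```python
import io

# Hypothetical metadata value containing curly quotation marks.
text = "note: \u201cquoted\u201d"

# latin-1 cannot represent U+201C or U+201D.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Writing through a text wrapper with an explicit UTF-8 encoding succeeds.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
wrapper.write(text)
wrapper.flush()
written = buffer.getvalue()
```

If the locale is the cause, running Python in UTF-8 mode (the PYTHONUTF8=1 environment variable, available since Python 3.7) may avoid the error without editing the metadata.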

To work around it, I use vim to replace the left and right double quotation marks (“ and ”) with a plain quotation mark ("). When I then rerun the job, I hit the error I mentioned above:

ERROR: '	' expected after '"'

Maybe the quotation mark replacement is corrupting some field in the metadata, but without it I can’t run Nextstrain.
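
One way that suspicion could be tested is sketched below with Python's standard csv module. It assumes the reader runs in strict mode, which is not confirmed for sanitize_metadata.py, and the sample row is hypothetical. Curly quotes are ordinary characters to the parser, but once a field starts with a plain quotation mark the parser treats it as a quoted field and expects a delimiter right after the closing quote.

```python
import csv
import io

# Hypothetical tab-delimited row whose second field starts with a curly quote.
original = "hCoV-19/example/1/2021\t\u201cA\u201d sample from clinic\n"
replaced = original.replace("\u201c", '"').replace("\u201d", '"')


def parse(line, **options):
    """Parse one tab-delimited line with the given csv reader options."""
    return next(csv.reader(io.StringIO(line), delimiter="\t", **options))


# Curly quotes parse as plain text, even in strict mode.
before = parse(original, strict=True)

# After replacement, strict mode rejects the text following the closing quote.
try:
    parse(replaced, strict=True)
    error_message = None
except csv.Error as error:
    error_message = str(error)

# With quoting disabled, the plain quotes are kept as literal characters.
literal = parse(replaced, quoting=csv.QUOTE_NONE)
```

If this is what happened, replacing the curly quotes with an apostrophe instead of a double quotation mark would sidestep the quoting rules.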

Best,
Dmitrii

@qwerty123, would you be comfortable sharing your input file, /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv, via a direct message to me (if possible, also compressed)? I have a couple of ideas about what’s happening with both error messages, but testing with the actual data would help to confirm.

In the short term, we can modify the sanitize_metadata.py script to provide more detailed error messages (like the full traceback from your server run). These details can help a lot with debugging.
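
As a rough sketch of that idea (not the change that was later made to the workflow), a script can print the full traceback next to the short error line. The names run_with_traceback and failing_task are hypothetical.

```python
import sys
import traceback


def run_with_traceback(task):
    """Run task(); on failure print the short error and the full traceback."""
    try:
        task()
        return 0
    except Exception as error:
        # The one-line message alone is what ended up in the log file above.
        print(f"ERROR: {error}", file=sys.stderr)
        # The traceback shows which call actually raised the error.
        traceback.print_exc(file=sys.stderr)
        return 1


def failing_task():
    """Hypothetical stand-in for the metadata parsing step."""
    raise ValueError("example failure")


exit_code = run_with_traceback(failing_task)
```

Because the Snakemake rule pipes stderr into the log with 2>&1 | tee, the traceback would land in the same log file as the error line.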

@jlhudd, I found the solution! Before, I was using metadata from GISAID with a header like this:

strain	virus	gisaid_epi_isl	genbank_accession	date	region	country	division	location	region_exposure	country_exposure	division_exposure	segment	length	host	age	sex	Nextstrain_clade	pango_lineage	GISAID_clade	originating_lab	submitting_lab	authors	url	title	paper_url	date_submitted	purpose_of_sequencing

But now I have downloaded a new one with a header like this:

Virus name	Type	Accession ID	Collection date	Location	Additional location information	Sequence length	Host	Patient age	Gender	Clade	Pango lineage	Pangolin version	Variant	AA Substitutions	Submission date	Is reference?	Is complete?	Is high coverage?	Is low coverage?	N-Content	GC-Content

and it works perfectly.

It looks like the issue was the wrong metadata format. I am very sorry for the confusion.
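
For anyone who wants to check which of the two formats a file has before starting a run, here is a minimal sketch that reads only the header line. The column names come from the two headers quoted above. The labels "gisaid-download" and "nextmeta-style" and the function name describe_header are hypothetical, and the sketch assumes an uncompressed, tab-delimited file.

```python
import csv
import io

# A few columns taken from each of the two headers quoted in this thread.
GISAID_DOWNLOAD_COLUMNS = {"Virus name", "Accession ID", "Collection date", "Location"}
NEXTMETA_STYLE_COLUMNS = {"strain", "gisaid_epi_isl", "date", "region"}


def describe_header(handle):
    """Return a hypothetical label for the metadata format of a TSV handle."""
    header = next(csv.reader(handle, delimiter="\t"))
    columns = set(header)
    if GISAID_DOWNLOAD_COLUMNS <= columns:
        return "gisaid-download"
    if NEXTMETA_STYLE_COLUMNS <= columns:
        return "nextmeta-style"
    return "unknown"


# Usage with in-memory examples; a real file would be opened with open(path).
new_style = io.StringIO("Virus name\tType\tAccession ID\tCollection date\tLocation\n")
old_style = io.StringIO("strain\tvirus\tgisaid_epi_isl\tdate\tregion\n")
new_label = describe_header(new_style)
old_label = describe_header(old_style)
```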

Thank you very much for your help and for your time. I really appreciate it!

Best,
Dmitrii

Great, I’m glad you got this working! The other metadata you were using looks sort of like the “nextmeta” file GISAID used to provide, although I thought that file had been dropped from the list of possible downloads a while ago.

In any case, I just merged a change to the ncov workflow that will make the sanitize metadata error logs more informative for unexpected errors. If you pull the latest version of the workflow, you should get more helpful errors (but hopefully you won’t see any more errors :wink: ).