Hi everyone,
I am running a Nextstrain job on a server. I already ran into an issue with the sanitize_metadata.py script:
Error in rule sanitize_metadata:
jobid: 12
output: results/sanitized_metadata_October-data.tsv.xz
log: logs/sanitize_metadata_October-data.txt (check log file(s) for error message)
shell:
python3 scripts/sanitize_metadata.py --metadata /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv --metadata-id-columns strain name 'Virus name' --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession --parse-location-field Location --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content --strip-prefixes hCoV-19/ SARS-CoV-2/ --output results/sanitized_metadata_October-data.tsv.xz 2>&1 | tee logs/sanitize_metadata_October-data.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/sanitize_metadata_October-data.txt:
ERROR: field larger than field limit (131072)
But I easily fixed it by adding the following line to sanitize_metadata.py:
csv.field_size_limit(sys.maxsize)
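For anyone reproducing this, here is a minimal standalone sketch of that change. It assumes the script reads the metadata through Python's csv module; raise_field_size_limit is a hypothetical helper name of mine and is not part of sanitize_metadata.py. The fallback loop is only there because sys.maxsize can be too large for csv.field_size_limit on some platforms.

```python
import csv
import sys


def raise_field_size_limit():
    """Raise the csv field size limit as high as the platform allows.

    Hypothetical helper for illustration; not part of sanitize_metadata.py.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit into a C long on this platform; try a smaller one.
            limit //= 10


limit = raise_field_size_limit()

# A field longer than the default limit of 131072 characters now parses.
long_row = "strain\t" + "x" * 200000
parsed = next(csv.reader([long_row], delimiter="\t"))
print(len(parsed[1]))
```

On the Linux server, calling csv.field_size_limit(sys.maxsize) directly worked for me, so the loop is just a precaution.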
According to this discussion on Stack Overflow, the csv.field_size_limit(sys.maxsize) line should fix the problem. It did, but then I got another error:
[Sun Oct 31 22:31:38 2021]
rule sanitize_metadata:
input: /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv
output: results/sanitized_metadata_October-data.tsv.xz
log: logs/sanitize_metadata_October-data.txt
jobid: 12
benchmark: benchmarks/sanitize_metadata_October-data.txt
wildcards: origin=October-data
resources: tmpdir=/tmp, mem_mb=8000
python3 scripts/sanitize_metadata.py --metadata /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv --metadata-id-columns strain name 'Virus name' --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession --parse-location-field Location --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content --strip-prefixes hCoV-19/ SARS-CoV-2/ --output results/sanitized_metadata_October-data.tsv.xz 2>&1 | tee logs/sanitize_metadata_October-data.txt
[Sun Oct 31 22:31:48 2021]
Error in rule sanitize_metadata:
jobid: 12
output: results/sanitized_metadata_October-data.tsv.xz
log: logs/sanitize_metadata_October-data.txt (check log file(s) for error message)
shell:
python3 scripts/sanitize_metadata.py --metadata /projects/p_cov2muta/211015_GISAID/metadata_latin_aboveregions.tsv --metadata-id-columns strain name 'Virus name' --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession --parse-location-field Location --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content --strip-prefixes hCoV-19/ SARS-CoV-2/ --output results/sanitized_metadata_October-data.tsv.xz 2>&1 | tee logs/sanitize_metadata_October-data.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/sanitize_metadata_October-data.txt:
ERROR: ' ' expected after '"'
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I believe the problem might be in my metadata file, but I’m using the metadata.tsv downloaded from GISAID, and so far I haven’t had any issues with it.
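To check whether the file really is the culprit, something like the sketch below could locate the rows that trip the parser. It rests on assumptions I have not confirmed: that the script reads the TSV with Python's csv module, a tab delimiter and strict parsing, which would explain the wording of the error. find_bad_rows and the sample rows are hypothetical and made up for illustration.

```python
import csv

# Allow very long fields so that only quoting problems are reported.
csv.field_size_limit(2**31 - 1)


def find_bad_rows(lines, delimiter="\t", limit=10):
    """Return (line_number, message) pairs for lines that fail strict csv parsing.

    Hypothetical helper: each physical line is parsed on its own, so a quoted
    field that legitimately spans several lines would also be reported.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            list(csv.reader([line], delimiter=delimiter, strict=True))
        except csv.Error as error:
            bad.append((number, str(error)))
            if len(bad) >= limit:
                break
    return bad


# Made-up rows; the third has a quote that is closed in the middle of a field.
sample = [
    "strain\tlocation\n",
    "hCoV-19/A\tEurope\n",
    'hCoV-19/B\t"Some"place\n',
]
bad_rows = find_bad_rows(sample)
print(bad_rows)

# For the real file, the same function can be given an open file handle:
# with open("metadata.tsv", newline="", encoding="utf-8") as handle:
#     print(find_bad_rows(handle))
```

If this reports lines in the GISAID file, a stray double quote in one of the fields would be my first suspect.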
Honestly, so far I don’t even have a clue what went wrong…
Thank you in advance!
Best,
Dmitrii