Value error: trying to merge on object and int64 columns

leocaserta · May 12, 2022, 7:43pm

Hello,
I’m trying to run Nextstrain but I’m receiving the following error message. I wonder if it is something related to the date format. Is someone able to explain how to fix it?

Traceback (most recent call last):
  File "scripts/annotate_metadata_with_index.py", line 32, in <module>
    metadata.merge(
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/frame.py", line 9345, in merge
    return merge(
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 107, in merge
    op = _MergeOperation(
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
    self._maybe_coerce_merge_keys()
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1257, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
[Thu May 12 15:30:08 2022]
Error in rule annotate_metadata_with_index:
    jobid: 23
    output: results/WTD-NY/metadata_with_index.tsv
    log: logs/annotate_metadata_with_index_WTD-NY.txt (check log file(s) for error message)
    conda-env: /local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8
    shell:

        python3 scripts/annotate_metadata_with_index.py             --metadata results/WTD-NY/metadata_with_nextclade_qc.tsv             --sequence-index results/WTD-NY/sequence_index.tsv             --output results/WTD-NY/metadata_with_index.tsv

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

james · May 12, 2022, 9:23pm

Hmm. Could you share a line or two of the results/WTD-NY/metadata_with_nextclade_qc.tsv file (please remove any sensitive information there) to help us see what’s happening?

leocaserta · May 13, 2022, 4:43pm

Sure. Here are the first two lines, opened in Excel:

strain	virus	date	region	country	division	location	Nextstrain_clade	Nextclade_pango	QC_overall_score	QC_overall_status	divergence	missing_data	nonACGTN	substitutions	deletions	insertions	reversion_mutations	potential_contaminants	rare_mutations	frame_shifts	aaSubstitutions	QC_missing_data	QC_mixed_sites	QC_rare_mutations	QC_snp_clusters	QC_frame_shifts	QC_stop_codons	clock_deviation
xxxxxx	ncov	2021-11-20	North America	USA	New York	xxxxx	xxxxx	xxxxx	34.958933	mediocre	50	940	0				0	1	16			good	good	mediocre	good	good	good	8
xxxxxx	ncov	2021-11-20	North America	USA	New York	xxxxxx	xxxxx	xxxxx	473.157822	bad	51	5168	0				0	2	22			bad	good	mediocre	good	mediocre	good	9

joverlee · May 13, 2022, 10:25pm

The particular script causing the error merges the metadata and the index on the column strain. From the error message, it seems like you may have values in your strain column that caused it to be interpreted as integers.

This should be an error that we can fix on our end by forcing the dtype of strain to always be ‘string’.
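
A minimal sketch of the mismatch and of the dtype fix described above. This is not the actual code of scripts/annotate_metadata_with_index.py; the two in-memory TSV strings, the `length` column, and the `read_tsv` helper are hypothetical stand-ins, and it assumes one table keeps strain as text while the other lets pandas infer integers from digit-only names.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the metadata and sequence index files.
# The strain names consist of digits only, as in the report above.
METADATA_TSV = "strain\tdate\n165692\t2021-11-20\n165700\t2021-11-20\n"
INDEX_TSV = "strain\tlength\n165692\t29903\n165700\t29850\n"


def read_tsv(text, **kwargs):
    """Read a tab-separated table from an in-memory string."""
    return pd.read_csv(io.StringIO(text), sep="\t", **kwargs)


# Reproduce the mismatch: one side keeps strain as text,
# the other lets pandas infer int64 from the digit-only names.
metadata = read_tsv(METADATA_TSV, dtype={"strain": str})
index = read_tsv(INDEX_TSV)

try:
    metadata.merge(index, on="strain")
    merge_error = None
except ValueError as error:
    merge_error = str(error)

# Fix: read strain as a string on both sides so the merge keys agree.
metadata_fixed = read_tsv(METADATA_TSV, dtype={"strain": "string"})
index_fixed = read_tsv(INDEX_TSV, dtype={"strain": "string"})
merged = metadata_fixed.merge(index_fixed, on="strain")
```

With the dtype forced on both reads, digit-only names such as 165692 stay as text and the merge no longer depends on what pandas infers from the file contents.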

leocaserta · May 15, 2022, 7:21pm

Indeed, my strain names were composed only of numbers. I changed that, but now I get the following error:

ERROR: All samples have been dropped! Check filter rules and metadata file format.
329 strains were dropped during filtering
165 had no metadata
164 of these were dropped by --exclude-all
164 strains were added back because they were in results/WTD-NY/sample-all.txt
[Sun May 15 15:16:14 2022]
Error in rule combine_samples:
jobid: 30
output: results/WTD-NY/WTD-NY_subsampled_sequences.fasta.xz, results/WTD-NY/WTD-NY_subsampled_metadata.tsv.xz
log: logs/subsample_regions_WTD-NY.txt (check log file(s) for error message)
conda-env: /local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8
shell:

    augur filter             --sequences results/aligned_WTD-test.fasta.xz             --metadata results/sanitized_metadata_WTD-test.tsv.xz             --exclude-all             --include results/WTD-NY/sample-all.txt             --output-sequences results/WTD-NY/WTD-NY_subsampled_sequences.fasta.xz             --output-metadata results/WTD-NY/WTD-NY_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_WTD-NY.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job combine_samples since they might be corrupted:
results/WTD-NY/WTD-NY_subsampled_sequences.fasta.xz, results/WTD-NY/WTD-NY_subsampled_metadata.tsv.xz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/log/2022-05-15T151607.817823.snakemake.log

And here is the builds section of the builds.yaml file:

builds:
  # Focus on New York State (division)
  # with a build name that will produce the following URL fragment on Nextstrain/auspice:
  # /ncov/north-america/usa/new-york
  WTD-NY: # name of the build; this can be anything
    subsampling_scheme: custom-county # use a custom subsampling scheme defined below
    region: North America
    country: USA
    # Whatever your finest geographic scale is (here, 'location' since we are doing a county in the USA)
    # list 'up' from here the geographic area that location is in.

leocaserta · May 15, 2022, 7:51pm

It’s strange because I have only 164 samples and metadata rows, but the error says 329:

ERROR: All samples have been dropped! Check filter rules and metadata file format.
329 strains were dropped during filtering
165 had no metadata
164 of these were dropped by --exclude-all

The format of my strain column is for example:
WDC/165692/2021
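
The thread does not resolve this second error. One possible reading of the counts (165 + 164 = 329) is that the names in the sequence file and the values in the strain column do not match each other, so they are counted separately; that is an assumption, not something confirmed here. A minimal sketch for checking it, using hypothetical in-memory data in place of the real (xz-compressed) FASTA and metadata files, with `fasta_names` and `compare_names` as illustrative helper names:

```python
import io

import pandas as pd


def fasta_names(handle):
    """Collect sequence names: the text after '>' up to the first whitespace."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }


def compare_names(fasta_handle, metadata_handle):
    """Return (names only in the FASTA, names only in the metadata)."""
    sequences = fasta_names(fasta_handle)
    metadata = pd.read_csv(metadata_handle, sep="\t", dtype={"strain": "string"})
    strains = set(metadata["strain"].dropna())
    return sorted(sequences - strains), sorted(strains - sequences)


# Hypothetical example: the second name differs between the two files.
fasta = io.StringIO(">WDC/165692/2021\nACGT\n>WDC_165700_2021\nACGT\n")
meta = io.StringIO(
    "strain\tdate\n"
    "WDC/165692/2021\t2021-11-20\n"
    "WDC/165700/2021\t2021-11-20\n"
)
only_in_fasta, only_in_metadata = compare_names(fasta, meta)
print("only in FASTA:", only_in_fasta)
print("only in metadata:", only_in_metadata)
```

If both lists come back empty for the real files, the names agree and the cause lies elsewhere.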
