Value error: trying to merge on object and int64 columns

Hello,
I’m trying to run Nextstrain but I’m receiving the following error message. I wonder if it is something related to the date format. Can someone explain how to fix it?

Traceback (most recent call last):
  File "scripts/annotate_metadata_with_index.py", line 32, in <module>
    metadata.merge(
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/frame.py", line 9345, in merge
    return merge(
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 107, in merge
    op = _MergeOperation(
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
    self._maybe_coerce_merge_keys()
  File "/local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1257, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
[Thu May 12 15:30:08 2022]
Error in rule annotate_metadata_with_index:
    jobid: 23
    output: results/WTD-NY/metadata_with_index.tsv
    log: logs/annotate_metadata_with_index_WTD-NY.txt (check log file(s) for error message)
    conda-env: /local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8
    shell:

        python3 scripts/annotate_metadata_with_index.py             --metadata results/WTD-NY/metadata_with_nextclade_qc.tsv             --sequence-index results/WTD-NY/sequence_index.tsv             --output results/WTD-NY/metadata_with_index.tsv

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Hmm. Could you share a line or two of the results/WTD-NY/metadata_with_nextclade_qc.tsv file (please remove any sensitive information there) to help us see what’s happening?

Sure. Here are the header and the first two rows, opened in Excel:

strain virus date region country division location Nextstrain_clade Nextclade_pango QC_overall_score QC_overall_status divergence missing_data nonACGTN substitutions deletions insertions reversion_mutations potential_contaminants rare_mutations frame_shifts aaSubstitutions QC_missing_data QC_mixed_sites QC_rare_mutations QC_snp_clusters QC_frame_shifts QC_stop_codons clock_deviation
xxxxxx ncov 2021-11-20 North America USA New York xxxxx xxxxx xxxxx 34.958933 mediocre 50 940 0 0 1 16 good good mediocre good good good 8
xxxxxx ncov 2021-11-20 North America USA New York xxxxxx xxxxx xxxxx 473.157822 bad 51 5168 0 0 2 22 bad good mediocre good mediocre good 9

The particular script causing the error merges the metadata and the index on the column strain. From the error message, it seems like you may have values in your strain column that caused it to be interpreted as integers.
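To illustrate the mechanism, here is a minimal sketch that reproduces the same kind of error outside the workflow. The two small tables and their column values are made up for illustration and are not taken from the real files. It assumes one table keeps strain as text while the other lets pandas infer integers from all-digit names.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the metadata and the sequence index.
metadata_tsv = "strain\tdate\n165692\t2021-11-20\n165693\t2021-11-20\n"
index_tsv = "strain\tlength\n165692\t29903\n165693\t29903\n"

# Left table: strain is kept as text (object dtype).
metadata = pd.read_csv(io.StringIO(metadata_tsv), sep="\t", dtype={"strain": object})

# Right table: no dtype given, so all-digit strain names are inferred as int64.
index = pd.read_csv(io.StringIO(index_tsv), sep="\t")

try:
    metadata.merge(index, on="strain")
    error_message = None
except ValueError as error:
    # pandas refuses to merge a text key with an integer key.
    error_message = str(error)

print(index["strain"].dtype)
print(error_message)
```

The exact wording of the message can differ between pandas versions, but the merge fails for the same reason in each case: the key columns have incompatible types.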

This should be an error that we can fix on our end by forcing the dtype of strain to always be ‘string’.
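A rough sketch of what forcing the dtype could look like is below. This is not the actual change to scripts/annotate_metadata_with_index.py. The tables are made-up examples, and the `how="left"` join is an assumption chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the metadata and the sequence index.
metadata_tsv = "strain\tdate\n165692\t2021-11-20\n165693\t2021-11-20\n"
index_tsv = "strain\tlength\n165692\t29903\n165693\t29903\n"

# Read strain as text in both tables so all-digit names are not turned into integers.
strain_dtype = {"strain": "string"}
metadata = pd.read_csv(io.StringIO(metadata_tsv), sep="\t", dtype=strain_dtype)
index = pd.read_csv(io.StringIO(index_tsv), sep="\t", dtype=strain_dtype)

# Both key columns now share a type, so the merge goes through.
merged = metadata.merge(index, on="strain", how="left")
print(merged)
```

With the dtype fixed on both sides, numeric-only strain names are treated as ordinary text and no renaming is needed.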

Indeed, my strain names were composed only of numbers. I changed that, but now I get the following error:

ERROR: All samples have been dropped! Check filter rules and metadata file format.
329 strains were dropped during filtering
165 had no metadata
164 of these were dropped by --exclude-all
164 strains were added back because they were in results/WTD-NY/sample-all.txt
[Sun May 15 15:16:14 2022]
Error in rule combine_samples:
jobid: 30
output: results/WTD-NY/WTD-NY_subsampled_sequences.fasta.xz, results/WTD-NY/WTD-NY_subsampled_metadata.tsv.xz
log: logs/subsample_regions_WTD-NY.txt (check log file(s) for error message)
conda-env: /local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/conda/9f0233e8
shell:

    augur filter             --sequences results/aligned_WTD-test.fasta.xz             --metadata results/sanitized_metadata_WTD-test.tsv.xz             --exclude-all             --include results/WTD-NY/sample-all.txt             --output-sequences results/WTD-NY/WTD-NY_subsampled_sequences.fasta.xz             --output-metadata results/WTD-NY/WTD-NY_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_WTD-NY.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job combine_samples since they might be corrupted:
results/WTD-NY/WTD-NY_subsampled_sequences.fasta.xz, results/WTD-NY/WTD-NY_subsampled_metadata.tsv.xz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /local/workdir/lcc88/Nextstrain_test/ncov/.snakemake/log/2022-05-15T151607.817823.snakemake.log

And here is the builds section of the builds.yaml file:

builds:
  # Focus on New York State (division)
  # with a build name that will produce the following URL fragment on Nextstrain/auspice:
  # /ncov/north-america/usa/new-york
  WTD-NY: # name of the build; this can be anything
    subsampling_scheme: custom-county # use a custom subsampling scheme defined below
    region: North America
    country: USA
    # Whatever your finest geographic scale is (here, 'location' since we are doing a county in the USA)
    # list 'up' from here the geographic area that location is in.

It’s strange because I have only 164 samples and metadata rows, but the error says 329:

ERROR: All samples have been dropped! Check filter rules and metadata file format.
329 strains were dropped during filtering
165 had no metadata
164 of these were dropped by --exclude-all

The format of my strain column is, for example:
WDC/165692/2021
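
The counts in the message add up as 165 + 164 = 329. That would be consistent with the names in the sequence file not matching the names in the metadata strain column, so that each sample is counted once on each side. This is only a guess from the numbers. One way to check is to compare the two sets of names directly, as in the sketch below. The inline FASTA and metadata text are made-up examples, and `fasta_names` is a hypothetical helper that takes a sequence name to be the header text up to the first whitespace.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the sequence file and the metadata file.
fasta_text = ">WDC/165692/2021\nACGT\n>WDC/165693/2021\nACGT\n"
metadata_tsv = (
    "strain\tdate\n"
    "WDC/165692/2021\t2021-11-20\n"
    "WDC_165693_2021\t2021-11-20\n"
)


def fasta_names(handle):
    """Collect sequence names: header text after '>' up to the first whitespace."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


sequence_names = fasta_names(io.StringIO(fasta_text))
metadata = pd.read_csv(io.StringIO(metadata_tsv), sep="\t", dtype={"strain": "string"})
metadata_names = set(metadata["strain"])

# Names present on one side only point to a mismatch between the two files.
only_in_sequences = sorted(sequence_names - metadata_names)
only_in_metadata = sorted(metadata_names - sequence_names)

print("In sequences but not in metadata:", only_in_sequences)
print("In metadata but not in sequences:", only_in_metadata)
```

For real files, the in-memory strings would be replaced with open file handles. If the inputs are .xz compressed like the ones in the command above, they can be opened with `lzma.open(path, "rt")` from the standard library.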