Ncov: Errors from combine_metadata.py due to unexpected behavior in sanitize_metadata.py

Hi,

Lately I have been getting this error and I don’t know how to troubleshoot it. I combine two datasets in my build, and as far as I know the same code worked previously…

Jon

Job 123:
        Combining metadata files results/sanitized_metadata_gisaid.tsv.xz results/sanitized_metadata_BN.tsv.xz -> results/combined_metadata.tsv.xz and adding columns to represent origin

        python3 scripts/combine_metadata.py --metadata results/sanitized_metadata_gisaid.tsv.xz results/sanitized_metadata_BN.tsv.xz --origins gisaid BN --output results/combined_metadata.tsv.xz 2>&1 | tee logs/combine_input_metadata.txt
Traceback (most recent call last):
  File "/home/jonr/Prosjekter/Nextstrain_mamba/ncov/scripts/combine_metadata.py", line 50, in <module>
    data = data.to_dict(orient="index")
  File "/home/jonr/.nextstrain/runtimes/conda/env/lib/python3.10/site-packages/pandas/core/frame.py", line 2063, in to_dict
    raise ValueError("DataFrame index must be unique for orient='index'.")
ValueError: DataFrame index must be unique for orient='index'.
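(For reference, this pandas error can be reproduced in isolation; a minimal sketch with made-up strain names:)

    import pandas as pd

    # to_dict(orient="index") requires a unique index; a repeated index
    # value raises the same ValueError as in the traceback above.
    df = pd.DataFrame({"date": ["2022-01-01", "2022-01-02"]},
                      index=["strainA", "strainA"])
    df.to_dict(orient="index")  # ValueError: DataFrame index must be unique...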

This might be related to the Conda environment. I switched to using the Docker implementation of Nextstrain and it seems to work now…

Thanks, Jon, for raising this issue. You said it got fixed by switching from conda to docker. To help us figure out the root cause (and hence fix it for you and others), could you tell us a bit more about your setup?

What pandas version were you using when you got the error? Pandas 2.0 was just released; it could be that the script is not compatible with it.

Some further info that could help us:
Mac or Linux?
Do you manage the conda environment yourself or through the Nextstrain CLI?
How do you invoke the build: with snakemake directly, or with nextstrain build? If the latter, with the --ambient, --conda, or --docker flag?

Thanks a lot!

Hi Cornelius,
I’m not sure I can answer everything, but I use the Nextstrain CLI and invoke the build with nextstrain build. I installed Nextstrain fresh a few months ago and it worked fine until I suddenly got this error. I didn’t specify --docker or --conda, so I assume I used docker. After I got the error I switched to conda with --conda (yesterday) and got the same error. I then realized that I could update the docker images, and when I did so (yesterday) it worked again.

How can I check the pandas version with the Nextstrain CLI? Unfortunately, I don’t think I can go back and check the versions that produced the error…


By the way,
I also get a ValueError when running an MPX build (this is with the docker runtime that worked for SARS-CoV-2). I don’t know if it is related to the error above or if I just did something wrong, as this is the first time I’ve analyzed MPX. The error happened during fix_tree.py:

Job 5: Building tree
Reason: Missing output files: results/hmpxv1/tree_fixed.nwk; Input files updated by another job: results/hmpxv1/tree_raw.nwk, results/hmpxv1/masked.fasta


        python3 scripts/fix_tree.py \
            --alignment results/hmpxv1/masked.fasta \
            --input-tree results/hmpxv1/tree_raw.nwk \
            --output results/hmpxv1/tree_fixed.nwk
        

0.00	-TreeAnc: set-up
32.33	-SequenceData: loaded alignment.
32.33	-SeqData: making compressed alignment...
78.29	-SequenceData: constructed compressed alignment...
90.48	-TreeAnc.optimize_tree: sequences...
90.48	-TreeAnc.infer_ancestral_sequences with method: probabilistic, joint
90.48	WARNING: Previous versions of TreeTime (<0.7.0) RECONSTRUCTED sequences of
     	tips at positions with AMBIGUOUS bases. This resulted in unexpected
     	behavior is some cases and is no longer done by default. If you want to
     	replace those ambiguous sites with their most likely state, rerun with
     	`reconstruct_tip_states=True` or `--reconstruct-tip-states`.
90.48	--TreeAnc._ml_anc_joint: type of reconstruction: Joint
179.51	-TreeAnc.optimize_branch_length: running branch length optimization using jointML ancestral sequences
191.45	-TreeAnc.prune_short_branches: pruning short branches (max prob at zero)...
194.08	-TreeAnc.infer_ancestral_sequences with method: probabilistic, joint
194.08	--TreeAnc._ml_anc_joint: type of reconstruction: Joint
262.32	--TreeAnc.optimize_tree: Iteration 1. #Nuc changed since prev reconstructions: 3952
262.32	-TreeAnc.optimize_branch_length: running branch length optimization using jointML ancestral sequences
272.08	-TreeAnc.prune_short_branches: pruning short branches (max prob at zero)...
272.08	-TreeAnc.infer_ancestral_sequences with method: probabilistic, joint
272.08	--TreeAnc._ml_anc_joint: type of reconstruction: Joint
339.99	--TreeAnc.optimize_tree: Iteration 2. #Nuc changed since prev reconstructions: 0
340.01	--TreeAnc.optimize_tree: Unconstrained sequence LH:-3627565.628224
### Checking for immediate reversions

Below NODE_0000000: ('G', 162638, 'T') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('G', 168184, 'C') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('G', 28709, 'T') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('C', 172303, 'G') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('A', 47734, 'T') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('A', 186076, 'C') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('A', 4835, 'G') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('C', 176714, 'A') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('A', 14430, 'C') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('C', 4690, 'A') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('A', 158786, 'G') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000000: ('G', 173473, 'T') in NODE_0000002 reverted in NODE_0000023
Below NODE_0000011: ('A', 7005, 'T') in NODE_0000014 reverted in NODE_0000015
Below NODE_0000023: ('A', 6701, 'C') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('T', 6697, 'C') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('T', 6580, 'C') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('G', 6854, 'T') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('T', 6873, 'G') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('G', 6698, 'T') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('T', 6687, 'C') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('T', 6835, 'A') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('G', 6601, 'C') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('T', 6828, 'C') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('G', 6686, 'A') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('A', 6836, 'G') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('C', 6646, 'A') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000023: ('G', 6809, 'A') in NODE_0000024 reverted in NODE_0000026
Below NODE_0000024: ('C', 6994, 'G') in NODE_0000026 reverted in NODE_0000027
Below NODE_0000024: ('T', 6975, 'A') in NODE_0000026 reverted in NODE_0000027
Below NODE_0000024: ('C', 6993, 'A') in NODE_0000026 reverted in NODE_0000027
Below NODE_0000024: ('T', 6935, 'C') in NODE_0000026 reverted in NODE_0000027
Below NODE_0000024: ('G', 6982, 'T') in NODE_0000026 reverted in NODE_0000027
Traceback (most recent call last):
  File "/nextstrain/build/scripts/fix_tree.py", line 63, in <module>
    reversion["child"].clades.remove(reversion["grandchild"])
ValueError: list.remove(x): x not in list
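The final ValueError itself is plain Python list behavior; a minimal sketch of the failure mode (node names are illustrative):

    # list.remove() raises if the item is absent, e.g. if the grandchild
    # clade was already detached from the child in an earlier iteration.
    clades = ["NODE_A", "NODE_B"]
    clades.remove("NODE_C")  # ValueError: list.remove(x): x not in list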


@jonr Running nextstrain version --verbose will report some output that’s useful to us and will answer some of @corneliusroemer’s questions.
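A more direct check of the pandas version inside the runtime is also possible with nextstrain shell (a sketch; run from your build directory):

    # Start an interactive shell inside the Nextstrain runtime...
    nextstrain shell .
    # ...then, inside that shell, ask Python for the pandas version:
    python3 -c "import pandas; print(pandas.__version__)"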


Actually, now I get the same error with the docker version as well, but this time on a different dataset…
Error message:

Traceback (most recent call last):
  File "/nextstrain/build/scripts/combine_metadata.py", line 50, in <module>
    data = data.to_dict(orient="index")
  File "/usr/local/lib/python3.10/site-packages/pandas/core/frame.py", line 2063, in to_dict
    raise ValueError("DataFrame index must be unique for orient='index'.")
ValueError: DataFrame index must be unique for orient='index'.

Nextstrain versions:

nextstrain.cli 6.2.1

Python
  /home/jonr/.nextstrain/cli-standalone/nextstrain
  3.10.9 (main, Dec 21 2022, 04:02:04) [Clang 14.0.3 ]

Runners
  docker (default)
    nextstrain/base:build-20230411T103027Z (fe870c159275, 2023-04-11 13:15:42 +0200 CEST)
    augur 21.1.0
    auspice v2.45.2
    fauna e3ed8e1
    sacra not present

  conda 
/usr/bin/env: ‘node’: No such file or directory
    nextstrain-base 20230407T195218Z (hb0f4dca_1_locked, nextstrain)
    augur 21.1.0

  singularity 
    docker://nextstrain/base (not present)

  ambient 
    unknown

  aws-batch 
    unknown


@jonr thanks for the update. With this information, we can rule out Pandas 2.0 as the cause of the error, since that image uses pandas 1.5.3 on both platform variants:

$ docker run --rm nextstrain/base:build-20230411T103027Z python3 -c "from importlib.metadata import version; print(version(\"pandas\"))"
1.5.3

The error strongly hints that there is a duplicate among the values in the ID columns once the metadata files are combined. This can happen if your gisaid and BN input metadata files have some overlap in rows when they should not.
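A quick way to check for such an overlap outside the workflow, assuming the ID column is strain (a sketch; pandas reads .xz-compressed TSVs directly):

    import pandas as pd

    # Load both sanitized metadata files and intersect their ID columns.
    gisaid = pd.read_csv("results/sanitized_metadata_gisaid.tsv.xz", sep="\t")
    bn = pd.read_csv("results/sanitized_metadata_BN.tsv.xz", sep="\t")
    overlap = set(gisaid["strain"]) & set(bn["strain"])
    print(f"{len(overlap)} strain(s) appear in both files")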


Thanks for your feedback, Victor.
I compared “Virus names” in the two datasets and I can’t find any duplicates…
I’ll try deleting the sanitized metadata for the two datasets and running again.

I still get the same error.
Is it possible that this error comes from something other than duplicated rows in the two datasets?
I can try to regenerate the data, but since I use the whole GISAID database it takes a while…

Hmm, “Virus name” isn’t a column that is used as an ID column. Can you try the following steps to check for duplicates?

  1. Open a command prompt at your ncov directory.

  2. Run nextstrain shell . to start an interactive shell for the Nextstrain runtime (for direct access to Augur).

  3. Run python to start an interactive Python shell.

  4. Run the following lines in Python:

    from augur.io import read_metadata
    df = read_metadata("results/sanitized_metadata_gisaid.tsv.xz")
    print(df[df.index.duplicated()].index)  # This should print any duplicates along with the inferred ID column name.
    
  5. Run the same Python lines for results/sanitized_metadata_BN.tsv.xz, swapping the filename in the 2nd line.


Not that I can think of. From the pandas source code, it’s clear that this error is raised exactly when the DataFrame index is not unique.

In your case, the error comes from calling to_dict() on a pandas DataFrame returned by augur.io.read_metadata(). That function sets the index based on the inferred ID column. Since combine_metadata.py does not customize the id_columns parameter, the inferred ID column is either strain or name, whichever comes first in the metadata file.

The way I see it, this also means that there is a duplicate within one of your metadata files (not just after the two files are combined).
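A quick sketch of verifying this directly, using the same read_metadata helper that combine_metadata.py calls:

    from augur.io import read_metadata

    # read_metadata sets the DataFrame index to the inferred ID column
    # ("strain" or "name", whichever appears first in the file).
    df = read_metadata("results/sanitized_metadata_gisaid.tsv.xz")
    print(df.index.name)       # which ID column was inferred
    print(df.index.is_unique)  # False here would explain the ValueError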

Victor,
There do indeed seem to be many duplicated entries in the GISAID data:

>>> print(df[df.index.duplicated()].index)
Index(['USA/OR-OHSU-223490357/2022', 'USA/OR-OHSU-223570088/2022',
       'USA/OR-OHSU-223490477/2022', 'USA/OR-OHSU-223490467/2022',
       'USA/OR-OHSU-230620066/2022', 'USA/OR-OHSU-223570125/2022',
       'USA/OR-OHSU-230530279/2022', 'USA/OR-OHSU-223570098/2022',
       'USA/OR-OHSU-230620038/2022', 'USA/OR-OHSU-223570152/2022',
       ...
       'USA/OR-OHSU-230620054/2022', 'USA/OR-OHSU-223490476/2022',
       'USA/OR-OHSU-223570191/2022', 'USA/OR-OHSU-223570190/2022',
       'USA/OR-OHSU-223570068/2022', 'USA/OR-OHSU-230620019/2022',
       'USA/OR-OHSU-223490434/2022', 'USA/OR-OHSU-223570149/2022',
       'USA/OR-OHSU-223570123/2022', 'USA/OR-OHSU-223490437/2022'],
      dtype='object', name='strain', length=267)

And from the second dataset (BN):

>>> df = read_metadata("results/sanitized_metadata_BN.tsv.xz")
>>> print(df[df.index.duplicated()].index)
Index([], dtype='object', name='strain')

But I thought that duplicated entries were removed during the “sanitize” step? I never had any problems using the GISAID database like this before.

Thanks for your help,


Hi @jonr, that’s helpful information. I think what you’re seeing is related to this bug with the sanitize_metadata.py script. I am looking into it now and will post here with any updates.

As an immediate workaround, you could try modifying the workflow to run the script twice (as mentioned in the linked issue), though this is untested and I’m unsure whether running the script twice has other implications (see the sketch below).
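A rough sketch of what that second pass might look like on the command line; the --metadata/--output flag names mirror the workflow’s existing invocation and are assumptions, so copy the exact flags from your build’s sanitize rule:

    python3 scripts/sanitize_metadata.py \
        --metadata results/sanitized_metadata_gisaid.tsv.xz \
        ...remaining flags copied from your build's sanitize rule... \
        --output results/sanitized_metadata_gisaid_2x.tsv.xz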

Ok, thanks a lot.
I removed all duplicates in the “Virus name” column before running the pipeline, but the problem still arose. So there are probably more duplicates created after names are trimmed, or something like what’s described in the bug you linked to (see the sketch below). I will sanitize twice and see what happens.
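(For illustration, here’s how trimming can turn distinct raw names into duplicates, assuming a prefix-strip like the hCoV-19/ one the sanitize step performs:)

    import pandas as pd

    # Two distinct raw names collide once the "hCoV-19/" prefix is stripped.
    names = pd.Series(["hCoV-19/USA/OR-1/2022", "USA/OR-1/2022"])
    stripped = names.str.replace("hCoV-19/", "", regex=False)
    print(stripped.duplicated().any())  # True -- distinct inputs now collide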

Running the sanitize metadata script twice did the trick. At least the combine metadata step worked and the pipeline moved on to filtering.
Thanks for your help, everyone!
Thanks for your help everyone!

The latest version of the workflow on the master branch of nextstrain/ncov has been updated with a fix, so there should be no need to run sanitize_metadata.py a second time.

Great! I’ll try it out on a large GISAID dataset tomorrow.
Thanks a lot for your help.

Dear @victorlin,
I did nextstrain update and ran the analysis again on a newly downloaded dataset from GISAID. I now get this error:

[Tue Apr 25 17:23:31 2023]
Job 14: 
        Combining metadata files results/sanitized_metadata_gisaid.tsv.xz results/sanitized_metadata_BN.tsv.xz -> results/combined_metadata.tsv.xz and adding columns to represent origin
        
Reason: Input files updated by another job: results/sanitized_metadata_BN.tsv.xz, results/sanitized_metadata_gisaid.tsv.xz


        python3 scripts/combine_metadata.py --metadata results/sanitized_metadata_gisaid.tsv.xz results/sanitized_metadata_BN.tsv.xz --origins gisaid BN --output results/combined_metadata.tsv.xz 2>&1 | tee logs/combine_input_metadata.txt
        
Traceback (most recent call last):
  File "/nextstrain/build/scripts/combine_metadata.py", line 47, in <module>
    data = read_metadata(fname)
  File "/nextstrain/augur/augur/io/metadata.py", line 87, in read_metadata
    raise Exception(f"None of the possible id columns ({id_columns!r}) were found in the metadata's columns {tuple(chunk.columns)!r}")
Exception: None of the possible id columns (('strain', 'name')) were found in the metadata's columns ('Unnamed: 0', 'Last vaccinated', 'Passage details/history', 'type', 'gisaid_epi_isl', 'date', 'additional_location_information', 'length', 'host', 'patient_age', 'sex', 'GISAID_clade', 'pango_lineage', 'Pango version', 'variant', 'aaSubstitutions', 'date_submitted', 'is_reference', 'is_complete', 'is_high_coverage', 'is_low_coverage', 'n_content', 'gc_content', 'region', 'country', 'division', 'location')
[Tue Apr 25 17:23:33 2023]
Error in rule combine_input_metadata:
    jobid: 14
    input: results/sanitized_metadata_gisaid.tsv.xz, results/sanitized_metadata_BN.tsv.xz
    output: results/combined_metadata.tsv.xz
    log: logs/combine_input_metadata.txt (check log file(s) for error details)
    conda-env: /nextstrain/build/.snakemake/conda/96b3e15cea072093949ef6194e50cdb3_
    shell:
        
        python3 scripts/combine_metadata.py --metadata results/sanitized_metadata_gisaid.tsv.xz results/sanitized_metadata_BN.tsv.xz --origins gisaid BN --output results/combined_metadata.tsv.xz 2>&1 | tee logs/combine_input_metadata.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

These are the column names in the sanitized_gisaid metadata (NB: notice what seems to be an empty first column?):

	Last vaccinated	Passage details/history	type	gisaid_epi_isl	date	additional_location_information	length	host	patient_age	sex	GISAID_clade	pango_lineage	Pango version	variant	aaSubstitutions	date_submitted	is_reference	is_complete	is_high_coverage	is_low_coverage	n_content	gc_content	region	country	division	location

And these are the column names in the original metadata:

Virus name	Last vaccinated	Passage details/history	Type	Accession ID	Collection date	Location	Additional location information	Sequence length	Host	Patient age	Gender	Clade	Pango lineage	Pango version	Variant	AA Substitutions	Submission date	Is reference?	Is complete?	Is high coverage?	Is low coverage?	N-Content	GC-Content

From the GISAID metadata, I guess that the “Virus name” column should be the equivalent of either “strain” or “name”?

Hi @jonr,

It looks like sanitize_metadata.py is renaming Virus name to an empty string, when I would expect it to be renamed to strain, assuming rename_fields uses the default. I can only reproduce this behavior by setting "Virus name=" in rename_fields.
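That would also explain the “Unnamed: 0” column in your output above: pandas assigns that name when a header cell is empty. A minimal sketch:

    import io
    import pandas as pd

    # A TSV whose first header cell is empty comes back from pandas with
    # that column auto-named "Unnamed: 0".
    tsv = "\tdate\nsample1\t2022-01-01\n"
    df = pd.read_csv(io.StringIO(tsv), sep="\t")
    print(df.columns.tolist())  # ['Unnamed: 0', 'date']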

Can you send me the output that shows the parameters passed to this script? It should start with:

python3 scripts/sanitize_metadata.py --metadata …

– Victor