Ncov: Errors from combine_metadata.py due to unexpected behavior in sanitize_metadata.py

Since this is different from the original issue, I’ve updated the title of this discussion post from

ValueError: DataFrame index must be unique for orient=‘index’

to

Ncov: Errors from combine_metadata.py due to unexpected behavior in sanitize_metadata.py

which captures both issues.

Is this command OK?

        python3 scripts/sanitize_metadata.py \
            --metadata data/SC2_weekly/Gisaid.metadata.tsv \
            --metadata-id-columns strain name 'Virus name' \
            --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession \
            --parse-location-field Location \
            --rename-fields \
                'Virus name=strain' \
                Type=type \
                'Accession ID=gisaid_epi_isl' \
                'Collection date=date' \
                'Additional location information=additional_location_information' \
                'Sequence length=length' \
                Host=host \
                'Patient age=patient_age' \
                Gender=sex \
                Clade=GISAID_clade \
                'Pango lineage=pango_lineage' \
                pangolin_lineage=pango_lineage \
                Lineage=pango_lineage \
                'Pangolin version=pangolin_version' \
                Variant=variant \
                'AA Substitutions=aaSubstitutions' \
                'Submission date=date_submitted' \
                'Is reference?=is_reference' \
                'Is complete?=is_complete' \
                'Is high coverage?=is_high_coverage' \
                'Is low coverage?=is_low_coverage' \
                N-Content=n_content \
                GC-Content=gc_content \
            --strip-prefixes hCoV-19/ SARS-CoV-2/ \
            --output results/sanitized_metadata_gisaid.tsv.xz 2>&1 | tee logs/sanitize_metadata_gisaid.txt

I’m not sure, but the new version of the sanitize metadata script also seems to produce a lot of wrong “strain” names from GISAID. For example, my sanitized GISAID metadata has about twice as many lines as the input metadata, and there are a lot of “new” strain names with empty metadata. Here’s a sample of all strain names matching the string “67020”:

   ...1                                     
   <chr>                                    
 1 Australia/SA467020/2022                  
 2 67020                                    
 3 Japan/PG-367020/2022                     
 4 USA/WA-CDC-UW22092670209/2022            
 5 67020                                    
 6 67020                                    
 7 USA/WA-UW-22082367020/2022               
 8 Russia/MOW-PMVL-DZ-44266702040/2022      
 9 67020                                    
10 67020                                    
11 Japan/TKYkbm67020/2022                   
12 67020                                    
13 67020                                    
14 Ireland/CW-NVRL-ecS22IRL00467020/2022    
15 67020                                    
16 67020                                    
17 67020                                    
18 Luxembourg/LNS7670209/2023               
19 Ireland/CO-NVRL-ecS22IRL00367020/2022    
20 67020                                    
21 France/NAQ-HCL722003670201/2022          
22 Denmark/DCGC-567020/2022                 
23 67020                                    
24 67020                                    
25 67020                                    
26 Germany/BW-RKI-I-967020/2022             
27 67020                                    
28 USA/NC-CDC-LC0867020/2022                
29 67020                                    
30 67020                                    
31 67020                                    
32 Brazil/PR-NVBS23614GENOV829670206686/2022
33 Reunion/ChuReu-722070670201/2022         
34 67020                                    
35 67020                                    
36 67020                                    
37 Japan/PG-467020/2023                     
38 67020                                    
39 67020                                    
40 USA/NM-CDC-813670208/2023                
41 67020                                    
42 Canada/QC-L00546702001/2022              
43 67020                                    
44 Scotland/EDB67020/2022                   
45 Australia/VIC67020/2022                  
46 67020                                    
47 Belgium/12212670201/2022                 
48 France/ARA-cerba-22Q0670203/2022         
49 Canada/QC-L00506702001/2022              
50 67020 

In fact, the number of rows in the sanitized GISAID metadata is exactly twice that of the original metadata.

It also seems like the splitting of the Location column goes wrong now. The country and region columns, for example, are empty in the sanitized metadata.
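
For anyone who wants to check their own files for the same two symptoms (doubled row count, empty location columns), here is a minimal sketch using only the Python standard library. It is not part of the ncov workflow: the function names `open_text` and `summarize_metadata` are made up for illustration, and the `strain`, `region`, and `country` column names are assumptions based on the command and output described above.

```python
import csv
import lzma


def open_text(path):
    """Open a plain or xz-compressed text file for reading."""
    if str(path).endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "rt", encoding="utf-8", newline="")


def summarize_metadata(path, id_column="strain", columns=("region", "country")):
    """Summarize a metadata TSV.

    Returns a dict with:
      rows          number of data rows (header excluded)
      id_only_rows  rows where every field except the ID column is empty
      filled        per-column count of non-empty values
                    (a column missing from the header stays at 0)
    """
    rows = 0
    id_only_rows = 0
    filled = {column: 0 for column in columns}
    with open_text(path) as handle:
        # QUOTE_NONE: treat quote characters in the metadata as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        # If the ID column is absent, -1 matches no field index, so
        # id_only_rows then counts rows that are completely empty.
        id_index = header.index(id_column) if id_column in header else -1
        column_index = {c: header.index(c) for c in columns if c in header}
        for fields in reader:
            rows += 1
            if not any(v.strip() for i, v in enumerate(fields) if i != id_index):
                id_only_rows += 1
            for column, index in column_index.items():
                if index < len(fields) and fields[index].strip():
                    filled[column] += 1
    return {"rows": rows, "id_only_rows": id_only_rows, "filled": filled}


# Example usage with the paths from the command above:
# before = summarize_metadata("data/SC2_weekly/Gisaid.metadata.tsv",
#                             id_column="Virus name", columns=())
# after = summarize_metadata("results/sanitized_metadata_gisaid.tsv.xz")
# print(before["rows"], after["rows"], after["id_only_rows"], after["filled"])
```

If the sanitized file shows twice as many rows as the input, a large `id_only_rows` count, and zeros under `filled`, that would match what I described above.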

Hi @jonr, you’re right, I introduced another bug when making the earlier fix. I just created a PR on the GitHub repo to fix this new issue. You can pull from the victorlin/fix-sanitize-metadata branch if you need to use it before it gets merged.

Awesome! Thanks again Victor

I ran from your branch, @victorlin, and it seems to work now. At least the sanitizing, combining, and filtering have all run smoothly, and I’m now generating the trees.
Thanks

@jonr I’m glad it worked! I’ve just merged the changes into the master branch and deleted my temporary branch that you pulled from, so you can run from the default branch now.

Apologies for the troubles here!
