I’m not certain, but it looks like the new version of the sanitize metadata script also produces a lot of wrong “strain” names from GISAID data. For example, my sanitized GISAID metadata has about twice as many lines as the input metadata, and there are now a lot of “new” strain names with empty metadata. Here’s a sample of all strain names matching the string “67020”:
Hi @jonr, you’re right, I introduced another bug when making the earlier fix. I just created a PR on the GitHub repo to fix this new issue. You can pull from the victorlin/fix-sanitize-metadata branch if you need to use it before it gets merged.
I ran from your branch, @victorlin, and it seems to work now. At least the sanitizing, combining, and filtering have all run smoothly, and I’m now generating the trees.
Thanks
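For anyone who wants to check their own output for the symptom described above (roughly twice as many rows, plus extra strain names with empty metadata), here is a minimal sketch of a before/after comparison. It is not part of the sanitize script. It assumes tab-separated metadata with a column named `strain` in the sanitized file; the function names and the toy rows are made up for illustration.

```python
import csv
import io


def read_rows(handle, delimiter="\t"):
    """Read a delimited metadata table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def summarize(input_rows, sanitized_rows, strain_field="strain"):
    """Compare row counts and list sanitized strains whose other fields are all empty.

    `strain_field` is an assumption; adjust it to match your metadata.
    """
    empty_metadata = []
    for row in sanitized_rows:
        # Skip the strain column itself and DictReader's overflow key (None).
        others = [
            value
            for key, value in row.items()
            if key is not None and key != strain_field
        ]
        # A short row yields None for missing fields, so treat None as empty.
        if all(not (value or "").strip() for value in others):
            empty_metadata.append(row[strain_field])
    return {
        "input_rows": len(input_rows),
        "sanitized_rows": len(sanitized_rows),
        "empty_metadata": empty_metadata,
    }


# Made-up toy data that mimics the symptom: one real record becomes two rows.
input_tsv = "strain\tdate\tregion\nNorway/67020/2021\t2021-03-01\tEurope\n"
sanitized_tsv = (
    "strain\tdate\tregion\n"
    "Norway/67020/2021\t2021-03-01\tEurope\n"
    "67020\t\t\n"
)

report = summarize(
    read_rows(io.StringIO(input_tsv)),
    read_rows(io.StringIO(sanitized_tsv)),
)
print(report)
```

With real files, you would pass `open(path, newline="")` handles to `read_rows` instead of the in-memory strings. If the sanitized row count is much larger than the input count, or `empty_metadata` is non-empty, that would suggest the same problem.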
@jonr I’m glad it worked! I’ve just merged the changes into the master branch and deleted my temporary branch that you pulled from, so you can run from the default branch now.