Clock Deviation

Hello, I’m getting some odd results from my newest runs of Nextstrain, and I’m wondering if a bad setting somewhere is causing this. My setup uses the newest versions of the Nextstrain CLI and related programs, but some of my profile files are left over from previous versions, so it’s possible that’s what’s tripping me up.

My basic setup is that I have ~70,000 sequences with metadata, and I’m trying to pull roughly 12,000 of these at random, spanning the beginning of the pandemic to now, and make a Nextstrain build from them. This always worked before, but now my results only show ~7,000 samples.

Digging into my results files a bit, it appears the remaining ~5,000 subsampled strains are getting thrown out because of clock_deviation. If I filter metadata_with_nextstrain_qc.tsv to only show strains with a clock_deviation below -20, roughly 5,000 samples are shown. This seems to be an error. As I understand it, clock_deviation is meant to filter out samples with bad date data, like Omicron samples that are incorrectly labelled January 2021 instead of January 2022, using the expected clock frequency of mutations over time. But I can’t believe that roughly 40% of my subsample is correctly being flagged as wrong. These are all GISAID samples, and a cursory glance through the dates suggests they are fine.

If I sort by clock_deviation, the “worst” offenders are all 20J (Gamma, V3) samples, for some reason.
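For anyone who wants to reproduce the check, here is a rough sketch of how I'm filtering and tallying the flagged rows. It only uses Python's standard library. The clock_deviation column and the -20 cutoff are what I described above; the Nextstrain_clade column name is my assumption about the metadata header, and the sample rows are made up purely so the snippet runs on its own.

```python
import csv
import io
from collections import Counter


def flagged_by_clade(tsv_file, threshold=-20.0, clade_column="Nextstrain_clade"):
    """Count rows with clock_deviation below `threshold`, grouped by clade.

    `clade_column` is an assumed column name; adjust it to match your header.
    """
    counts = Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        try:
            deviation = float(row["clock_deviation"])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with a missing or non-numeric value
        if deviation < threshold:
            counts[row.get(clade_column) or "unknown"] += 1
    return counts


# Made-up rows for illustration only, not real results.
sample = io.StringIO(
    "strain\tNextstrain_clade\tclock_deviation\n"
    "sample_a\t20J (Gamma, V3)\t-35\n"
    "sample_b\t20J (Gamma, V3)\t-28.5\n"
    "sample_c\t21K (Omicron)\t-3\n"
    "sample_d\t20I (Alpha, V1)\t-22\n"
    "sample_e\t21J (Delta)\t?\n"
)

sample_counts = flagged_by_clade(sample)
for clade, n in sample_counts.most_common():
    print(clade, n)
```

To run it against the real file, open metadata_with_nextstrain_qc.tsv with `open(path, newline="")` and pass the handle to `flagged_by_clade` in place of `sample`.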

My guess is that I have some profile setting somewhere that is causing incorrect calculations of clock_frequency. Any ideas? Thank you for your time.