Hi, I apologize for asking about a topic that is only marginally related to Nextstrain, but this is the only forum on SARS-CoV-2 phylogenomics that I’m aware of (please point me to more appropriate places if such exist).
Could anyone please explain how Pangolin designations work? In particular, for the AY.* sublineages of Delta, there is a list of defining mutations (New AY lineages and an update to AY.4-AY.12 – Pango Network), yet actual assignments do not always correspond well to these definitions (e.g., Possibly wrong assignments of AY.4 · Issue #221 · cov-lineages/pango-designation · GitHub). How does that happen? Is it that, once a novel clade defined by certain mutation(s) is discovered, the Pango decision tree is trained on that clade and may pick up on artefacts that happen to be present in those sequences but are unrelated to the defining mutations?
It seems that the Pango team eventually corrects these mistakes (a couple of weeks ago, AY.12 seemed to be dominating a lot of the data, but that seems to have been an artefact that later got fixed), but it takes a while.
Does anyone have experience dealing with these situations where GISAID seems to be assigning lineages too broadly (e.g., for the purposes of tracking the growth of different sublineages of Delta)? What do you do? Do you trust the current Pangolin assignments? Do you have custom filters?
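To make the last question concrete, the kind of custom filter I have in mind is roughly the Python sketch below: keep a sequence only if it carries every defining mutation listed for its assigned lineage. Everything in it is an assumption for illustration: the lineage name "AY.X" and the mutation strings are placeholders (not the real Pango definitions), the record layout is made up, and the function names are hypothetical.

```python
# Placeholder definitions: "AY.X" and its mutations are NOT real Pango
# definitions; substitute the published defining mutations per lineage.
DEFINING_MUTATIONS = {
    "AY.X": {"S:A1B", "ORF1a:C2D"},
}


def consistent_with_lineage(assigned_lineage, observed_mutations,
                            definitions=DEFINING_MUTATIONS):
    """Return True if every defining mutation of the assigned lineage is observed.

    Lineages with no entry in `definitions` are not checked and pass through.
    """
    required = definitions.get(assigned_lineage)
    if required is None:
        return True
    # Set inclusion: all required mutations must be among the observed ones.
    return required <= set(observed_mutations)


def filter_records(records, definitions=DEFINING_MUTATIONS):
    """Keep only records whose mutations match their assigned lineage.

    Each record is assumed to be a dict with "lineage" and "mutations" keys.
    """
    return [
        r for r in records
        if consistent_with_lineage(r["lineage"], r["mutations"], definitions)
    ]
```

A check this strict would also discard genuine members of a lineage that have a dropout or low coverage at one of the defining sites, so it is probably too conservative as written.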