I noticed a difference in the clade assignment between the influenza references.
I have 7 sequences that are assigned to 3C.2a1a with reference A/Wisconsin/67/2005 (CY163680) but to 3C.2a1 with reference A/Darwin/6/2021 (EPI1857216).
These 7 samples have the mutations: HA1:N121K + HA1:T135K + HA2:G150E attributed to clade 3C.2a1a.
I see the same problem with a sample that is assigned to 3C.2a4 with ref A/Wisconsin/67/2005 (CY163680) but becomes 3C.2a with ref A/Darwin/6/2021 (EPI1857216).
Which of the two references gives the correct clade?
Lorlane Le Targa
Thank you for reaching out with this question and apologies for the delay in getting back to you.
This is a great question. The 3C.2a1(a) clade was most common in 2016-2019 (see the flu/seasonal/h3n2/ha/6y build).
The Nextclade dataset with reference A/Darwin/6/2021 (EPI1857216) is more focused on recent diversity and doesn’t have a lot of sequences from before ~2020, hence 3C.2a1a is missing there.
3C.2a1a is present in the dataset with the older reference.
So the answer to your question “which is correct” is likely 3C.2a1a. If you’re working with older sequences, it’s best to work with the older reference.
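If you want to see which samples are affected by the choice of reference, one option is to run Nextclade once per dataset and compare the two TSV outputs. Below is a minimal sketch of that comparison, assuming the results are in Nextclade's tab-separated format with the `seqName` and `clade` columns. The function name `discordant_clades` and the sample names are hypothetical, and the in-memory tables stand in for your real output files.

```python
import csv
import io


def discordant_clades(tsv_a, tsv_b):
    """Return {seqName: (clade_a, clade_b)} for sequences whose clade differs.

    tsv_a and tsv_b are open text handles to Nextclade TSV results.
    Only sequences present in both tables are compared.
    """
    def load(handle):
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["seqName"]: row["clade"] for row in reader}

    a = load(tsv_a)
    b = load(tsv_b)
    shared = sorted(a.keys() & b.keys())
    return {name: (a[name], b[name]) for name in shared if a[name] != b[name]}


# Hypothetical example data; replace with open("...tsv") handles for real runs.
wisconsin_tsv = io.StringIO(
    "seqName\tclade\n"
    "sample_1\t3C.2a1a\n"
    "sample_2\t3C.2a4\n"
    "sample_3\t3C.2a1b\n"
)
darwin_tsv = io.StringIO(
    "seqName\tclade\n"
    "sample_1\t3C.2a1\n"
    "sample_2\t3C.2a\n"
    "sample_3\t3C.2a1b\n"
)

diff = discordant_clades(wisconsin_tsv, darwin_tsv)
for name, (old_ref_clade, new_ref_clade) in diff.items():
    print(f"{name}: {old_ref_clade} (older reference) vs {new_ref_clade} (newer reference)")
```

Samples that show up in this list with a more specific clade under the older reference are probably the ones where the newer dataset lacks the relevant older diversity.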
Your question raises a good point that it’s not obvious which references are best for which data - we should do better to point out limitations of the different datasets.
Do reach out in the future if you have other questions regarding Nextclade usage and interpretation. This is very helpful!