Good morning,
I noticed a difference in the clade assignment between the influenza references.
I have 7 sequences that are assigned to 3C.2a1a with reference A/Wisconsin/67/2005 (CY163680) but to 3C.2a1 with reference A/Darwin/6/2021 (EPI1857216).
These 7 samples carry the mutations HA1:N121K, HA1:T135K and HA2:G150E, which are attributed to clade 3C.2a1a.
I see the same problem with a sample that is assigned to 3C.2a4 with reference A/Wisconsin/67/2005 (CY163680) but to 3C.2a with reference A/Darwin/6/2021 (EPI1857216).
Which of the two references gives the correct clade?
Thank you
Have a nice day,
Lorlane Le Targa
PhD student
Hi Lorlane,
Thank you for reaching out with this question and apologies for the delay in getting back to you.
This is a great question. The 3C.2a1(a) clade was most common in 2016-2019, as the flu/seasonal/h3n2/ha/6y build shows.
The Nextclade dataset with reference A/Darwin/6/2021 (EPI1857216) is more focused on recent diversity and doesn’t contain many sequences from before ~2020, hence 3C.2a1a is missing there.
In contrast, 3C.2a1a is present in the dataset with the older reference A/Wisconsin/67/2005 (CY163680).
So the answer to your question “which is correct” is likely 3C.2a1a. If you’re working with older sequences, it’s best to work with the older reference.
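If it helps to see exactly which sequences are affected, one option is to compare the result tables from the two datasets side by side. The following is only a minimal Python sketch, not an official Nextclade tool: it assumes the TSV output has `seqName` and `clade` columns, and the sample names and inline tables are made up for illustration (in practice you would read your own two result files).

```python
import csv
import io


def read_clades(tsv_text):
    """Map sequence name -> clade from Nextclade-style TSV text.

    Assumes columns named 'seqName' and 'clade' (an assumption about the
    output format; adjust if your files use different headers).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["seqName"]: row["clade"] for row in reader}


def clade_disagreements(tsv_a, tsv_b):
    """Return {seqName: (clade_a, clade_b)} for sequences present in both
    tables whose clade assignments differ."""
    a = read_clades(tsv_a)
    b = read_clades(tsv_b)
    shared = sorted(a.keys() & b.keys())
    return {name: (a[name], b[name]) for name in shared if a[name] != b[name]}


# Hypothetical example tables standing in for the two result files.
wisconsin_tsv = (
    "seqName\tclade\n"
    "sample_1\t3C.2a1a\n"
    "sample_2\t3C.2a1b.2a.2\n"
    "sample_3\t3C.2a4\n"
)
darwin_tsv = (
    "seqName\tclade\n"
    "sample_1\t3C.2a1\n"
    "sample_2\t3C.2a1b.2a.2\n"
    "sample_3\t3C.2a\n"
)

for name, (old_ref, new_ref) in clade_disagreements(wisconsin_tsv, darwin_tsv).items():
    print(f"{name}: {old_ref} (Wisconsin/2005) vs {new_ref} (Darwin/2021)")
```

Sequences that only disagree because the newer dataset reports a parent clade (for example 3C.2a1 instead of 3C.2a1a) are the ones where the older reference is likely the better choice.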
Your question raises a good point: it’s not obvious which references are best for which data, and we should do a better job of pointing out the limitations of the different datasets.
Do reach out in the future if you have other questions regarding Nextclade usage and interpretation. Feedback like this is very helpful!
Best,
Cornelius