Samples identified as belonging to a parent clade

Hello all,

I have a sample that Nextclade assigns to 20H/501Y.V2, but when I build the tree for all my samples it is placed in 20C. I have manually verified the clade assignment from the sample’s BAM file, and when I color the tree by nucleotide at the two positions that separate 20H/501Y.V2 from 20C, both mutations are present.

I have verified that the clades.tsv file I’m using is up to date, and I’m struggling to think of what else I could be missing that’s causing this error. Samples in 20I/501Y.V1 are all correctly identified, so I know the pipeline can at least recognize the S:N501Y mutation.

Any suggestion as to where I’m going wrong with this is greatly appreciated!

Thanks in advance,
Hannah

Edit: I have solved the problem in what feels like a somewhat hacky way, by adding another sample that is in 20H/501Y.V2. This seems to encourage the algorithm to recognize that our sample is indeed a variant.

However, I feel there is likely another solution that would let me include only samples from my area and still get the correct clades. If anyone has an idea as to where I’m going wrong, I would appreciate knowing.

Thanks,
Hannah

Hello Hannah,

The clade assignment in the ncov pipeline works by identifying signature mutations and then labeling the largest clade that carries them. This can sometimes go wrong when the tree doesn’t have sufficient background diversity, which might have been the case in your example.
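To make the idea concrete, here is a minimal Python sketch of this kind of labeling. It is only an illustration, not the pipeline's actual code: the toy tree, the sequences, the positions, the clade names "parent" and "child", and the function names are all made up, and it assumes that an N at a signature position simply fails to match.

```python
from typing import Dict, List, Tuple

# A signature is a list of (0-based position, required allele) pairs.
Signature = List[Tuple[int, str]]

# Hypothetical clade definitions; the positions are invented for this sketch.
CLADE_DEFS: Dict[str, Signature] = {
    "parent": [(2, "T")],
    "child": [(2, "T"), (5, "G"), (8, "A")],
}

# Toy tree: node -> children. Nodes without an entry are tips.
TREE: Dict[str, List[str]] = {"root": ["n1", "s1"], "n1": ["s2", "s3"]}

BACKGROUND = "CCTCCCCCCC"
SEQS_COMPLETE = {
    "root": BACKGROUND, "n1": BACKGROUND, "s2": BACKGROUND, "s3": BACKGROUND,
    "s1": "CCTCCGCCAC",  # the lone sample carrying the full "child" signature
}
# Same data, but s1 has an N at one of the "child" signature positions.
SEQS_WITH_N = dict(SEQS_COMPLETE, s1="CCTCCNCCAC")


def matches(seq: str, signature: Signature) -> bool:
    """True if the sequence carries every allele in the signature."""
    return all(seq[pos] == allele for pos, allele in signature)


def count_tips(tree: Dict[str, List[str]], node: str) -> int:
    """Number of tips at or below the given node."""
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(count_tips(tree, child) for child in children)


def assign_clades(tree, seqs, clade_defs) -> Dict[str, str]:
    """Label the largest matching subtree for each clade, in the order given."""
    labels: Dict[str, str] = {}
    for clade, signature in clade_defs.items():
        candidates = [n for n in seqs if matches(seqs[n], signature)]
        if not candidates:
            continue  # nothing in the tree carries the full signature
        basal = max(candidates, key=lambda n: count_tips(tree, n))
        stack = [basal]
        while stack:  # later (more specific) clades overwrite earlier labels
            node = stack.pop()
            labels[node] = clade
            stack.extend(tree.get(node, []))
    return labels


print(assign_clades(TREE, SEQS_COMPLETE, CLADE_DEFS)["s1"])
print(assign_clades(TREE, SEQS_WITH_N, CLADE_DEFS)["s1"])
```

In this toy setup the first call labels s1 as "child", while the second leaves it as "parent", because no node in the tree carries the complete "child" signature any more.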

best,
richard

I agree with Richard here. My best guess is that the single 20H/501Y.V2 sample had an N at one of the signature mutation positions listed in ncov/clades.tsv (master branch of the nextstrain/ncov repository on GitHub), and that adding the 2nd sample made it clear this clade has all of these signature mutations.
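One way to test that guess would be to look at the sample's aligned sequence at each signature position. Below is a rough Python sketch of such a check. It assumes a tab-separated file with columns clade, gene, site, alt and 1-based nucleotide sites; the example rows, the clade name "example_clade", and the function names are hypothetical, so substitute the real rows from your own clades.tsv.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical excerpt in the assumed four-column layout (not real definitions).
EXAMPLE_TSV = (
    "clade\tgene\tsite\talt\n"
    "example_clade\tnuc\t3\tT\n"
    "example_clade\tnuc\t6\tG\n"
)


def nuc_signature(tsv_text: str, clade: str) -> List[Tuple[int, str]]:
    """Collect (1-based site, allele) pairs for the nucleotide rows of one clade."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (int(row["site"]), row["alt"])
        for row in reader
        if row["clade"] == clade and row["gene"] == "nuc"
    ]


def check_sites(aligned_seq: str, signature: List[Tuple[int, str]]) -> Dict[int, str]:
    """Report whether each signature site is a match, a mismatch, or ambiguous (N)."""
    report: Dict[int, str] = {}
    for site, alt in signature:
        base = aligned_seq[site - 1].upper()  # convert 1-based site to 0-based index
        if base == "N":
            report[site] = "ambiguous"
        elif base == alt:
            report[site] = "match"
        else:
            report[site] = "mismatch"
    return report


signature = nuc_signature(EXAMPLE_TSV, "example_clade")
print(check_sites("CCTCCNCCCC", signature))
```

Any site reported as "ambiguous" would be a candidate for why the sample fell back to the parent clade. The sequence needs to be in reference coordinates (aligned to the reference) for the site numbers to line up.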

I think that clades are allowed to have just a single representative (as intended), but I’d have to confirm this to be sure.