I’m new to both Nextstrain and COVID genetics, although I do have prior experience with phylogenetics and sequence alignment. I’m working with some samples and trying to determine whether they are likely members of the B.1.1.7 lineage. I’ve run ncov, and in the resulting tree these samples clearly cluster with the samples flagged as B.1.1.7 in the pangolin metadata field from the subset of the global build that I used.
I was also looking at the alignment BAMs and the input FASTAs to make some summary plots of the mutations associated with this lineage, based on Table 1 that I found at this link. In the BAM files for these samples, there is very clearly a 3 bp deletion at position 21991, which would correspond to a known mutation for the B.1.1.7 lineage and is supported by a large number of reads. It is also present in the FASTA sequences, derived from these sequence data, that I am providing to ncov. However, this deletion is missing from my local Auspice results. I also found that this deletion does not appear to be present for any samples in the current live Nextstrain global build (based on S-protein genotype “69, 144”, which I expected to be “-/-” for the B.1.1.7 subclade).
This is pretty surprising to me and I’m concerned I’m missing something. In the sequence-diagnostics.tsv file I see a gap that would correspond to this deletion, both for my own samples where the deletion is in the BAMs and for samples from the global build with the B.1.1.7 annotation. So the deletion seems to be getting removed either during the mafft step or in some downstream filtering step. Is this intentional?
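In case it helps anyone reproduce this, here is a rough sketch of the kind of check I have in mind for narrowing down which step drops the deletion. It assumes an aligned FASTA whose columns are in reference coordinates (no insertions relative to the reference), and it simply reports what each sequence has at positions 21991 to 21993. The function names and the file path are made up for illustration and are not part of ncov. I also report N separately from a real base, on the assumption that a gap replaced by N would look different from a gap that was filled with reference sequence.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:].strip()
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def deletion_status(seq: str, start: int = 21991, length: int = 3) -> str:
    """Classify the window at 1-based position `start` in an aligned sequence.

    Assumes `seq` is in reference coordinates (hypothetical precondition).
    """
    window = seq[start - 1:start - 1 + length]
    if len(window) < length:
        return "out_of_range"
    if all(c == "-" for c in window):
        return "deleted"
    if all(c in "Nn" for c in window):
        return "masked"
    return "present"


def scan(path: str, start: int = 21991, length: int = 3) -> dict:
    """Map each sequence name in an aligned FASTA file to its status."""
    with open(path) as handle:
        return {name: deletion_status(seq, start, length)
                for name, seq in read_fasta(handle)}


# Example call; the path is a placeholder, not a real ncov output name:
# for name, status in scan("path/to/aligned.fasta").items():
#     print(name, status)
```

Running the same scan on the alignment written by each step should show where "deleted" stops being reported for the B.1.1.7 samples.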
Apologies if I’ve missed a prior discussion about this. I searched here and on the ncov GitHub issues page before posting but didn’t find a relevant topic.