We are trying to assign RSV lineages based on partial sequences of L, as only some of the amplicons succeeded. We tested how well the available regions are suited to determining the lineage and found that in about 8% of cases the parent lineage is assigned and in about 2% of cases a different lineage is assigned. I saw in the Nextclade documentation, under Algorithms and phylogenetic placement, that a distance metric is computed between each query sequence and each node in the tree. I have not yet looked at the code, but wanted to ask whether it is possible to extract that distance metric for each sequence and node in the tree. I could attempt to do this myself if you point me in the right direction.
That section also says that if multiple candidate attachment nodes have the same distance, Nextclade can use a “placement prior” to pick the most likely node based on its prevalence in the overall sequence data. I wonder whether the misassignment comes from Nextclade simply choosing the more prevalent lineage, and whether a set of candidate lineages could be returned instead.
(there are unit tests at the bottom of the file to give you some idea of how it works roughly)
And then, for a given query sequence, the ref nodes are ranked by distance, factoring in the priors, here:
(the function name and comment are confusing - it no longer picks one node, but outputs an array)
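To make the tie situation concrete, here is a minimal sketch of ranking reference nodes by distance with a prior as tie-breaker. It is not Nextclade's actual code: the function names, the toy sequences, and the way the prior is applied (only to order nodes with equal distance) are all assumptions for illustration. It shows how a partial query can be equally close to a parent and a child lineage when the distinguishing site is not covered.

```python
from typing import Dict, List, Tuple

# Hypothetical illustration only; not taken from the Nextclade source.

def mismatch_distance(query: str, node_seq: str) -> int:
    """Count differing positions, skipping sites the query does not cover ('N' or '-')."""
    return sum(1 for q, r in zip(query, node_seq) if q not in "N-" and q != r)

def rank_candidates(
    query: str, nodes: Dict[str, str], priors: Dict[str, float]
) -> List[Tuple[str, int, float]]:
    """Return (name, distance, prior) sorted by distance, then by descending prior."""
    scored = [
        (name, mismatch_distance(query, seq), priors.get(name, 0.0))
        for name, seq in nodes.items()
    ]
    scored.sort(key=lambda t: (t[1], -t[2], t[0]))
    return scored

def tied_best(ranked: List[Tuple[str, int, float]]) -> List[str]:
    """Return every node that shares the minimum distance, instead of a single pick."""
    if not ranked:
        return []
    best = ranked[0][1]
    return [name for name, dist, _ in ranked if dist == best]

# Toy data: the child differs from the parent only at a site the query lacks.
nodes = {"parent": "ACGTACGT", "child": "ACGTTCGT", "other": "TCGTACGT"}
priors = {"parent": 0.7, "child": 0.3, "other": 0.1}
query = "ACNNNNGT"

ranked = rank_candidates(query, nodes, priors)
candidates = tied_best(ranked)
```

With this toy input the prior puts the parent first, but `candidates` keeps both the parent and the child, which is the kind of "set of lineages" output asked about above.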
And then the actual nearest node selection happens here:
There you will also see that there is an undocumented --include-nearest-node-info CLI arg, which adds this info to the output JSON file (--output-json). This is something our scientists use for debugging sometimes. You can probably dump this and some more information to a file and study it.
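If you want to inspect that extra output without knowing its exact layout, a small script along these lines may help. It is only a sketch: it assumes the JSON file has a top-level `results` list whose entries carry a `seqName`, and because the flag is undocumented it does not assume any particular field name for the nearest-node data. Instead it collects every per-sequence field whose key contains "nearest". The helper names are made up for this example.

```python
import json
from typing import Any, Dict

def load_json(path: str) -> Any:
    """Load a JSON file, e.g. the one written by --output-json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def collect_nearest_info(output: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each sequence name to the fields whose key mentions 'nearest'.

    Assumption: results live under a top-level "results" list. The exact
    name of the nearest-node field is not assumed, only matched loosely.
    """
    info: Dict[str, Dict[str, Any]] = {}
    for result in output.get("results", []):
        name = result.get("seqName", "<unnamed>")
        info[name] = {k: v for k, v in result.items() if "nearest" in k.lower()}
    return info

# Usage, with a placeholder file name:
#   info = collect_nearest_info(load_json("nextclade.json"))
#   for seq_name, fields in info.items():
#       print(seq_name, json.dumps(fields)[:200])
```

Sequences with an empty dictionary had no matching field, which would suggest the flag was not active for that run or that the field is named differently in your version.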
That section also says that if multiple candidate attachment nodes have the same distance, Nextclade can use a “placement prior” to pick the most likely node based on its prevalence in the overall sequence data. I wonder whether the misassignment comes from Nextclade simply choosing the more prevalent lineage, and whether a set of candidate lineages could be returned instead.
Among the official Nextstrain-maintained datasets, I think placement priors are currently only used in the SARS-CoV-2 datasets. You can check for sure by looking for a placement_prior field in your tree.json.
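A quick way to do that check is to walk the whole JSON and count occurrences of the key. This is a sketch that deliberately makes no assumption about where in the tree the field is stored; the function name is made up, and the traversal is iterative so that deeply nested trees do not hit Python's recursion limit.

```python
import json
from typing import Any

def count_key(obj: Any, key: str = "placement_prior") -> int:
    """Count how many times `key` appears as a dictionary key anywhere in obj."""
    count = 0
    stack = [obj]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            for k, v in current.items():
                if k == key:
                    count += 1
                stack.append(v)
        elif isinstance(current, list):
            stack.extend(current)
    return count

# Usage, with the dataset's tree file:
#   with open("tree.json", encoding="utf-8") as fh:
#       n = count_key(json.load(fh))
#   print("nodes with placement_prior:", n)
```

A count of zero would mean the dataset carries no priors, so the prevalence-based tie-breaking could not explain the assignments you are seeing.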
If you think that there’s a bug in clade assignment, you can submit an issue in the GitHub repo and our team will investigate. Please provide full details: full command line invocation, problematic sequences (can be shared privately over email once a team member replies to the issue), expected vs observed results.