Nextclade set of possible lineages or distance metric

Thomas · September 10, 2024, 4:53pm

Hi,

We are trying to assign RSV lineages based on partial sequences of the L gene, as only some of the amplicons succeeded. We tested how well the available regions are suited to determining the lineage and found that in about 8% of cases the parent lineage is assigned, and in about 2% of cases a different lineage is assigned. I saw in the Nextclade documentation, under Algorithms and phylogenetic placement, that a distance metric is computed between each sequence and each node in the tree. I have not yet looked at the code, but wanted to ask whether it is possible to extract that distance for each sequence and node in the tree. I could attempt to do this myself if you point me in the right direction.

It also says in that section that if multiple candidate attachment nodes with the same distance exist, Nextclade can use a “placement prior” to pick the most likely node based on its prevalence in the overall sequence data. I wonder whether the misassignment comes from Nextclade simply picking the more prevalent lineage, and whether a set of lineages could be returned instead.

Thank you!
Thomas

ivan-aksamentov · September 18, 2024, 1:00pm

Hi @Thomas

The distance code is here:

https://github.com/nextstrain/nextclade/blob/48801faba4a364578350c938b80a6c33ae1e5ed5/packages/nextclade/src/tree/tree_find_nearest_node.rs#L67-L122

          /// Calculates distance metric between a given query sample and a tree node
          fn tree_calculate_node_distance(
            node: &AuspiceGraphNodePayload,
            qry_nuc_subs: &[NucSub],
            qry_missing: &[NucRange],
            aln_range: &NucRefGlobalRange,
            masked_ranges: &[NucRefGlobalRange],
          ) -> i64 {
            let mut shared_differences = 0_i64;
            let mut shared_sites = 0_i64;
          
            // Mask effectively turns query mutations into missing
            // Rest of logic is the same once qry_nuc_subs and qry_missing are mutated
            // Remove from qry_nuc_subs all mutations that are masked
            let masked_qry_nuc_subs = qry_nuc_subs
              .iter()
              .filter(|sub| !masked_ranges.iter().any(|range| range.contains(sub.pos)))
              .collect_vec();
          
            // Add all masked ranges to qry_missing

(Excerpt truncated.)

(there are unit tests at the bottom of the file to give you some idea of how it works roughly)
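
To make the idea concrete for partial sequences, here is a minimal sketch of a distance of this kind. It is not the Nextclade implementation: the excerpt above is truncated, so the formula below is an assumption based on the variable names visible in it (`shared_differences`, `shared_sites`) plus the idea that node mutations falling in regions missing from the query are not counted. The types `Sub` and `MissingRange` and the function `sketch_distance` are hypothetical, and masking and the alignment range are left out.

```rust
use std::collections::HashMap;

/// A substitution relative to the reference: position and derived base.
/// Hypothetical simplified type, not Nextclade's `NucSub`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Sub {
    pub pos: usize,
    pub nuc: char,
}

/// Half-open range [begin, end) of reference positions without query data.
/// Hypothetical simplified type, not Nextclade's `NucRange`.
#[derive(Clone, Copy, Debug)]
pub struct MissingRange {
    pub begin: usize,
    pub end: usize,
}

/// Assumed formula: mutations not shared between node and query, ignoring
/// node mutations at positions that the query does not cover.
/// Assumes the query has no substitutions inside its own missing ranges.
pub fn sketch_distance(node_subs: &[Sub], qry_subs: &[Sub], qry_missing: &[MissingRange]) -> i64 {
    let qry_by_pos: HashMap<usize, char> = qry_subs.iter().map(|s| (s.pos, s.nuc)).collect();

    let mut shared_differences = 0_i64; // same position, same derived base
    let mut shared_sites = 0_i64; // same position, different derived base
    let mut undetermined = 0_i64; // node mutation lies in a query gap

    for s in node_subs {
        if qry_missing.iter().any(|r| r.begin <= s.pos && s.pos < r.end) {
            undetermined += 1;
        } else if let Some(&nuc) = qry_by_pos.get(&s.pos) {
            if nuc == s.nuc {
                shared_differences += 1;
            } else {
                shared_sites += 1;
            }
        }
    }

    node_subs.len() as i64 + qry_subs.len() as i64
        - 2 * shared_differences
        - shared_sites
        - undetermined
}
```

Under these assumptions, a query that lacks coverage at the position of a lineage-defining mutation gets the same distance to the parent node and to the child node, which is one way a parent lineage could end up being assigned to a partial sequence.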

And then, for a given query sequence, the ref nodes are ranked by distance, factoring in the priors, here:

https://github.com/nextstrain/nextclade/blob/48801faba4a364578350c938b80a6c33ae1e5ed5/packages/nextclade/src/tree/tree_find_nearest_node.rs#L12-L53

          /// Distance and placement prior for a ref tree node
          pub struct TreePlacementInfo {
            pub node_key: GraphNodeKey,
            pub distance: i64,
            pub prior: f64, // prior in non-log scale
          }
          
          /// For a given query sample, finds nearest node on the reference tree (according to the distance metric)
          pub fn graph_find_nearest_nodes(
            graph: &AuspiceGraph,
            qry_nuc_subs: &[NucSub],
            qry_missing: &[NucRange],
            aln_range: &NucRefGlobalRange,
          ) -> Result<Vec<TreePlacementInfo>, Report> {
            let masked_ranges = graph.data.meta.placement_mask_ranges();
          
            // Iterate over tree nodes and calculate distance metric between the sample and each node
            let nodes_by_placement_score = DftPre::new(graph.get_exactly_one_root()?, |node| graph.iter_children_of(node))
              .map(|(_, node)| {
                let node_payload = node.payload();

(Excerpt truncated.)

(the function name and comment are confusing - it no longer picks one node, but outputs an array)

And then the actual nearest node selection happens here:

https://github.com/nextstrain/nextclade/blob/48801faba4a364578350c938b80a6c33ae1e5ed5/packages/nextclade/src/run/nextclade_run_one.rs#L277-L291

          let nearest_node_candidates = graph_find_nearest_nodes(graph, &substitutions, &missing, &alignment_range)?;
          let nearest_node_id = nearest_node_candidates[0].node_key;
          let nearest_node = graph.get_node(nearest_node_id)?.payload();
          let nearest_node_name = nearest_node.name.clone();
          
          let nearest_nodes = params.general.include_nearest_node_info.then_some(
            nearest_node_candidates
              .iter()
              // Choose all nodes with distance equal to the distance of the nearest node
              .filter(|n| n.distance == nearest_node_candidates[0].distance)
              .map(|n| Ok(graph.get_node(n.node_key)?.payload().name.clone()))
              .collect::<Result<Vec<String>, Report>>()?,
          );
          
          let clade = nearest_node.clade();

There you will also see that there is an undocumented --include-nearest-node-info CLI arg, which adds this info to the output JSON file (--output-json). This is something our scientists use for debugging sometimes. You can probably dump this and some more information to a file and study it.
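
For the "set of lineages" part of the question, the same tie rule as in the excerpt (keep every node whose distance equals the smallest distance) can be applied to collect clades rather than node names. The sketch below is only an illustration: `Candidate` and `tied_clades` are hypothetical names, the data would have to be filled in from whatever you dump, and the layout of the JSON written by `--include-nearest-node-info` is not shown in this thread, so no parsing is attempted.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-in for one placement candidate.
#[derive(Clone, Debug)]
pub struct Candidate {
    pub node_name: String,
    pub clade: String,
    pub distance: i64,
}

/// Returns the distinct clades of all candidates tied at the minimum distance.
/// An empty input yields an empty set.
pub fn tied_clades(candidates: &[Candidate]) -> BTreeSet<String> {
    let min_distance = match candidates.iter().map(|c| c.distance).min() {
        Some(d) => d,
        None => return BTreeSet::new(),
    };

    candidates
        .iter()
        .filter(|c| c.distance == min_distance)
        .map(|c| c.clade.clone())
        .collect()
}
```

If the returned set has more than one element, the available regions cannot distinguish those lineages by distance alone, and reporting the whole set may be more honest than reporting a single lineage.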

It also says in that section that if multiple candidate attachment nodes with the same distance exist, Nextclade can use a “placement prior” to pick the most likely node based on its prevalence in the overall sequence data. I wonder whether the misassignment comes from Nextclade simply picking the more prevalent lineage, and whether a set of lineages could be returned instead.

Among the official Nextstrain-maintained datasets, I think placement priors are currently only used in the SARS-CoV-2 datasets. You can check for sure by looking for a placement_prior field in your tree.json.
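
A crude way to run that check without assuming where in the tree the field is stored is a plain text search for the quoted key. This is only a sketch: the function names are made up for illustration, and a substring match could in principle also hit the same text inside an unrelated string value.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns true if the JSON text contains a quoted `placement_prior` key.
/// This is a textual check, not a JSON parse.
pub fn mentions_placement_prior(tree_json_text: &str) -> bool {
    tree_json_text.contains("\"placement_prior\"")
}

/// Reads the file at `path` and applies the same textual check.
pub fn file_mentions_placement_prior(path: &Path) -> io::Result<bool> {
    let text = fs::read_to_string(path)?;
    Ok(mentions_placement_prior(&text))
}
```

Calling `file_mentions_placement_prior(Path::new("tree.json"))` on the dataset's tree file would then tell you whether priors can play any role in your placements.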

If you think that there’s a bug in clade assignment, you can submit an issue in the GitHub repo and our team will investigate. Please provide full details: full command line invocation, problematic sequences (can be shared privately over email once a team member replies to the issue), expected vs observed results.

Thomas · October 6, 2024, 6:03pm

Hi Ivan,

I missed your reply. This is exactly what I was looking for!

Thank you!
Thomas
