Creating a dataset with parent-child sequence relations from Auspice tree

lpodina · March 20, 2023, 4:37pm

Hi,

I’m trying to create a dataset with parent-child sequence relationships from the Auspice tree here: https://data.nextstrain.org/files/ncov/open/global/global.json. It’s important that the parents and children are Spike amino acid sequences, and that they are the same length, so they would have to be aligned and padded to some maximum length.

I have managed to traverse the tree to get pairs of parent-child sequences (immediate parent/child) but I am not able to reconstruct the sequences themselves from the mutations. I suppose that’s because the mutations are written with respect to some parent sequence rather than the root sequence. This makes reconstructing the exact sequences a little hard. I’m interested in Spike translations only so my best idea so far is to traverse the tree and grab the translations through GenBank, then align the pairs to obtain mutations. However that means I’m not making use of the mutations that are already recorded within the Auspice tree. Is there a faster/easier way, or a dataset I can look through?

Finally, maybe the Auspice tree is not the best data for my purpose. Please let me know if you know of a dataset that is more suitable.

Thank you!
Lena

corneliusroemer · March 21, 2023, 1:44pm

Hi Lena, happy to help. What you are trying to do should be possible. However, there is probably an easier way to get at what you want. Can you roughly outline what you are trying to do with the sequences?

lpodina · March 21, 2023, 2:00pm

Hi Cornelius,

I am creating this dataset so that I could train a machine learning model on it. The eventual goal is to predict mutations for a new parent sequence that it hasn’t seen yet. But for that it needs to see a whole bunch of samples of parent-child pairs of sequences where the child has definitely mutated from the parent.

corneliusroemer · March 21, 2023, 2:29pm

Thanks for the explanation!

The auspice.json contains mutations that happened on each branch (=edge) leading into a node (vertex) as seen from the root.

So to reconstruct the state at a particular (internal) node, all you have to do is start from the root with the reference Spike sequence and mutate it along the path to that node based on what’s in branch_attrs.mutations.S. You can ignore the first letter (the L, i.e. the previous amino acid) in the notation L452R, as all you need to know is that from that node onwards, position 452 is in state R unless it gets mutated again. I think we have a script somewhere that does exactly this but I couldn’t find it.
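A minimal Python sketch of that traversal, not the script mentioned above. It assumes the Auspice v2 JSON layout (nodes with `name`, `children` and `branch_attrs.mutations.S`, with the tree itself under the top-level `tree` key), that `root_seq` is the Spike amino acid sequence at the root node, and that deletions are written as `-` in reference coordinates so every reconstructed sequence keeps the same length. The function names and the toy tree are made up for illustration.

```python
import re

# Matches mutations such as "L452R", "D3-" (deletion) or "Q5*" (stop codon).
MUTATION = re.compile(r"^([A-Z*\-])(\d+)([A-Z*\-])$")


def apply_mutations(parent_seq, mutations):
    """Return a copy of parent_seq with mutations such as 'L452R' applied."""
    seq = list(parent_seq)
    for mut in mutations:
        match = MUTATION.match(mut)
        if match is None:
            raise ValueError(f"Unrecognised mutation: {mut!r}")
        _old, pos, new = match.groups()
        seq[int(pos) - 1] = new  # positions are 1-based; the old state is ignored
    return "".join(seq)


def parent_child_pairs(root, root_seq, gene="S"):
    """Yield (parent_name, child_name, parent_seq, child_seq) for every
    branch that carries at least one mutation in the given gene."""
    stack = [(root, root_seq)]
    while stack:
        node, seq = stack.pop()
        for child in node.get("children", []):
            muts = child.get("branch_attrs", {}).get("mutations", {}).get(gene, [])
            child_seq = apply_mutations(seq, muts)
            if muts:
                yield node["name"], child["name"], seq, child_seq
            stack.append((child, child_seq))


# Toy tree standing in for json.load(open("global.json"))["tree"] (assumed layout).
toy_tree = {
    "name": "root",
    "children": [
        {
            "name": "A",
            "branch_attrs": {"mutations": {"S": ["C2R"]}},
            "children": [
                {"name": "B", "branch_attrs": {"mutations": {"S": ["R2K", "D3-"]}}},
            ],
        },
        {"name": "C", "branch_attrs": {"mutations": {"nuc": ["A1T"]}}},
    ],
}

pairs = list(parent_child_pairs(toy_tree, "ACDE"))
for parent, child, parent_seq, child_seq in pairs:
    print(parent, child, parent_seq, child_seq)
```

Branches without Spike mutations (like `C` here) are skipped because the child would be identical to its parent. Insertions are not handled in this sketch.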

The auspice.json is actually constructed using the full sequence information (contained in the output of augur ancestral). But this is internal to the workflow, so it may not be publicly accessible.

An alternative is to use the file pango-consensus-sequences_summary.json in the corneliusroemer/pango-sequences repository on GitHub, which contains full sequence information (including parent) for each pango lineage.

While this may not be a big enough dataset, it has the advantage of being well curated, so there are fewer artefacts/tree building errors. The children are not individual sequences though, but ancestors of notable clusters.

If you use the auspice.json you linked to, note that the tree contains a lot of tree errors due to homoplasy (same mutation happening independently) and sequencing artefacts. It’s only a very rough approximation of the truth.

Another dataset you could use, which is much bigger, is the Usher tree that is curated by Angie Hinrichs. The public tree contains ~6m sequences annotated with mutations (except deletions and insertions) in a well-structured and documented protobuf, available in the download directory /goldenPath/wuhCor1/UShER_SARS-CoV-2.

Do I understand correctly that you want to predict mutations typically seen in children based on parental sequences? Or the other way round? Parent from a child?

lpodina · March 21, 2023, 9:55pm

Thank you for the detailed answer. Just to confirm, for the tree I have linked, if sequence A is an ancestor of another sequence B, is it certain that B mutated from A? Moreover, is this also true if A is the (a?) parent of B? I am trying to predict children’s mutations from the parent, so if the relationships are not exact, then perhaps it would be best if I used a different dataset. I will have a look at the other public tree you have linked.

Also, if you happen to know of any work that has done this exact task, please let me know! It doesn’t matter to me whether they have used ML or not but I am taking an ML approach. If you don’t know of any such work that’s fine, I just thought if you knew some off the top of your head then it would be great if you could let me know.

Thanks!

lpodina · March 22, 2023, 4:19pm

Just another quick question: I think the Usher tree you have linked will be great for my purposes, but I am unfamiliar with the structure or how to parse it. How can I obtain a json or tree-like structure from the files? If you know of any resources it would be great if you could link them. Thank you!

corneliusroemer · March 23, 2023, 5:53pm

> Just to confirm, for the tree I have linked, if sequence A is an ancestor of another sequence B, is it certain that B mutated from A?

I assume you mean the link in the original question: https://data.nextstrain.org/files/ncov/open/global/global.json

It is likely but not certain that the actual mutation path taken by B went via A.

> Moreover, is this also true if A is the (a?) parent of B? I am trying to predict children’s mutations from the parent, so if the relationships are not exact, then perhaps it would be best if I used a different dataset. I will have a look at the other public tree you have linked.

I think you have a misconception here: there is no such thing as a “true parent”, in the sense that we know exactly which virus led to which. By “parent” we mean “ancestor”, not “parent” as distinct from “grandparent”. There’s not really such a concept as “direct parent” (and not grandparent etc.) in phylogenies. All you know is that B evolved from A. But A will have had many kids, and they will have had many kids, etc.

> Also, if you happen to know of any work that has done this exact task, please let me know! It doesn’t matter to me whether they have used ML or not but I am taking an ML approach. If you don’t know of any such work that’s fine, I just thought if you knew some off the top of your head then it would be great if you could let me know.

Trying to predict evolution is a long-standing problem in influenza, for example. People have tried all sorts of things, using sequences, experimental data, etc.

You can try using big trees and an ML approach, but usually when outsiders (people from outside phylogenetics/virus evolution) try their favorite ML approach, they miss a lot of nuance in the data, and that leads to results that aren’t very useful. But it’s a fun thing to try for sure! And who knows, maybe you will make a breakthrough.

Beware that if you want to truly predict something, you need to make sure that you have a date cutoff: you can only use a tree built on date X to validate your predictions on what may or may not happen at a later date Y. If you take one tree and split it into test/train, you will have contamination.

For what you describe, I would definitely use the Usher tree as it is much bigger and more fine-grained.

corneliusroemer · March 23, 2023, 5:56pm

> I think the Usher tree you have linked will be great for my purposes, but I am unfamiliar with the structure or how to parse it. How can I obtain a json or tree-like structure from the files? If you know of any resources it would be great if you could link them. Thank you!

The docs are here: matUtils — usher_wiki 0.0.2 documentation

This is not a Nextstrain project, and I haven’t worked with these protobuf files myself, so it’s best to ask the devs directly or post somewhere like https://www.biostars.org/, as few people on this forum will be able to help. But the docs should help.
