Creating a dataset with parent-child sequence relations from Auspice tree

Hi,

I’m trying to create a dataset with parent-child sequence relationships from the Auspice tree here: https://data.nextstrain.org/files/ncov/open/global/global.json. It’s important that the parents and children are Spike amino acid sequences and that they are the same length, so they would have to be aligned and padded to some maximum length.

I have managed to traverse the tree to get pairs of parent-child sequences (immediate parent/child) but I am not able to reconstruct the sequences themselves from the mutations. I suppose that’s because the mutations are written with respect to some parent sequence rather than the root sequence. This makes reconstructing the exact sequences a little hard. I’m interested in Spike translations only so my best idea so far is to traverse the tree and grab the translations through GenBank, then align the pairs to obtain mutations. However that means I’m not making use of the mutations that are already recorded within the Auspice tree. Is there a faster/easier way, or a dataset I can look through?

Finally, maybe the Auspice tree is not the best data for my purpose. Please let me know if you know of a dataset that is more suitable.

Thank you!
Lena

Hi Lena, happy to help. What you are trying to do should be possible. However, there is probably an easier way to get at what you want. Can you roughly outline what you are trying to do with the sequences?

Hi Cornelius,

I am creating this dataset so that I can train a machine learning model on it. The eventual goal is to predict mutations for a new parent sequence that the model hasn’t seen yet. But for that it needs to see a whole bunch of samples of parent-child pairs of sequences where the child has definitely mutated from the parent.

Thanks for the explanation!

The auspice.json contains mutations that happened on each branch (=edge) leading into a node (vertex) as seen from the root.

So to reconstruct the state at a particular (internal) node, all you have to do is start from the root with the reference Spike sequence and mutate it based on what’s in branch_attrs.mutations.S along the path to that node. You can ignore the first letter (the L, which is the original amino acid) in the notation L452R, as all you need to know is that from that node onwards, position 452 is in state R unless it gets mutated again. I think we have a script somewhere that does exactly this but I couldn’t find it.
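A minimal sketch of that traversal in Python, in case it helps. It is not the script mentioned above. It assumes the tree object is the "tree" entry of the Auspice JSON, that each node has "name", optional "children" and optional branch_attrs.mutations.S, that root_seq is the Spike amino acid sequence at the root node (you have to supply it yourself), and that deletions appear as "-" so all sequences keep the same length. The names apply_mutations and parent_child_pairs are made up for this example.

```python
import re

# Matches amino acid mutations such as "L452R", "H69-" or "Q493*"
MUTATION = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def apply_mutations(seq, mutations):
    """Return a copy of seq with mutations like 'L452R' applied (1-based positions)."""
    residues = list(seq)
    for mut in mutations:
        match = MUTATION.match(mut)
        if match is None:
            raise ValueError(f"unexpected mutation format: {mut}")
        _old, pos, new = match.groups()
        # The original residue is ignored; only the new state matters
        residues[int(pos) - 1] = new
    return "".join(residues)


def parent_child_pairs(root, root_seq, gene="S"):
    """Yield (parent_name, parent_seq, child_name, child_seq) for every branch.

    Assumes root_seq is the sequence at the root node itself.
    """
    stack = [(root, root_seq)]  # iterative traversal, safe for deep trees
    while stack:
        node, seq = stack.pop()
        for child in node.get("children", []):
            muts = child.get("branch_attrs", {}).get("mutations", {}).get(gene, [])
            child_seq = apply_mutations(seq, muts)
            yield node["name"], seq, child["name"], child_seq
            stack.append((child, child_seq))
```

To keep only pairs where the child differs from its parent in Spike, filter on parent_seq != child_seq. Insertions are not handled here.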

The auspice.json is actually constructed using the full sequence information (contained in the output of augur ancestral). But this is internal to the workflow, so it may not be publicly accessible.

An alternative is to use the file pango-consensus-sequences_summary.json in the corneliusroemer/pango-sequences repository on GitHub (main branch), which contains full sequence information (including the parent) for each Pango lineage.

While this dataset may not be big enough, it has the advantage of being well curated, so there are fewer artefacts and tree-building errors. The children are not individual sequences though, but ancestors of notable clusters.

If you use the auspice.json you linked to, note that the tree contains a lot of tree errors due to homoplasy (same mutation happening independently) and sequencing artefacts. It’s only a very rough approximation of the truth.

Another, much bigger dataset you could use is the UShER tree that is curated by Angie Hinrichs. The public tree contains ~6 million sequences annotated with mutations (except deletions and insertions) in a well-structured and documented protobuf: Index of /goldenPath/wuhCor1/UShER_SARS-CoV-2

Do I understand correctly that you want to predict mutations typically seen in children based on parental sequences? Or the other way round? Parent from a child?

Thank you for the detailed answer. Just to confirm, for the tree I have linked, if sequence A is an ancestor of another sequence B, is it certain that B mutated from A? Moreover, is this also true if A is the (a?) parent of B? I am trying to predict children’s mutations from the parent, so if the relationships are not exact, then perhaps it would be best if I used a different dataset. I will have a look at the other public tree you have linked.

Also, if you happen to know of any work that has done this exact task, please let me know! It doesn’t matter to me whether they have used ML or not but I am taking an ML approach. If you don’t know of any such work that’s fine, I just thought if you knew some off the top of your head then it would be great if you could let me know.

Thanks!

Just another quick question: I think the Usher tree you have linked will be great for my purposes, but I am unfamiliar with the structure or how to parse it. How can I obtain a json or tree-like structure from the files? If you know of any resources it would be great if you could link them. Thank you!

Just to confirm, for the tree I have linked, if sequence A is an ancestor of another sequence B, is it certain that B mutated from A?

I assume you mean the link in the original question: https://data.nextstrain.org/files/ncov/open/global/global.json

It is likely but not certain that the actual mutation path taken by B went via A.

Moreover, is this also true if A is the (a?) parent of B? I am trying to predict children’s mutations from the parent, so if the relationships are not exact, then perhaps it would be best if I used a different dataset. I will have a look at the other public tree you have linked.

I think you have a misconception here: there is no such thing as a “true parent”, in the sense that we know exactly which virus led to which. By “parent” we mean “ancestor”, not “parent” as distinct from “grandparent”. There’s not really such a concept as a “direct parent” (as opposed to a grandparent etc.) in phylogenies. All you know is that B evolved from A. But A will have had many kids, and they will have had many kids etc.

Also, if you happen to know of any work that has done this exact task, please let me know! It doesn’t matter to me whether they have used ML or not but I am taking an ML approach. If you don’t know of any such work that’s fine, I just thought if you knew some off the top of your head then it would be great if you could let me know.

Trying to predict evolution is a long-standing problem, in influenza for example. People have tried all sorts of things, using sequences, experimental data etc.

You can try using big trees and an ML approach, but usually when outsiders (people from outside phylogenetics/virus evolution) try their favorite ML approach, they miss a lot of nuance in the data, and that leads to results that aren’t very useful. But it’s a fun thing to try for sure! And who knows, maybe you do make a breakthrough.

Beware that if you want to truly predict something, you need to make sure that you have a date cutoff: train only on a tree built from data available up to date X, and validate your predictions against what did or did not happen by a later date Y. If you take one tree and split it into test/train, you will have contamination.
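As a rough illustration of a date cutoff, here is a small Python sketch that splits pairs by the child’s date. The function name split_by_date and the (parent_seq, child_seq, child_date) tuple layout are made up for this example, and the date is assumed to be a decimal year. Note that this is only an approximation of the point above: splitting a single tree by date can still leak information, because the tree and its inferred ancestors were built using later sequences. Ideally the training tree itself would be built only from sequences sampled up to the cutoff.

```python
def split_by_date(pairs, cutoff):
    """Split (parent_seq, child_seq, child_date) tuples at a decimal-year cutoff.

    Pairs whose child is dated at or before the cutoff go to train,
    later ones go to test. This alone does not remove contamination
    if all pairs come from a single tree built on the full data.
    """
    train, test = [], []
    for parent_seq, child_seq, child_date in pairs:
        target = train if child_date <= cutoff else test
        target.append((parent_seq, child_seq))
    return train, test
```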

For what you describe, I would definitely use the UShER tree, as it is much bigger and more fine-grained.

I think the Usher tree you have linked will be great for my purposes, but I am unfamiliar with the structure or how to parse it. How can I obtain a json or tree-like structure from the files? If you know of any resources it would be great if you could link them. Thank you!

The docs are here: the matUtils page of the usher_wiki documentation.

This is not a Nextstrain project, and I haven’t worked with these protobuf files myself, so best ask the devs directly or post somewhere like https://www.biostars.org/ - few people on this forum will be able to help. But the docs should help.