I’m trying to create a dataset of parent-child sequence pairs from the Auspice tree here: https://data.nextstrain.org/files/ncov/open/global/global.json. It’s important that the parents and children are Spike amino acid sequences, and that they are the same length, so they would have to be aligned and padded to some maximum length.
I have managed to traverse the tree to get immediate parent-child pairs, but I am not able to reconstruct the sequences themselves from the mutations. I suppose that’s because the mutations on each branch are written relative to that node’s parent sequence rather than the root sequence, so reconstructing the exact sequences is not straightforward. I’m interested in Spike translations only, so my best idea so far is to traverse the tree, fetch the translations from GenBank, and then align each pair to obtain the mutations. However, that means I’m not making use of the mutations that are already recorded in the Auspice tree. Is there a faster or easier way, or a dataset I can look through?
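For reference, here is a minimal sketch of the reconstruction being asked about: walk the tree from the root and apply each branch's mutations to the parent's sequence to get the child's. It assumes the Auspice v2 layout, where each node has `name`, optional `children`, and `branch_attrs["mutations"]` keyed by gene (for example `"S"`) with entries such as `"D614G"`. It also assumes the root Spike amino acid sequence is already available as a string (I have not confirmed where this dataset stores it). The names `apply_mutations` and `parent_child_pairs` are hypothetical helpers, not part of any Nextstrain tooling.

```python
import re

# One amino-acid change such as "D614G"; "-" marks a deletion, "*" a stop codon.
MUTATION = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def apply_mutations(seq, mutations):
    """Return a copy of seq with mutations like 'D614G' applied (1-based positions)."""
    chars = list(seq)
    for mut in mutations:
        match = MUTATION.match(mut)
        if match is None:
            raise ValueError(f"unrecognised mutation: {mut!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        idx = pos - 1
        # Sanity check: the 'from' state must match the parent's residue.
        if chars[idx] != ref:
            raise ValueError(f"{mut}: expected {ref} at {pos}, found {chars[idx]}")
        chars[idx] = alt
    return "".join(chars)


def gene_mutations(node, gene):
    """Mutations recorded on the branch leading to node (empty list if none)."""
    return node.get("branch_attrs", {}).get("mutations", {}).get(gene, [])


def parent_child_pairs(tree, root_seq, gene="S"):
    """Yield (parent_name, parent_seq, child_name, child_seq) for every branch."""
    # Assumption: any mutations on the root node are relative to root_seq.
    stack = [(tree, apply_mutations(root_seq, gene_mutations(tree, gene)))]
    while stack:  # iterative traversal avoids recursion limits on deep trees
        node, seq = stack.pop()
        for child in node.get("children", []):
            child_seq = apply_mutations(seq, gene_mutations(child, gene))
            yield node["name"], seq, child["name"], child_seq
            stack.append((child, child_seq))
```

With the real file this would be called as `parent_child_pairs(data["tree"], root_spike)` after `data = json.load(...)`, where `root_spike` is the assumed root sequence. Because substitutions and deletions are applied in place and deletions are kept as `-`, every reconstructed sequence has the same length as the root, so no extra padding should be needed as long as insertions are not recorded in the tree.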
Finally, maybe the Auspice tree is not the best data for my purpose. Please let me know if you know of a dataset that is more suitable.