Extract nt and AA mutation info per branch from JSON after augur ancestral/translate/export

Hello! Non-SARS-CoV-2 post - let’s talk about syphilis.

I’m using augur translate to lay synonymous mutations onto my whole genome phylogeny of 242 T pallidum pallidum genomes, most of which I’ve recently assembled and are not public yet. It’s working brilliantly, and the visualization in auspice is fantastic. However, I’d really like to be able to get under the hood a bit more and extract all of the nt and AA mutations per branch. I know it is all in the relevant JSON files, and I can manually-ish parse it after removing some header information to allow reading into R, but I was hoping you all might have some better tools and tips for how to get the information. I’m hopeful I’m just missing an optional argument to include the information in a nexus tree or something! Whatever script is used to project the mutation data interactively onto the branches might be helpful?

Also, I am pretty good in R but dreadful in python, so I may just be displaying my lack of knowledge of existing pipelines that would make this simple… apologies if that is the case!

Just for scale, we are talking on the order of a few thousand nt mutations along the different branches, and several hundred at the AA level.

Happy to share any JSON files or scripts that would be helpful.

Thanks a million!

Hi @nicole.lieberman – excited to hear that Nextstrain is being used for more than viruses!

I’d really like to be able to get under the hood a bit more and extract all of the nt and AA mutations per branch

The output JSON from augur ancestral contains the inferred nucleotide mutations on each branch of the tree, and the output of augur translate has the same structure, but with the mutations translated into amino acid changes for each gene. The branch names themselves should be available in the newick tree file produced by augur refine (don’t use the file from augur tree, as this doesn’t have the names of internal nodes, which you’ll need). All of these files are combined into the final dataset JSON for visualisation, but since you have access to the intermediate files it’s easier to work from them.

Let us know if there’s more we can help with here!

Hi @james, thanks! I think I’ve been able to pull everything I’ve needed from the output json file from augur translate. My challenge has just been that it’s been a bit of a beast to tidy up: text editor to remove a bunch of header material (the part that echoes the .gbk reference) from the json file, delimit it by spaces in excel to allow sort-of appropriate read in to R, then do battle with dplyr a bunch to whittle it down to basically just a table of tip/node, gene, and mutation. So totally do-able, but I just wasn’t sure if you had a better method to deal with it than my hacking away!

After looking further at the data, one question I realize I DO have is what the nt and AA mutations are relative to - are they all to the inferred root sequence, or to the most recently inferred ancestral node? I am assuming the former since that is the only reference sequence included in the translate json output (and in the final augur export root-sequence file)?

Thank you!

It’s been a while since I used R, but I would think you can simply read in the JSON files without any modifications, select the nodes field, and then parse that with dplyr or similar.

what the nt and AA mutations are relative to

Each branch defines mutations relative to its parent. So, for example, when auspice displays “Mutations from root” when you click on a tip, we are traversing the tree and collecting all the mutations between the root (NODE_0000000) and the tip. This approach also needs to be aware of potential reversions and double mutations, etc. Also note that the inferred root sequence is often not the same as the reference!
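To make the "relative to parent" point concrete, here is a minimal sketch of how per-branch mutations could be collapsed into net mutations from the root for one tip. It is not auspice's actual code. The function name `mutations_from_root`, the `parent_of` lookup (assumed to have been built from the augur refine tree) and the toy node names and mutations are all hypothetical; the only thing taken from the node-data JSON is that mutations are strings like A123G.

```python
import re

# Mutations are assumed to be strings such as "A123G":
# parent state, 1-based position, child state.
MUT_PATTERN = re.compile(r"^(\D)(\d+)(\D)$")


def mutations_from_root(tip, parent_of, branch_muts):
    """Collapse per-branch mutations along the path from the root to `tip`."""
    # Walk up from the tip to the root, then reverse to get root-to-tip order.
    path = []
    node = tip
    while node is not None:
        path.append(node)
        node = parent_of.get(node)
    path.reverse()

    net = {}  # position -> (state at the root, current state)
    for node in path:
        for mut in branch_muts.get(node, []):
            match = MUT_PATTERN.match(mut)
            if match is None:
                continue  # skip anything that does not look like "A123G"
            from_state, to_state = match.group(1), match.group(3)
            pos = int(match.group(2))
            # Keep the first-seen parent state so repeated hits are merged.
            root_state = net[pos][0] if pos in net else from_state
            net[pos] = (root_state, to_state)

    # Positions that reverted to the root state are dropped.
    return [f"{a}{pos}{b}" for pos, (a, b) in sorted(net.items()) if a != b]


# Toy example (hypothetical names and mutations, not real data)
parent_of = {
    "NODE_0000000": None,
    "NODE_0000001": "NODE_0000000",
    "tipA": "NODE_0000001",
}
branch_muts = {
    "NODE_0000001": ["A10G", "C20T"],
    "tipA": ["G10A", "T20G", "A30-"],
}
result = mutations_from_root("tipA", parent_of, branch_muts)
print(result)  # ['C20G', 'A30-']
```

In the toy example, position 10 mutates and then reverts, so it disappears from the result, while the two hits at position 20 are merged into a single net change.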

hi James,

OK thanks for the clarification on the mutations relative to its parent!

(I have yet to find a native R or json package that can successfully read in the json files generated by augur since R apparently doesn’t have the flexibility to understand the different structures in different parts of the document - it’s a known limitation of R (who knew!). Hence needing to break it up.)


I did not know that about R! If you are comfortable with python you could use the following small script to extract mutations out per-gene, modifying the script to get the desired format for your use-case in R:

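# filename: parse_node_data.py
import json
import sys

# Usage: python parse_node_data.py <node-data.json> <gene name | nuc>
with open(sys.argv[1]) as fh:
  node_data = json.load(fh)

for name, data in node_data["nodes"].items():
  if sys.argv[2] == "nuc":
    # nucleotide mutations (augur ancestral output)
    muts = ",".join(data.get("muts", []))
  else:
    # amino acid mutations for the named gene (augur translate output)
    muts = ",".join(data.get("aa_muts", {}).get(sys.argv[2], []))
  print(f"{name}\t{muts}")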

Which you can run as follows:

# ENV gene (AA) mutations, per node (relative to parent)
python parse_node_data.py results/aa_muts.json ENV > mutations_ENV.txt
# Nucleotide mutations, per node (relative to parent)
python parse_node_data.py results/nt_muts.json nuc > mutations_nuc.txt
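If a single long-format table of node, gene, and mutation is easier to work with in R, a variation along these lines might help. It is only a sketch: the helper names `node_data_to_rows` and `write_tsv` and the toy input are made up for illustration, and it assumes the same "nodes", "muts" and "aa_muts" fields that the script above reads.

```python
import csv
import io
import json


def node_data_to_rows(node_data):
    """Yield (node, gene, mutation) rows from a parsed node-data JSON."""
    for name, data in node_data.get("nodes", {}).items():
        # Nucleotide mutations (augur ancestral output) are labelled "nuc".
        for mut in data.get("muts", []):
            yield (name, "nuc", mut)
        # Amino acid mutations (augur translate output) are keyed by gene.
        for gene, muts in data.get("aa_muts", {}).items():
            for mut in muts:
                yield (name, gene, mut)


def write_tsv(rows, handle):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["node", "gene", "mutation"])
    writer.writerows(rows)


# Toy input standing in for a real file (hypothetical nodes and mutations)
example = json.loads(
    '{"nodes": {"NODE_0000001": {"aa_muts": {"ENV": ["A12T"], "GAG": []}},'
    ' "tipA": {"muts": ["C20T"]}}}'
)
buffer = io.StringIO()
write_tsv(node_data_to_rows(example), buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

For a real run you would load the file with json.load instead of the toy string and pass an open output file to write_tsv. The resulting table should read into R with read.delim, with no hand-editing of the JSON.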

Oh great! Thank you. I don’t python all that well but have people that can help me :slight_smile:

Much appreciated.