How to parse node data JSONs from the ncov workflow (or any Nextstrain workflow)

We recently got a question from a Nextstrain user about the best way to parse the JSON output of the ncov workflow. For example, the ncov workflow runs augur traits by default to infer the ancestral country for each internal node of the tree. The workflow writes these data to files we refer to as “node data JSONs”. We don’t have a how-to guide for parsing these files yet, but the following code examples show how to get started.

Each node data JSON from Augur contains a dictionary with a nodes key that indexes another dictionary by strain (or internal node) name. From a Python session, the following steps load the ancestral trait values into memory (the example is based on our Nextstrain build for Europe):

>>> from augur.utils import read_node_data
>>> traits = read_node_data("results/europe/traits.json")
>>> traits_by_node_name = traits["nodes"]
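
Once loaded, each entry in traits_by_node_name maps a node name to its inferred values. The following is a minimal sketch that uses only the standard library and a small hand-written JSON string in place of a real traits.json. The node names, the country values, and the assumption that internal node names start with "NODE_" are illustrative, not taken from the Europe build.

```python
import json

# Hypothetical, hand-written stand-in for the contents of a traits JSON.
example_json = """
{
  "nodes": {
    "NODE_0000001": {"country": "Italy"},
    "NODE_0000002": {"country": "Spain"},
    "Italy/EXAMPLE-1/2020": {"country": "Italy"}
  }
}
"""

traits = json.loads(example_json)
traits_by_node_name = traits["nodes"]

# Look up the inferred country for a single node.
print(traits_by_node_name["NODE_0000001"]["country"])

# Collect the inferred country for internal nodes only, assuming (for this
# sketch) that internal node names start with "NODE_".
internal_countries = {
    name: values["country"]
    for name, values in traits_by_node_name.items()
    if name.startswith("NODE_")
}
print(internal_countries)
```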

The traits JSON also stores information about the discrete trait analysis model in a top-level key named models. This key points to a dictionary with one entry for each discrete trait column that augur traits inferred. In the example data above, the only trait inferred was “country”. Indexing the traits data by models and then by country gives us a dictionary with these keys:

>>> traits["models"]["country"].keys()
dict_keys(['alphabet', 'equilibrium_probabilities', 'rate', 'transition_matrix'])
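
As a rough sketch of how these model values might be used, the snippet below pairs each state in alphabet with its equilibrium probability. It assumes that alphabet and equilibrium_probabilities are parallel lists in the same order, which you should confirm against your own traits JSON; the values shown are made up for illustration.

```python
# Hypothetical values shaped like traits["models"]["country"]. This sketch
# assumes "alphabet" and "equilibrium_probabilities" are parallel lists.
country_model = {
    "alphabet": ["Italy", "Spain", "France"],
    "equilibrium_probabilities": [0.5, 0.3, 0.2],
}

# Map each country to its equilibrium probability.
equilibrium_by_country = dict(
    zip(country_model["alphabet"], country_model["equilibrium_probabilities"])
)

# Find the country with the highest equilibrium probability.
most_likely = max(equilibrium_by_country, key=equilibrium_by_country.get)
print(most_likely)
```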

Note that Augur’s read_node_data function accepts multiple JSON filenames as input; it will combine the data from these multiple files into a single dictionary.
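
To illustrate what a combined dictionary looks like, here is a small standard-library sketch that merges the per-node values of two node data dictionaries. It is not Augur’s implementation, only an approximation of the resulting shape; the combine_node_data helper and the clade_membership values are hypothetical.

```python
def combine_node_data(*node_data_dicts):
    """Merge the per-node values of several node data dictionaries.

    Hypothetical helper for illustration; later inputs overwrite earlier
    ones when they define the same key for the same node.
    """
    combined = {"nodes": {}}
    for node_data in node_data_dicts:
        for name, values in node_data.get("nodes", {}).items():
            combined["nodes"].setdefault(name, {}).update(values)
    return combined


# Made-up contents standing in for two separate node data JSON files.
traits = {"nodes": {"NODE_0000001": {"country": "Italy"}}}
clades = {"nodes": {"NODE_0000001": {"clade_membership": "20A"}}}

combined = combine_node_data(traits, clades)
print(combined["nodes"]["NODE_0000001"])
```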

You can use a similar approach to extract these values from R with the jsonlite library like so:

traits <- jsonlite::fromJSON("results/europe/traits.json")
traits_by_node_name <- traits$nodes

A while back, I wrote a script to parse these JSON files into TSV/CSV files. This script is a little out-of-date, so I’ve posted a simpler, updated version on GitHub as a gist with instructions on how to use it. Once you have Augur installed, you can run the script like so:

python3 node_data_to_table.py \
  --tree results/europe/tree.nwk \
  --jsons results/europe/traits.json \
  --include-internal-nodes \
  --annotations build=europe \
  --output traits.tsv

Run the script with the -h flag to get more details about what the different arguments do.

python3 node_data_to_table.py -h
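
If you only need a quick table and would rather not use the script, the kind of conversion involved can be sketched with the standard library. This is not the gist’s code and it ignores the tree entirely; the nodes_to_tsv helper, the strain column name, and the sample data are assumptions made for illustration.

```python
import csv
import io

# Made-up node data shaped like the "nodes" dictionary of a traits JSON.
nodes = {
    "NODE_0000001": {"country": "Italy"},
    "Italy/EXAMPLE-1/2020": {"country": "Italy"},
}


def nodes_to_tsv(nodes, columns):
    """Return a TSV string with one row per node and one column per key."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["strain"] + columns)
    for name, values in nodes.items():
        # Leave the cell empty when a node lacks a value for a column.
        writer.writerow([name] + [values.get(column, "") for column in columns])
    return buffer.getvalue()


table = nodes_to_tsv(nodes, ["country"])
print(table)
```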