List of clade-like attributes for each organism

Hi,

Is there a list somewhere of the clade-like attributes specific to each organism?
The documentation ("Analysis results (tabular)" in the Nextclade documentation) only mentions that “The table can contain additional columns for every clade-like attribute defined in reference tree in meta.extensions.clade_node_attrs and in the node attributes”.

How do I find these attributes? I noticed that this info is contained in the tree.json file (for example, for RSV-A: nextclade_data/data/nextstrain/rsv/a/EPI_ISL_412866/tree.json on the master branch of the nextstrain/nextclade_data repository on GitHub), but I could not find any info about these columns in the documentation.

Do I just need to parse this info from the json files? Or did I miss something in the documentation?
Can I be sure that the names of the clade-like columns will remain the same in the future?

Thank you in advance. Best,
Tomas

Hi @tomas_gnt,

I cannot find any documentation on these dataset-specific columns either. I was able to find them by running the example data with the reference dataset on clades.nextstrain.org and then exporting the dataset-specific columns.

@corneliusroemer, do you have better pointers on where these columns are documented?

Best,
Jover

Hi Tomas @tomas_gnt,

As mentioned in the documentation, Nextclade produces these columns dynamically from the information in the input dataset, more specifically from the reference tree (the tree.json file). Each attribute is treated like an extra clade, and its result becomes an additional entry in the output (e.g. an additional column in the output TSV file, as Jover suggested above). There is no comprehensive list of these attributes, and there cannot be one: the presence and names of these additional fields depend on the pathogen and on the decisions the dataset’s authors made for a particular version of that dataset.

Or did I miss something in the documentation?

Probably not. There is not much to document about it in Nextclade itself, because the software does not control these attributes: Nextclade is agnostic to pathogen specifics. Perhaps our science team could document the additional attributes in the README.md of each official dataset. But again, for community datasets this will be up to the individual dataset authors.

Do I just need to parse this info from the json files?

It really depends on what you want to achieve. This is additional information, and it’s only a tiny portion of the output. Most Nextclade users can probably live without it.

If you want to process all the available information, you can treat any extra outputs as potentially missing (null). For example, if you process the output TSV with pandas and a column is missing, set its values to None.
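A minimal sketch of that idea, assuming pandas is installed. The column names in EXPECTED_ATTRS and the sample rows are made up for illustration; the real names depend on the dataset you use.

```python
import io

import pandas as pd

# Hypothetical clade-like column names; replace with the ones you care about.
EXPECTED_ATTRS = ["example_attr_a", "example_attr_b"]


def read_nextclade_tsv(source, expected_attrs=EXPECTED_ATTRS):
    """Read a Nextclade TSV and make sure every expected column exists."""
    df = pd.read_csv(source, sep="\t", dtype=str)
    for col in expected_attrs:
        if col not in df.columns:
            # The dataset did not emit this attribute: treat it as missing.
            df[col] = None
    return df


# Made-up sample output that contains only one of the two expected columns.
sample = "seqName\tclade\texample_attr_a\nseq1\tX\tX1\n"
df = read_nextclade_tsv(io.StringIO(sample))
print(df[["seqName", "clade"] + EXPECTED_ATTRS])
```

Downstream code can then use `df["example_attr_b"].isna()` to tell "not provided by this dataset" apart from a real value, without failing on a missing column.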

If you absolutely need to know all the attributes in advance, then yes: you could enumerate the names of official and community datasets with nextclade dataset list --only-names, then download each dataset with nextclade dataset get --name=${name} ...., and then use, for example, jq or a Python script to extract the attribute names from each of the downloaded tree.json files. You can also get the same information from our data repo, where the dataset releases are prepared (GitHub - nextstrain/nextclade_data: Datasets for https://github.com/nextstrain/nextclade), but this is an implementation detail, subject to change, and I would not recommend relying on it.
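The extraction step could look roughly like the sketch below. It rests on assumptions: the documentation quoted above gives the location as meta.extensions.clade_node_attrs, and the sketch additionally tries a "nextclade" key nested under extensions in case a given tree.json is laid out that way. It also assumes each entry is an object with a "name" key. Check one of your own tree.json files before relying on it.

```python
import json


def clade_like_attr_names(tree):
    """Return the names of clade-like attributes declared in a tree.json dict."""
    extensions = tree.get("meta", {}).get("extensions", {})
    nested = extensions.get("nextclade")
    candidates = [
        extensions.get("clade_node_attrs"),  # path as quoted from the docs
        nested.get("clade_node_attrs") if isinstance(nested, dict) else None,  # assumed alternative
    ]
    for attrs in candidates:
        if isinstance(attrs, list):
            return [a["name"] for a in attrs if isinstance(a, dict) and "name" in a]
    return []


def clade_like_attr_names_from_file(path):
    """Same as above, but reads a downloaded tree.json from disk."""
    with open(path, encoding="utf-8") as f:
        return clade_like_attr_names(json.load(f))


# Made-up tree fragment for illustration only.
example_tree = {
    "meta": {"extensions": {"clade_node_attrs": [{"name": "example_attr"}]}}
}
print(clade_like_attr_names(example_tree))
```

Running `clade_like_attr_names_from_file` over each downloaded dataset directory would give you one list of names per dataset.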

If you want a stable list of attributes (for example if you require reproducible experiment results), you can stick to a particular version of a dataset, but this way you won’t receive pathogen-specific updates.

Can I be sure that the names of the clade-like columns will remain the same in the future?

No, but these changes are versioned for each dataset, and should be reflected in the dataset’s changelog. Also, we are unlikely to break things unless we think this is absolutely necessary, and if it happens, we try to minimize impact and to provide a migration path. But we cannot control all the existing datasets in the wild, only the ones you see in the nextclade dataset list or in the data repo.
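If you pin a dataset version and later consider upgrading, one way to spot renamed or dropped columns is to compare the attribute names of the two versions. This is only a sketch; the helper name and the attribute names are hypothetical, and how you obtain the two lists is up to you.

```python
def diff_attr_names(old_names, new_names):
    """Report which attribute names disappeared or appeared between versions."""
    old, new = set(old_names), set(new_names)
    return {"removed": sorted(old - new), "added": sorted(new - old)}


# Made-up attribute names for two versions of the same dataset.
changes = diff_attr_names(["attr_a", "attr_b"], ["attr_b", "attr_c"])
print(changes)  # {'removed': ['attr_a'], 'added': ['attr_c']}
```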

Let us know what your use-case is, and whether you have any ideas on how to improve Nextclade.