Hi Tomas @tomas_gnt,
As mentioned in the documentation, these columns are produced by Nextclade dynamically from the information in the input dataset, more specifically from the reference tree (`tree.json` file). Each of the attributes is treated like an extra clade, and the result of each attribute is an additional entry in the output (e.g. an additional column in the output TSV file, like Jover suggested above). There is no comprehensive list of these attributes, and there cannot be one: the presence and names of these additional fields depend on the pathogen and on the decisions of the dataset’s authors for a particular version of the dataset.
Or did I miss something in the documentation?
Probably not. There is not much to document about it in Nextclade itself, as the software does not control these attributes: Nextclade is agnostic to pathogen specifics. Perhaps our science team could document the additional attributes in the `README.md` of each official dataset. But again, for community datasets this will be up to the individual dataset authors.
Do I just need to parse this info from the json files?
It really depends on what you want to achieve. This is additional information, and it’s only a tiny portion of the output. Most Nextclade users can probably live without it.
If you want to process all the available information, you can treat any extra outputs as potentially missing (null). For example, if you process the output TSV with `pandas` and a column is missing, set its value to `None`.
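As a rough illustration of the "treat it as potentially missing" idea, here is a minimal sketch. The column names `example_attr_a` and `example_attr_b` and the helper `read_nextclade_tsv` are made up for this example; substitute whatever attributes your dataset actually produces.

```python
import io

import pandas as pd

# Hypothetical attribute columns, for illustration only.
# Real names depend on the dataset and its version.
OPTIONAL_COLUMNS = ("example_attr_a", "example_attr_b")


def read_nextclade_tsv(source, optional_columns=OPTIONAL_COLUMNS):
    """Read a Nextclade TSV and add any absent optional column, filled with None."""
    df = pd.read_csv(source, sep="\t", dtype=str)
    for column in optional_columns:
        if column not in df.columns:
            df[column] = None
    return df


# Tiny in-memory stand-in for a real output file, which has many more columns.
sample_tsv = "seqName\tclade\texample_attr_a\nseq1\t21L\tfoo\n"
df = read_nextclade_tsv(io.StringIO(sample_tsv))
print(df[["seqName", "example_attr_a", "example_attr_b"]])
```

Downstream code can then use `df["example_attr_b"]` unconditionally and check for missing values with `isna()`, instead of branching on whether the column exists.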
If you absolutely need to know all the attributes in advance, then yes, you could enumerate the names of official and community datasets with `nextclade dataset list --only-names`, then download each dataset with `nextclade dataset get --name=${name} ...`, and then use, for example, `jq` or a Python script to extract the attribute names from each of the downloaded `tree.json` files. You can also get the same information from our data repo, where the dataset releases are prepared (nextstrain/nextclade_data on GitHub, datasets for https://github.com/nextstrain/nextclade), but this is an implementation detail, subject to change, and I would not recommend relying on it.
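For the Python route, something along these lines might work once the datasets are downloaded. It is only a sketch, and it assumes that the attributes are declared under `meta.extensions.nextclade.clade_node_attrs` in `tree.json`, each entry having a `name` field; please verify that against the dataset version you actually use. The function names and the directory layout (one subdirectory per downloaded dataset) are also my own assumptions.

```python
import json
from pathlib import Path


def clade_attr_names(tree: dict) -> list[str]:
    """Return the names of clade-like attributes declared in a parsed tree.json.

    Assumption: they are listed under meta.extensions.nextclade.clade_node_attrs.
    Trees without that section yield an empty list.
    """
    nextclade_ext = tree.get("meta", {}).get("extensions", {}).get("nextclade", {})
    attrs = nextclade_ext.get("clade_node_attrs") or []
    return [attr["name"] for attr in attrs if "name" in attr]


def collect_attr_names(datasets_dir: str) -> dict[str, list[str]]:
    """Map each dataset directory (relative to datasets_dir) to its attribute names."""
    root = Path(datasets_dir)
    result = {}
    for tree_path in sorted(root.rglob("tree.json")):
        with tree_path.open(encoding="utf-8") as f:
            tree = json.load(f)
        dataset = tree_path.parent.relative_to(root).as_posix()
        result[dataset] = clade_attr_names(tree)
    return result
```

Calling `collect_attr_names("datasets")` on a directory of downloaded datasets would return a dictionary keyed by dataset path. Datasets that declare no extra attributes map to an empty list.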
If you want a stable list of attributes (for example if you require reproducible experiment results), you can stick to a particular version of a dataset, but this way you won’t receive pathogen-specific updates.
Can I be sure that the names of the clade-like columns will remain the same in the future?
No, but these changes are versioned for each dataset and should be reflected in the dataset’s changelog. Also, we are unlikely to break things unless we think it is absolutely necessary, and if that happens, we try to minimize the impact and provide a migration path. But we cannot control all the datasets that exist in the wild, only the ones you see in `nextclade dataset list` or in the data repo.
Let us know what your use-case is, and whether you have any ideas on how to improve Nextclade.