How can I know what columns will be in nextclade output?

I am trying to update the Galaxy wrapper for nextclade. The current (very outdated) wrapper maps dataset name to column names to use in the output and producing this list was a manual process. So: is there a way to computationally get a list of the columns that will be output for a given dataset in nextclade?

Thanks!

Hi Peter,
I don’t think there is a canonical way to do that. The output columns depend on what is specified in the pathogen.json and on whether any other clade-like attributes are specified in the tree.

One could in principle parse this out of the file in the dataset. But the easiest would probably be to run nextclade on the reference sequence and use the header of the resulting TSV file to generate the list of output-columns.
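A minimal sketch of that approach in Python, assuming the dataset has already been downloaded to a local directory, that its reference is stored there as `reference.fasta`, and that `nextclade` is on `PATH`. The helper names `tsv_header_columns` and `columns_for_dataset` are hypothetical, not part of nextclade.

```python
import csv
import subprocess
import tempfile
from pathlib import Path


def tsv_header_columns(tsv_path):
    """Return the column names from the first line of a TSV file."""
    with open(tsv_path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


def columns_for_dataset(dataset_dir, reference_name="reference.fasta"):
    """Run nextclade on the dataset's own reference and return the TSV header.

    Assumption: the reference lives at <dataset_dir>/reference.fasta;
    pass another `reference_name` if the dataset uses a different file name.
    """
    dataset_dir = Path(dataset_dir)
    with tempfile.TemporaryDirectory() as tmp:
        out_tsv = Path(tmp) / "nextclade.tsv"
        subprocess.run(
            [
                "nextclade", "run",
                "--input-dataset", str(dataset_dir),
                "--output-tsv", str(out_tsv),
                str(dataset_dir / reference_name),
            ],
            check=True,  # raise if nextclade exits with an error
        )
        return tsv_header_columns(out_tsv)
```

Only the header line is read, so the result does not depend on how the reference sequence itself is scored.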

I also replied on ubioinfo Slack here:

Thanks Richard and Ivan

I’ve written a script that tries to infer columns from the output of running nextclade. The comment about dynamic columns confuses me a bit though, so just to be clear: for the same dataset will the columns output be the same for every sequence processed? Or could some feature in the sequence change the columns being output?

Best regards,
Peter

I believe that the set of columns should be the same across multiple runs if they use:

  • the same dataset and dataset version (i.e. dataset files are exactly the same)
  • the same nextclade version
  • the same value of --output-columns-selection arg

Currently, if these requirements are met, then the set of columns will be the same for any input sequences.
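If a wrapper wants to cache a column list, one way to use the three conditions above is to derive a cache key from them. This is only a sketch: `column_cache_key` is a hypothetical helper, and hashing every file in the dataset directory is an assumption about what "exactly the same dataset files" should mean in practice.

```python
import hashlib
from pathlib import Path


def column_cache_key(dataset_dir, nextclade_version, columns_selection):
    """Fingerprint the three inputs that determine the set of output columns."""
    digest = hashlib.sha256()
    digest.update(nextclade_version.encode())
    digest.update(b"\0")
    digest.update(columns_selection.encode())
    # Hash every dataset file in a stable order, so that any change to the
    # dataset contents (or file names) changes the key.
    root = Path(dataset_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(b"\0")
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

A cached column list would then be reused only while the key stays the same.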

P.S. “Dynamic” is a bit of dev slang on my part: it means columns whose presence is decided at runtime, after parsing the dataset files (tree.json and pathogen.json), rather than being hardcoded in the software. Dataset authors decide whether these columns are present.

OK, thanks, this makes sense. The PR I am working on is galaxyproject/tools-iuc Pull Request #7161 (“Update nextclade to version 3.15.3”), which includes the datasets_to_macros.py script. This script captures what you said: it records the nextclade version and produces a column specification given the dataset and its reference sequence.

I have yet to add support for --output-columns-selection but now that I have an automated way to get the dataset to column mapping for a particular nextclade version, adding support for selecting columns should be straightforward.

As an aside, I had to special-case the “Influenza A H1N1pdm HA” and “Influenza A H3N2 HA” datasets because each name is used twice, with two different references (essentially a recent reference and an older one for the “broad” version). It would be nice not to need this special case. :)

I have almost the same question, but a bit narrower: for a given dataset, I want to know the names of all clade-ish output columns, and only those. (Then I can run nextclade run with --output-columns-selection seqName,clade,<other clade-ish columns>.) It looks like those column names can be predicted from the output of nextclade dataset list --json – and then I don’t have to run nextclade run and do heuristics on the column names to identify the clade-ish/lineage-ish column names.

For example, the .capabilities part of the JSON looks like this for dataset nextstrain/flu/h1n1pdm/ha/CY121680:

    "capabilities": {
      "clades": 25,
      "customClades": {
        "proposedSubclade": 26,
        "short-clade": 16,
        "subclade": 26
      },

Since .capabilities.clades is present, I can expect an output column clade. The contents of .capabilities.customClades implies that the output will also have columns short-clade, subclade and proposedSubclade.
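The mapping described above could be sketched in Python like this. The function names are hypothetical, and the shape of the `capabilities` object is taken only from the excerpt shown here, so treat it as an assumption rather than a documented format.

```python
def clade_columns_from_capabilities(capabilities):
    """Predict the clade-like output columns from a dataset's capabilities."""
    columns = []
    if "clades" in capabilities:
        columns.append("clade")  # built-in clade column
    # Each key of customClades is expected to appear as its own column.
    columns.extend(capabilities.get("customClades") or {})
    return columns


def columns_selection_arg(capabilities):
    """Build a value for --output-columns-selection: seqName plus clade columns."""
    return ",".join(["seqName"] + clade_columns_from_capabilities(capabilities))


# Capabilities copied from the example above.
capabilities = {
    "clades": 25,
    "customClades": {"proposedSubclade": 26, "short-clade": 16, "subclade": 26},
}
print(columns_selection_arg(capabilities))
# prints: seqName,clade,proposedSubclade,short-clade,subclade
```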

I verified that particular example by running nextclade run on the dataset nextstrain/flu/h1n1pdm/ha/CY121680 with its reference sequence.

Is that a reasonable approach, or is the nextclade dataset list --json output likely to change in a way that would invalidate it?

Hi Angie @AngieHinrichs

You can bypass nextclade entirely and list clade-like attrs directly from the ref tree json metadata:

    jq '.meta.extensions.nextclade.clade_node_attrs[].name' tree.json
    "Nextclade_pango"
    "partiallyAliased"
    "clade_nextstrain"
    "clade_who"
    "clade_display"

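For scripts that would rather not shell out to jq, the same lookup can be sketched in Python. `clade_like_attr_names` is a hypothetical helper; unlike the jq filter, it returns an empty list when the tree has no `clade_node_attrs` entry.

```python
import json


def clade_like_attr_names(tree_json_path):
    """List the clade-like attribute names declared in a dataset's tree.json."""
    with open(tree_json_path) as handle:
        tree = json.load(handle)
    # Same path as the jq filter: .meta.extensions.nextclade.clade_node_attrs[].name
    attrs = (
        tree.get("meta", {})
        .get("extensions", {})
        .get("nextclade", {})
        .get("clade_node_attrs", [])
    )
    return [attr["name"] for attr in attrs]
```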
nextclade dataset list --json shows you parts of the index.json file. For production use, the same index.json is deployed at https://data.clades.nextstrain.org/v3/index.json. This file tells the nextclade software which datasets exist in our data repo.

The capabilities field you see is derived from the same tree json metadata in the nextclade_data’s CI script called rebuild. Then it also counts unique values for each attr on tree nodes - that’s where the numbers in parentheses come from.

You can bypass all that and get this data from the tree file directly. Other dynamic attributes come from pathogen.json in a very similar way.

Currently all these JSON formats are considered unstable, i.e. we haven’t had time to design and maintain proper stable JSON formats. These JSONs are a dump of the internal structs of the nextclade Rust code (via the serde crate). This means that any code modification risks changing the structs, and with them the JSON dumps.

I am thinking of adding JSON schema definitions so that people can at least generate parsers and validate data. Would that help?

Let us know, folks, how we can simplify this for you. It could be as simple as amending the rebuild script in the data repo, or maybe a new feature in the nextclade software. Feel free to also contribute directly to the nextclade or nextclade_data repos if you have the time and energy.

Perhaps there could be a better way where you don’t need to know a list of all columns? Or maybe nextclade can output some missing piece of data to hint on what’s available?

It seems the approach where you query dataset files (tree and pathogen json) is more meaningful, because we try to keep datasets format more or less stable - we can add new things, but always hesitate to remove or change existing things.

This is needed to ensure compatibility with datasets people create in random places on the internet, outside of our data repo and even without our knowledge, or even outside internet, privately.

This means if you process dataset files directly, this code could last longer (probably until next major version of nextclade) compared to code which parses json outputs from nextclade, which are super volatile.

Any discussions or contributions regarding more stable rich formats like json (yaml, protobuf, and what have you) are very welcome. We don’t always know what downstream projects are doing, so it’s hard for us to design these formats well. Also, lack of resources as mentioned.

Thanks @ivan-aksamentov! It’s good to know that the dataset format is more stable than the index and internal structures.

I probably would not make use of a schema. (@pvanheus ?)

Perhaps there could be a better way where you don’t need to know a list of all columns?

I need a machine-readable set of only the clade/lineage/etc. columns so my script can formulate a --output-columns-selection to pass to nextclade run (and to use in visualization downstream). I think Peter needs all columns so that he can let his software know what to expect in the absence of --output-columns-selection, but I’d better let Peter speak to that. :)

Or maybe nextclade can output some missing piece of data to hint on what’s available?

There’s enough data to figure it out. Thanks for being open to the possibility of a PR though.

I should have mentioned that I am planning to use nextclade dataset list --search, and would like to keep only the resulting datasets that include at least some dataset annotations. So for example if my user’s search term is “influenza” and I run nextclade dataset list --search flu then I would want to include most of the resulting datasets, but not nextstrain/flu/h3n2/pb1 and other QC-only datasets. I could download 27 datasets and inspect the tree.json in each one – but I might just take my chances with the JSON output (or perhaps index.json, thanks for the pointer). :)
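A rough sketch of that filter in Python, under two assumptions that are not confirmed anywhere in this thread: that the parsed JSON is a list of dataset objects each carrying a `path` and a `capabilities` entry, and that QC-only datasets simply lack `clades` and `customClades` there. The function names are hypothetical.

```python
def has_clade_annotations(dataset):
    """True if the dataset entry advertises clades or custom clade-like attributes."""
    capabilities = dataset.get("capabilities") or {}
    return "clades" in capabilities or bool(capabilities.get("customClades"))


def datasets_with_clades(datasets):
    """Return the paths of the dataset entries that have clade annotations."""
    # Assumption: every entry has a "path" key naming the dataset.
    return [d["path"] for d in datasets if has_clade_annotations(d)]
```

Given the instability of the JSON output mentioned earlier in the thread, it would be worth failing loudly if the expected keys are missing rather than silently returning an empty list.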

(I’m interested in only the datasets that have clade annotations, and only the clade output columns, because I’m developing a tool intended to make it as easy as possible to build an UShER tree for any virus that has genomes in GenBank/INSDC and at least one RefSeq sequence. So if the user types “dengue” then I search for that in NCBI taxonomy, then use the taxonomy ID to search for RefSeqs and GenBank genomes, then align GenBank genomes to the selected RefSeq using nextclade of course, then build a tree. I’d like to also search for Nextclade datasets that could be used to annotate clades/lineages/etc.)

I am indeed using all columns and, in the latest code, picking them up on the fly. Because datasets can change after the Galaxy tool is published, I cannot make use of --output-columns-selection because the list of columns to select from might be out of sync with what is in the dataset. That said, perhaps a “minimal set” would be a useful thing to support.

BTW just mentioning again that there are two flu datasets with the same description. I’ll raise that as an issue on the datasets repo.