`nextclade sort` Using a serverless local dataset for sorting

Greetings everyone!

I see that nextclade sort accepts a --server flag for querying a hosted dataset collection to sort input sequences. However, this would be a bit awkward for my use case on a local machine.

Is it currently possible to ask nextclade sort to reference a local directory on my machine that is structured according to the Nextclade dataset guidelines? We’d like to avoid running a custom local dataset server for now.

Thanks!

Context: We have many custom datasets for influenza B genome segment builds. Nextclade sort would be an excellent way to automate our internal Nextstrain builds, but the official influenza B Nextclade datasets for non-HA/NA segments do not appear to be regularly maintained.

Hi @eakin

Nextclade dev here.

Happy to see there’s interest for the sort command.

I believe the only thing the sort command needs from the server is the so-called “minimizer index” file. You can download it and provide it using --input-minimizer-index-json (shortcut -m).

$ nextclade sort --help
...
  -m, --input-minimizer-index-json <INPUT_MINIMIZER_INDEX_JSON>
          Path to input minimizer index JSON file.

          By default, the latest reference minimizer index is fetched from the dataset server (default or customized with `--server` argument). If this argument is provided, the algorithm skips fetching the default index and uses the index provided in the JSON file.

          Supports the following compression formats: "gz", "bz2", "xz", "zst". Use "-" to read uncompressed data from standard input (stdin).
...

The “minimizer index json” is a large JSON file which contains a map of dataset names to the “minimizers” - hashed sequence fragments.
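To make “hashed sequence fragments” a bit more concrete, here is a toy Python sketch of the general idea: hash short fragments of each reference, store them per dataset name, and pick the dataset that shares the most hashes with a query. This is only an illustration and not Nextclade's actual algorithm or file format. The fragment length, the use of SHA-1, keeping every fragment (a real index presumably keeps only a subset), the shared-hash scoring, and all function names are assumptions I made up for the example.

```python
import hashlib
from typing import Dict, Optional, Set, Tuple

K = 8  # fragment length; an arbitrary illustrative value, not Nextclade's


def fragment_hashes(seq: str, k: int = K) -> Set[str]:
    """Hash every k-length fragment of a sequence (toy stand-in for minimizers)."""
    seq = seq.upper()
    return {
        hashlib.sha1(seq[i:i + k].encode("ascii")).hexdigest()
        for i in range(len(seq) - k + 1)
    }


def build_toy_index(references: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each dataset name to the fragment hashes of its reference sequence."""
    return {name: fragment_hashes(seq) for name, seq in references.items()}


def best_dataset(query: str, index: Dict[str, Set[str]]) -> Tuple[Optional[str], int]:
    """Return the dataset sharing the most hashes with the query, and that count.

    Returns (None, 0) when the query shares no hashes with any dataset.
    """
    query_hashes = fragment_hashes(query)
    best_name, best_score = None, 0
    for name, hashes in index.items():
        score = len(query_hashes & hashes)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

With two hypothetical references, calling `best_dataset(query, build_toy_index({"A": seq_a, "B": seq_b}))` returns the name of whichever reference has the most fragments in common with `query`.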

Release versions of Nextclade simply download it from https://data.clades.nextstrain.org/v3/minimizer_index.json

You can also find the latest (unstable) version of minimizer_index.json in the data repo’s data_output/ directory here (which is the current snapshot of the dataset server), on the master branch.

The stable version is in the same place, but on the release branch.

So what you can do is download the minimizer_index.json file on an internet-enabled machine:

curl -fsSLo minimizer_index.json https://data.clades.nextstrain.org/v3/minimizer_index.json

and then provide it to the sort command like this on your offline machine:

nextclade sort -m minimizer_index.json ...

I believe in this case Nextclade CLI won’t make any network requests. If I understood correctly, that’s your goal.
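Before copying the file to the offline machine, it may be worth confirming that the download is complete and parses as JSON. The sketch below does that and then assembles the argument list for the sort command. It only uses the `-m` flag mentioned above; everything after it is passed through untouched, so you supply your own input and output arguments. The function names are hypothetical, and "zst" is left out because reading it from Python would need a third-party package.

```python
import bz2
import gzip
import json
import lzma
from pathlib import Path

# Openers for the compression formats the Python standard library can read.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def load_index(path):
    """Parse a (possibly compressed) minimizer index file and return the JSON data.

    Raises an exception if the file is missing, truncated, or not valid JSON.
    """
    path = Path(path)
    opener = OPENERS.get(path.suffix, open)
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def sort_command(index_path, extra_args):
    """Build the argument list for `nextclade sort -m <index> ...`.

    The index is parsed first so a broken download fails here rather than later.
    `extra_args` holds whatever input/output arguments you would normally pass.
    """
    load_index(index_path)
    return ["nextclade", "sort", "-m", str(index_path), *extra_args]
```

The returned list can be handed to `subprocess.run(..., check=True)` on the offline machine, for example `sort_command("minimizer_index.json", my_args)`, where `my_args` is a placeholder for your own arguments.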

For advanced use-cases, e.g. if you have your own datasets and references and you want to be able to detect/sort based on them, you could generate your own minimizer index file and use it offline or on your own dataset server. In our data repo the minimizer index is prepared as a part of the rebuild script here. And the prototype of the sorting/detection algo is here - it’s the same thing the Nextclade CLI does in Rust, but rewritten in Python (for dev/testing/prototyping purposes).

If you already have datasets organized in a directory structure similar to the nextclade_data repo, then you could run:

./scripts/rebuild --input-dir 'data/' --output-dir 'data_output/' --no-pull

This should produce data_output/minimizer_index.json for the sort command, as well as data_output/index.json for the dataset list and dataset get commands, and everything else required for a custom Nextclade dataset server to be operational.
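A quick way to confirm the rebuild produced the two files named above is a small check like the following. It is only a sketch: the function name is hypothetical, and it assumes the output directory is `data_output/` as in the command above. It does not validate the contents of the files.

```python
from pathlib import Path

# The two output files mentioned in the text above.
EXPECTED_OUTPUTS = ("minimizer_index.json", "index.json")


def missing_outputs(output_dir="data_output"):
    """Return the names of expected rebuild outputs that are not present."""
    root = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]
```

An empty list from `missing_outputs()` means both files exist.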

Feel free to give the -m parameter a try and let me know if it works for you and whether you have ideas on how to improve things.


Actually, I just noticed that the scripts/minimizer program can produce a minimizer index JSON too. So you might not even need the rebuild script with all its complexity if you just want the sort command. But you would need to check whether the JSON it produces is in the correct format. We don’t normally use scripts/minimizer; it’s more of a side thing. In any case, I hope this Python code gives some idea or a starting point.

Another idea that just popped into my mind is that Nextclade CLI could in principle take a FASTA file with reference sequences and generate the index by itself, without external files, either as a separate command or at the beginning of the sort command. That would be an obvious UX improvement. I cannot guarantee that it will be done though, as we are focused on some other projects right now. Contributions are super welcome! (I made an issue here)
