Hi @eakin
Nextclade dev here.
Happy to see there’s interest in the sort command.
I believe the only thing the sort command needs from the server is the so-called “minimizer index” file. You can download it and provide it using --input-minimizer-index-json (shortcut -m).
$ nextclade sort --help
...
-m, --input-minimizer-index-json <INPUT_MINIMIZER_INDEX_JSON>
Path to input minimizer index JSON file.
By default, the latest reference minimizer index is fetched from the dataset server (default or customized with `--server` argument). If this argument is provided, the algorithm skips fetching the default index and uses the index provided in the JSON file.
Supports the following compression formats: "gz", "bz2", "xz", "zst". Use "-" to read uncompressed data from standard input (stdin).
...
The “minimizer index json” is a large JSON file which contains a map of dataset names to the “minimizers” - hashed sequence fragments.
Release versions of Nextclade simply download it from https://data.clades.nextstrain.org/v3/minimizer_index.json
You can also find the latest (unstable) version of minimizer_index.json in the data repo’s data_output/ directory here (which is the current snapshot of the dataset server) on the master branch. The stable version is in the same place, but on the release branch.
So what you can do is download the minimizer_index.json file on an internet-enabled machine:
curl -fsSLo minimizer_index.json https://data.clades.nextstrain.org/v3/minimizer_index.json
and then provide it to the sort command like this on your offline machine:
nextclade sort -m minimizer_index.json ...
I believe in this case Nextclade CLI won’t make any network requests. If I understood correctly, that’s your goal.
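The two steps above can be wrapped in a small script. This is only a sketch: the function names fetch_minimizer_index and looks_like_json_object are made up for illustration, and the check assumes the index is a JSON object (a map), as described above. It only catches obvious problems such as an empty file or an HTML error page saved in place of the index; it does not validate the contents.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers, not part of
# Nextclade. The URL and the curl flags are the ones used in this post.

# Run on the internet-enabled machine: download the index to the given
# path (defaults to minimizer_index.json in the current directory).
fetch_minimizer_index() {
  fmi_out="${1:-minimizer_index.json}"
  curl -fsSLo "$fmi_out" "https://data.clades.nextstrain.org/v3/minimizer_index.json"
}

# Cheap sanity check before copying the file to the offline machine:
# the file must be non-empty and its first non-whitespace character
# (within the first 64 bytes) must be '{', i.e. it looks like a JSON object.
looks_like_json_object() {
  ljo_file="$1"
  [ -s "$ljo_file" ] || return 1
  ljo_first="$(head -c 64 "$ljo_file" | tr -d ' \t\r\n' | cut -c 1)"
  [ "$ljo_first" = "{" ]
}
```

On the online machine you would call fetch_minimizer_index and then looks_like_json_object minimizer_index.json, copy the file over, and run nextclade sort -m minimizer_index.json ... on the offline machine as shown above.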
For advanced use-cases, e.g. if you have your own datasets and references and you want to be able to detect/sort based on them, you could generate your own minimizer index file and use it offline or on your own dataset server. In our data repo, the minimizer index is prepared as part of the rebuild script here. The prototype of the sorting/detection algorithm is here: it’s the same thing the Nextclade CLI does in Rust, but rewritten in Python (for dev/testing/prototyping purposes).
If you already have datasets organized in a dir structure similar to the nextclade_data repo, then you could run:
./scripts/rebuild --input-dir 'data/' --output-dir 'data_output/' --no-pull
This should produce data_output/minimizer_index.json for the sort command, as well as data_output/index.json for the dataset list and dataset get commands, and everything else required for a custom Nextclade dataset server to be operational.
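After a rebuild it can be handy to confirm that both index files were actually written before pointing Nextclade at them. A minimal sketch, assuming the output directory is data_output/ as in the command above; check_rebuild_outputs is a hypothetical helper, not something that ships with the data repo:

```shell
#!/bin/sh
# Sketch only: check_rebuild_outputs is a hypothetical helper, not part of
# the data repo. The two file names are the ones mentioned in this post.

# Succeeds only if both index files exist and are non-empty in the given
# output directory (default: data_output). Otherwise reports the first
# missing or empty file on stderr and returns 1.
check_rebuild_outputs() {
  cro_dir="${1:-data_output}"
  for cro_name in minimizer_index.json index.json; do
    if [ ! -s "$cro_dir/$cro_name" ]; then
      echo "missing or empty: $cro_dir/$cro_name" >&2
      return 1
    fi
  done
}
```

If the check passes, the custom index can be given to the sort command the same way as the downloaded one: nextclade sort -m data_output/minimizer_index.json ...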
Feel free to give the -m parameter a try and let me know if it works for you and whether you have ideas on how to improve things.