Dear Adam,
Nextclade uses so-called datasets to get information about a particular species. A dataset is simply a directory with some files describing the organism. We periodically update both the software and the datasets, independently of each other.
So, the lineages you get will depend on what version of Nextclade you are using and what version of what dataset you are using, which in turn depends on when and how Nextclade is being run.
You can find the dataset version you are using in Nextclade Web in the “Updated at” line of the selected dataset info. If you are using Nextclade CLI, then all versions (we also call them “tags”) can be listed with `nextclade dataset list`, and you can download a particular version with `nextclade dataset get --name=... --tag=...` (the latest tag is downloaded if `--tag` is omitted).
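To make the difference between "latest" and "pinned" downloads concrete, here is a minimal Python sketch that only builds the `nextclade dataset get` command line. The helper name `dataset_get_command`, the dataset name `sars-cov-2`, the output directories and the tag placeholder are illustrative assumptions, not values taken from your setup; take a real tag from the output of `nextclade dataset list`.

```python
import shlex
from typing import List, Optional


def dataset_get_command(name: str, output_dir: str, tag: Optional[str] = None) -> List[str]:
    """Build the argument list for `nextclade dataset get`.

    When `tag` is None, the --tag flag is left out, so Nextclade
    downloads the latest version of the dataset.
    """
    cmd = ["nextclade", "dataset", "get", f"--name={name}", f"--output-dir={output_dir}"]
    if tag is not None:
        # Pin a specific dataset version for reproducible results.
        cmd.append(f"--tag={tag}")
    return cmd


# Illustrative values only: the dataset name and the tag are placeholders.
latest = dataset_get_command("sars-cov-2", "data/sars-cov-2")
pinned = dataset_get_command("sars-cov-2", "data/sars-cov-2-pinned", tag="TAG_FROM_DATASET_LIST")

# Print the commands so they can be inspected or pasted into a shell.
print(shlex.join(latest))
print(shlex.join(pinned))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where Nextclade CLI is installed.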
For each dataset, there is a CHANGELOG.md document with a history of changes for this particular dataset. For SC2 datasets, when a lineage is added to the dataset, the changelog mentions this lineage and its designation date. You can find the CHANGELOG.md file in the downloaded dataset directory if using CLI, or in the “History” tab of the “Dataset” page in Web. It is also available here: Releases · nextstrain/nextclade_data · GitHub. The last release of all 5 official SC2 datasets was 2 weeks ago.
It could be that you are using an old dataset and/or old software. After a quick search, we noticed that dataset versioning is discussed briefly in the Cecret docs: GitHub - UPHL-BioNGS/Cecret: Reference-based consensus creation
But from a quick inspection I cannot tell which version of the Nextclade software, which dataset, and which version of that dataset are in use.
For the best results, we recommend using the latest version of Nextclade (at the moment this is 3.9.1) and always downloading the latest dataset. If you need reproducible/comparable results across runs, then you could use the same version of Nextclade CLI and use the `--tag` option when downloading the dataset (or simply keep the same dataset directory around).
We cannot provide support for older versions, but if you notice a mistake in clade or lineage assignment with the current software and dataset, let us know. Bugs and mistakes can happen. In this case, please also provide example data, as well as the expected and observed results, so that we can reproduce, investigate and fix the issue. Without this information it’s hard to say anything definitive.
Note that the `Nextclade_pango` column in Nextclade is produced by an algorithm that is different from the one in the Pangolin software. They are completely unrelated pieces of software. You can read more about the “Nextclade as pango classifier” methodology here: Nextclade as pango lineage classifier: Methods and Validation — Nextclade documentation
and, more generally, about the strengths and limitations of Nextclade in its algorithms documentation, especially the “Tree placement” and “Clade assignment” sections: Algorithm — Nextclade documentation
Finally, the Nextclade citation info is here: Nextclade: analysis of viral genetic sequences — Nextclade documentation. The version of Nextclade Web is displayed in the bottom right corner, and the version of Nextclade CLI can be shown with `nextclade --version`. The dataset tag can be found in the “Updated at” line if using Web, or at the top of the `CHANGELOG.md` file, or in the `.version.tag` field of the `pathogen.json` file if using CLI.
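If you want to record these versions alongside your results, something like the following Python sketch could work. It reads the `.version.tag` field from `pathogen.json` in a downloaded dataset directory and, if the CLI is on your `PATH`, captures the output of `nextclade --version`. The function names and the example path are hypothetical, and the sketch assumes only the `.version.tag` layout described above.

```python
import json
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def read_dataset_tag(dataset_dir: str) -> Optional[str]:
    """Return the `.version.tag` value from pathogen.json, or None if it is missing."""
    pathogen_json = Path(dataset_dir) / "pathogen.json"
    with pathogen_json.open(encoding="utf-8") as f:
        data = json.load(f)
    version = data.get("version")
    if isinstance(version, dict):
        return version.get("tag")
    return None


def nextclade_cli_version() -> Optional[str]:
    """Return the output of `nextclade --version`, or None if the CLI is not installed."""
    exe = shutil.which("nextclade")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def describe_run(dataset_dir: str) -> dict:
    """Collect the software version and dataset tag to store next to the results."""
    return {
        "nextclade_version": nextclade_cli_version(),
        "dataset_tag": read_dataset_tag(dataset_dir),
    }
```

For example, `describe_run("data/sars-cov-2")` (a hypothetical path) returns a small dictionary that can be written to a log file or attached to a bug report.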