Web App and Nextclade (Pango)

Hello, I am a research specialist at the Marshfield Clinic Research Institute, and I have previously used the Nextclade web app for Flu and SARS-CoV-2 lineage reports. Recently, I noticed that our SARS-CoV-2 bioinformatics analysis pipeline, Cecret, has not been generating up-to-date Pangolin lineage calls. All the samples are returning JN.1 calls, with an occasional BA.2.86. I then ran the consensus FASTA sequences generated by the Cecret workflow through the Nextclade web app and exported the nextclade.csv results file. My question is: which Pango version/dataset is used to generate the “Nextclade_pango” results in the CSV file, and how can I cite the version of the web app?

I initially ran my consensus FASTA sequences through the Pangolin web app, but its data is even more out of date than the Pangolin calls from our Cecret pipeline. Is the Nextclade_pango column produced by a different algorithm than the most up-to-date Pangolin version?
Thanks,
Adam Bissonnette

Dear Adam,

Nextclade uses so-called datasets to get information about a particular species. A dataset is simply a directory with some files describing the organism. We periodically update both the software and the datasets, independently of each other.

So, the lineages you get will depend on what version of Nextclade you are using and what version of what dataset you are using, which in turn depends on when and how Nextclade is being run.

You can find the dataset version you are using in Nextclade Web in the “Updated at” line of the selected dataset info. If you are using Nextclade CLI, all versions (we also call them “tags”) can be listed with nextclade dataset list. You can download a particular version with nextclade dataset get --name=... --tag=... (the latest tag is used if --tag is omitted).
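As a rough illustration of the difference between fetching the latest dataset and a pinned one, here is a minimal Python sketch that only builds the command lines. The helper name dataset_get_command, the dataset name "sars-cov-2", the output directory, and the tag placeholder are all assumptions for illustration; take real tag values from the nextclade dataset list output.

```python
import shlex


def dataset_get_command(name, output_dir, tag=None):
    """Build the argument list for `nextclade dataset get`.

    Hypothetical helper: it only assembles the command and does not run it.
    When `tag` is None, no --tag flag is passed, so Nextclade fetches the
    latest version of the dataset.
    """
    cmd = ["nextclade", "dataset", "get", "--name", name, "--output-dir", output_dir]
    if tag is not None:
        cmd += ["--tag", tag]
    return cmd


# Assumed example values; the tag below is a placeholder, not a real tag.
latest = dataset_get_command("sars-cov-2", "data/sars-cov-2")
pinned = dataset_get_command("sars-cov-2", "data/sars-cov-2", tag="YYYY-MM-DD--HH-MM-SSZ")

print(shlex.join(latest))
print(shlex.join(pinned))
```

To actually run one of these, the list can be passed to subprocess.run(cmd, check=True) on a machine where Nextclade CLI is installed.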

For each dataset, there is a CHANGELOG.md document with the history of changes for that particular dataset. For SC2 datasets, when a lineage is added to the dataset, the changelog mentions this lineage and its designation date. You can find the CHANGELOG.md file in the downloaded dataset directory if using CLI, or in the “History” tab of the “Dataset” page in Web. It is also available here: Releases · nextstrain/nextclade_data · GitHub. The last release of all 5 official SC2 datasets was 2 weeks ago.

It could be that you are using an old dataset and/or old software. After a quick search, we noticed that dataset versioning is discussed briefly in the Cecret docs: GitHub - UPHL-BioNGS/Cecret: Reference-based consensus creation
However, from a quick inspection I cannot tell which version of Nextclade, which dataset, and which version of that dataset are in use.

For the best results, we recommend using the latest version of Nextclade (3.9.1 at the moment) and always downloading the latest dataset. If you need reproducible or comparable results across runs, use the same version of Nextclade CLI each time and pass the --tag option when downloading the dataset (or simply keep the same downloaded dataset around and reuse it).

We cannot provide support for older versions, but if you notice a mistake in clade or lineage assignment with the current software and dataset, let us know. Bugs and mistakes can happen. In that case, please also provide example data, as well as the expected and observed results, so that we can reproduce, investigate and fix the issue. Without this information it’s hard to say anything definitive.

Note that the Nextclade_pango column in Nextclade is produced by an algorithm that is different from the one in the Pangolin software. They are completely unrelated pieces of software. You can read more about the “Nextclade as pango classifier” methodology here: Nextclade as pango lineage classifier: Methods and Validation — Nextclade documentation
and, of course, more generally, about strengths and limitations of Nextclade, in its algorithms documentation, especially the “Tree placement” and “Clade assignment” sections: Algorithm — Nextclade documentation
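If you want a quick overview of what Nextclade assigned before comparing it with your Pangolin calls, a small script can tally the Nextclade_pango column of the exported file. This is only a sketch: it assumes the exported nextclade.csv is semicolon-delimited (pass a different delimiter if your file differs), the function name tally_nextclade_pango is made up, and the rows in the example are invented sample data, not real results.

```python
import csv
import io
from collections import Counter


def tally_nextclade_pango(handle, delimiter=";"):
    """Count how many sequences were assigned to each Nextclade_pango lineage.

    `handle` is any open text file or file-like object holding the CSV.
    Raises ValueError if the Nextclade_pango column is not present.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    if not reader.fieldnames or "Nextclade_pango" not in reader.fieldnames:
        raise ValueError("no Nextclade_pango column found; check the delimiter and dataset")
    return Counter(row["Nextclade_pango"] for row in reader)


# Invented example rows standing in for an exported nextclade.csv.
example = io.StringIO(
    "seqName;Nextclade_pango\n"
    "sample_1;JN.1\n"
    "sample_2;JN.1\n"
    "sample_3;BA.2.86\n"
)

for lineage, count in tally_nextclade_pango(example).most_common():
    print(lineage, count)
```

For a real file, open it with open("nextclade.csv", newline="", encoding="utf-8") and pass the handle to the function.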

Finally, the Nextclade citation info is here: Nextclade: analysis of viral genetic sequences — Nextclade documentation. The version of Nextclade Web is displayed in the bottom right corner, and the version of Nextclade CLI can be shown with nextclade --version. The dataset tag can be found in the “Updated at” line if using Web, or, if using CLI, at the top of the CHANGELOG.md file or in the .version.tag field of the pathogen.json file.
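For record keeping, the dataset tag can also be read programmatically from a downloaded dataset directory. The sketch below assumes only what is described above, namely that pathogen.json contains a version object with a tag field; the function names and the directory path in the usage note are hypothetical.

```python
import json
from pathlib import Path


def tag_from_pathogen(pathogen):
    """Return the version.tag value from a parsed pathogen.json, or None if absent."""
    return pathogen.get("version", {}).get("tag")


def read_dataset_tag(dataset_dir):
    """Read <dataset_dir>/pathogen.json and return its version.tag value."""
    path = Path(dataset_dir) / "pathogen.json"
    with open(path, encoding="utf-8") as handle:
        return tag_from_pathogen(json.load(handle))
```

For example, read_dataset_tag("data/sars-cov-2") returns the tag of the dataset stored in that (assumed) directory, which you can note alongside the output of nextclade --version when citing your analysis.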