Teacher Q on Nextclade RSV-A reference dataset

Note: I know next to nothing about bioinformatics

I am a teacher who used the Nextclade RSV-A reference dataset as part of a classroom activity for Grade 11 students. The activity relied on the four nucleotide sequences linked at the bottom of this post.

After what I assume was an update to the reference dataset on 2025-09-09, only one of these four sequences remains as part of the dataset.

Hoping this forum can help me find a solution that doesn’t involve rewriting the lesson plan.

Ways I have been thinking about this:

  1. Is there a way to recover previous reference dataset versions in Nextclade?
  2. Can I manually add the three lost sequences back to the reference dataset?
  3. If I rewrite the lesson plan, will a future update erase my efforts?

Guidance from the community very much appreciated!

Thank you,

Alex

Pubmed Sequence / Nextclade Sample ID
PP376762 / 010203010
PP376826 / 010203025
PP376478 / 010203023
PP376678 / 010203031

Source: Langedijk,A.C., Lebbink,R.J., Bont,L.J. and Evers,A. The Genomic Evolutionary Dynamics and Global Circulation Patterns of Respiratory Syncytial Virus

Hi @Achattwood

The complexity of it depends on where exactly you want to add these sequences. Please give some more information, perhaps screenshots.

I’ll answer the questions literally for now:

Is there a way to recover previous reference dataset versions in Nextclade?

There are several ways to do this, but I don’t think any of them is very convenient currently. The easiest is probably to pre-configure Nextclade Web to load a dataset by its “path” and “tag” like this:

https://clades.nextstrain.org/?dataset-name=nextstrain/rsv/a/EPI_ISL_412866&dataset-tag=2025-09-09--12-13-13Z

However, the dataset-tag= parameter currently seems to be buggy and causes a crash for every tag except the latest. I’ll investigate and try to fix it.

Once it’s working, if you know a tag, you can roll back to it. All tags are listed on the “Releases” page of the nextclade_data GitHub repo: https://github.com/nextstrain/nextclade_data/releases?q=“rsv”&expanded=true or in the data_output/index.json file: https://github.com/nextstrain/nextclade_data/blob/1e827a5c13f27553e5cd1b7b2d7c3f3c96e2c2fc/data_output/index.json#L1977-L2035
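As a minimal sketch of the URL approach described above, the following Python snippet assembles a Nextclade Web link from a dataset path and a tag. The dataset-name and dataset-tag parameters and the example values are taken from this thread; the function name nextclade_web_url is made up for illustration, and a link to a non-latest tag will only load once the crash mentioned above is fixed.

```python
from urllib.parse import urlencode

NEXTCLADE_WEB = "https://clades.nextstrain.org/"


def nextclade_web_url(dataset_name: str, dataset_tag: str, base: str = NEXTCLADE_WEB) -> str:
    """Build a Nextclade Web link that pins a dataset to a specific tag.

    Hypothetical helper: only the two query parameter names come from the thread.
    """
    # safe="/" keeps the slashes in the dataset path readable instead of %2F.
    query = urlencode(
        {"dataset-name": dataset_name, "dataset-tag": dataset_tag},
        safe="/",
    )
    return f"{base}?{query}"


if __name__ == "__main__":
    print(nextclade_web_url("nextstrain/rsv/a/EPI_ISL_412866", "2025-09-09--12-13-13Z"))
```

A link built this way could be pasted into a lesson plan so that every student opens the same dataset version.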

Similarly, if you use Nextclade CLI, you can pass --tag to the nextclade dataset get command when fetching dataset files.
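A rough sketch of the CLI route, wrapped in Python so it can be scripted. It assumes the nextclade executable is installed and on PATH, and that your version accepts --name and --output-dir alongside --tag (check nextclade dataset get --help to confirm). The helper name fetch_dataset and the output directory are hypothetical.

```python
import subprocess


def fetch_dataset(name: str, tag: str, output_dir: str, dry_run: bool = True) -> list[str]:
    """Return the `nextclade dataset get` command for a pinned dataset version.

    With dry_run=False the command is also executed (requires nextclade on PATH).
    """
    cmd = [
        "nextclade", "dataset", "get",
        "--name", name,              # dataset path, as used in the web URL
        "--tag", tag,                # the dataset version to pin
        "--output-dir", output_dir,  # hypothetical local folder for the files
    ]
    if not dry_run:
        # check=True raises CalledProcessError if nextclade exits with an error.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    command = fetch_dataset(
        "nextstrain/rsv/a/EPI_ISL_412866",
        "2024-11-27--02-51-00Z",
        "rsv_a_pinned",
    )
    # Dry run only: print the command so it can be reviewed or pasted into a terminal.
    print(" ".join(command))
```

Calling fetch_dataset(..., dry_run=False) would actually download the files. The dry-run default keeps the sketch harmless on machines without Nextclade CLI.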

If you want to inspect dataset files, the “Releases” page contains zip archives of the datasets for each tag. Or you can find all versions of all datasets in the data_output/ directory of the repo.

More complicated ways, but with more control:

If you want to use a single custom dataset or an entire custom dataset server, you’d need to host these datasets somewhere: on your local computer, on a local network or on the internet. GitHub also works. See the docs.

Relevant docs:

Can I manually add the three lost sequences back to the reference dataset?

I am not sure where you want to add them exactly.

I haven’t checked the datasets, sorry. If you want to add sequences to the input fasta (the sequences being analyzed), you could drag and drop multiple files in Nextclade Web, append them to an existing fasta file in the dataset (see FASTA format), use ?input-fasta in the web URL (multiple occurrences allowed), or provide multiple positional arguments in the CLI.
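For the “append them to an existing fasta file” option, here is a minimal sketch in plain Python, with no bioinformatics libraries assumed. It merges several FASTA files into one and skips records whose ID (the first word after ">") has already been seen. The function names and any file names you pass in are hypothetical.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(inputs, output):
    """Write all records from `inputs` to `output`, skipping repeated IDs.

    Returns the number of records written.
    """
    seen = set()
    with open(output, "w") as out:
        for path in inputs:
            for header, seq in read_fasta(path):
                # The ID is the first whitespace-separated word of the header.
                seq_id = (header.split() or [""])[0]
                if seq_id in seen:
                    continue
                seen.add(seq_id)
                out.write(f">{header}\n{seq}\n")
    return len(seen)
```

The merged file can then be dragged into Nextclade Web as a single input, or passed to the CLI as one positional argument.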

If you are talking about samples on the reference tree, then you’d have to rebuild the tree, as explained in the “Dataset creation guide”.

If I rewrite the lesson plan, will a future update erase my efforts?

Yes, likely. The example input sequences are picked somewhat arbitrarily by the dataset authors, based on their own considerations. Usually they serve only to showcase the various features of the dataset. The examples are not significant in Nextclade’s main use case, because most users come with their own data to analyze. If we are talking about input sequences, then you can drag and drop any fasta file, with any sequences you like.

Regarding reference trees: the subsampling step during tree construction is non-deterministic, so it’s hard to predict which sequences end up on a tree. There are some toggles if you are willing to build your own tree, but it’s quite involved technically.

For reproducibility, you’d want to “freeze” the dataset version used: either by using a concrete tag, or by using your custom dataset or at least custom input sequences (if that’s the only requirement).

You’d also want to freeze the software version. We keep updating the software, fixing bugs and adding new features. This could have unintended consequences that are hard to predict on our side, though we try to avoid any obvious breakage for users. Currently you cannot select the version of the Nextclade Web software; it’s always the latest. You can host your own copy of Nextclade Web though, since the software is open-source. And you can download any version of Nextclade CLI if that’s what you use.

Let me know if it helps at all or how I can assist further.


P.S. We might need to consider adding more convenient ways to select dataset versions in Web. So far the assumption was that most web users would always want the latest and greatest, with the ability to “freeze” the version for reproducibility being an “advanced” feature of the CLI.

For what it’s worth, @Achattwood, I took a quick peek at the last few versions of the Nextclade dataset (nextstrain/rsv/a) and first found the four sequences you listed in tag (version) 2024-11-27--02-51-00Z. So if you want to pin the dataset version, that’s likely the one you’ll want.

Hello Ivan and Thomas. Thank you for taking the time to help with this! I have tried only the three suggestions that involve accessing the release 2024-11-27--02-51-00Z suggested by Thomas through the Web browser. I think the 3rd one worked…

Here’s what I did:

I tried replacing the latest tag in the URL you posted with the previous release tag. I imagine the below screenshot details the crash you expected to see.

This worked in that including the 2024-11-27--02-51-00Z tag in place of the latest version seemed to pull in sequences. The only issue I have here is that I can’t match the sequence names I provided to any of those in the screenshot below. Since Thomas took a quick peek and managed to find them in the dataset, I may be doing something wrong here.

<Deleted screenshot because I can only upload 1 media item>

Following the clades_nextstrain_org URL with the following seemed to work!

/?dataset-url=nextclade_data/data/nextstrain/rsv/a/EPI_ISL_412866 at 9b77ab0ff20e510d0e8e02e8724a9c803d46920b · nextstrain/nextclade_data · GitHub

Attached is a screenshot of the tree that students will create, correctly labelled with comparison sequences.

<Deleted screenshot because I can only upload 1 media item>

Thanks again for your help. I will try to implement this within the lesson plan and be back in touch if I run into any further technical issues I can’t figure out. Feeling optimistic right now, though! – Alex

@Achattwood I believe I have fixed the error when using non-latest tags in URL params, and I also added a tag selector to the user interface: when a dataset is selected, there is now a little triangle near the “Updated at” line in the dataset info, which lets you pick among the existing tags for that particular dataset. The feature is currently in a testing phase on https://master.clades.nextstrain.org/. If all goes well, I will release it to the main site mid-next-week. Don’t hesitate to give it a try and let me know if you notice any defects.

Alternatively, the non-latest ?dataset-tag values should also work in the URL params now:

https://master.clades.nextstrain.org/?dataset-name=nextstrain/rsv/a/EPI_ISL_412866&dataset-tag=2024-11-27--02-51-00Z

(notice master. in the domain name - it will be needed until this is deployed to the release branch of the software. I don’t recommend using “master” site long-term, it is automatically deployed from the master branch in GitHub repo and could be unstable)

@ivan-aksamentov it worked perfectly for me. no defects.