Teacher Q on Nextclade RSV-A reference dataset

Achattwood · September 26, 2025, 9:49pm

Note: I know next to nothing about bioinformatics

I am a teacher who used the Nextclade RSV-A reference dataset as part of a classroom activity for Gr.11 students. The activity relied on the four nucleotide sequences linked at the bottom of this post.

After what I assume was an update to the reference dataset on 2025-09-09, only one of these four sequences remains as part of the dataset.

Hoping this forum can help me find a solution that doesn’t involve rewriting the lesson plan.

Ways I have been thinking about this:

Is there a way to recover previous reference dataset versions in Nextclade?
Can I manually add the three lost sequences back to the reference dataset?
If I rewrite the lesson plan, will a future update erase my efforts?

Guidance from the community very much appreciated!

Thank you,

Alex

Pubmed Sequence / Nextclade Sample ID
PP376762 / 010203010
PP376826 / 010203025
PP376478 / 010203023
PP376678 / 010203031

Source: Langedijk,A.C., Lebbink,R.J., Bont,L.J. and Evers,A. The Genomic Evolutionary Dynamics and Global Circulation Patterns of Respiratory Syncytial Virus

ivan-aksamentov · September 28, 2025, 1:40pm

Hi @Achattwood

The complexity of it depends on where exactly you want to add these sequences. Please give some more information, perhaps screenshots.

I’ll answer the questions literally for now:

Is there a way to recover previous reference dataset versions in Nextclade?

There are several ways to do this, but I don’t think any of them is very convenient currently. The easiest is probably to pre-configure Nextclade Web to load a dataset by its “path” and “tag” like this:

https://clades.nextstrain.org/?dataset-name=nextstrain/rsv/a/EPI_ISL_412866&dataset-tag=2025-09-09--12-13-13Z

However, the dataset-tag= parameter currently seems to be buggy and causes a crash for all tags except the latest. I’ll investigate and try to fix it.

Once it’s working, if you know a tag, you can roll back to it. All tags are listed on the “Releases” page of the nextclade_data GitHub repo: https://github.com/nextstrain/nextclade_data/releases?q="rsv"&expanded=true or in the data_output/index.json file: https://github.com/nextstrain/nextclade_data/blob/1e827a5c13f27553e5cd1b7b2d7c3f3c96e2c2fc/data_output/index.json#L1977-L2035
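As a rough sketch of how such a pinned link could be assembled (for example, to paste into a lesson handout), the snippet below builds the URL from the two parameters mentioned above. The function name and the tag-format check are my own illustration, not part of Nextclade. The check only assumes that tags look like the ones quoted in this thread, with a double hyphen between the date and the time.

```python
import re
from urllib.parse import urlencode

# Tag shape as quoted in this thread, e.g. "2025-09-09--12-13-13Z".
# This pattern is an assumption based on those examples only.
TAG_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}--\d{2}-\d{2}-\d{2}Z")


def pinned_web_url(dataset_name, tag, host="https://clades.nextstrain.org"):
    """Return a Nextclade Web link that requests one specific dataset tag."""
    if not TAG_PATTERN.fullmatch(tag):
        # Catches copy-paste damage such as an en dash replacing "--".
        raise ValueError(f"unexpected tag format: {tag!r}")
    # safe="/" keeps the dataset path readable instead of percent-encoding it.
    query = urlencode({"dataset-name": dataset_name, "dataset-tag": tag}, safe="/")
    return f"{host}/?{query}"


print(pinned_web_url("nextstrain/rsv/a/EPI_ISL_412866", "2025-09-09--12-13-13Z"))
```

Passing a different `host` lets the same helper target a self-hosted copy of Nextclade Web.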

Similarly, if you use Nextclade CLI, you can pass --tag to the nextclade dataset get command when fetching dataset files.
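A minimal sketch of that CLI call, wrapped in Python so it can sit in a setup script. The `--tag` option is the one mentioned above. `--name` and `--output-dir` are the options I remember from the `nextclade dataset get` documentation, so please double-check them against `nextclade dataset get --help` for your version. The output directory name is just a placeholder.

```python
import shlex


def dataset_get_command(name, tag, output_dir):
    """Build the argument list for fetching one pinned dataset version."""
    return [
        "nextclade", "dataset", "get",
        "--name", name,
        "--tag", tag,
        "--output-dir", output_dir,  # hypothetical local directory
    ]


cmd = dataset_get_command(
    "nextstrain/rsv/a/EPI_ISL_412866",
    "2025-09-09--12-13-13Z",
    "rsv_a_pinned",
)
# Print the command for review; to actually run it, use
# subprocess.run(cmd, check=True) with the Nextclade CLI installed.
print(shlex.join(cmd))
```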

If you want to inspect dataset files, the “Releases” page contains zip archives of the datasets for each tag. Or you can find all versions of all datasets in the data_output/ directory of the repo.

More complicated ways, but with more control:

Use ?input-fasta in web, providing the URL of an existing hosted FASTA file, e.g. from the nextclade_data repo: https://clades.nextstrain.org/?input-fasta=https://github.com/nextstrain/nextclade_data/blob/master/data_output/nextstrain/rsv/a/EPI_ISL_412866/2025-08-25--09-00-35Z/sequences.fasta
Use ?dataset-url= in web, providing the URL of an existing hosted previous version of this dataset (e.g. from our nextclade_data repo), and possibly combine it with ?input-fasta
Download and host the dataset you need and use ?dataset-url= in web or --input-dataset in CLI
Clone the entire nextclade_data repo and modify any datasets as you see fit, then use ?dataset-server in web or --server in CLI.

If you want to use a custom single dataset or an entire dataset server, you’d need to host these datasets somewhere: on your local computer, on a local network, or on the internet. GitHub also works. See the docs.
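For the "local computer" option, here is a small sketch of a static file server built only on the Python standard library. It assumes the browser needs a CORS header to fetch dataset files from a different origin, which is worth confirming in the Nextclade Data docs. The class and function names, the directory, and the port are placeholders of mine.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds a permissive CORS header to every response."""

    def end_headers(self):
        # Assumption: Nextclade Web fetches the files cross-origin from the browser.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def make_server(directory, port=8000):
    """Create (but do not start) a local server for the files in `directory`."""
    handler = functools.partial(CORSRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   make_server("rsv_a_pinned", 8000).serve_forever()
```

The server only listens on 127.0.0.1, so it is reachable from the same machine but not from the rest of the network. Serving a whole classroom would need a different bind address or proper hosting.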

Relevant docs:

Nextclade Web: URL parameters
Nextclade CLI: nextclade dataset get
Nextclade Data docs (which explains how to create, modify, serve datasets)

Can I manually add the three lost sequences back to the reference dataset?

I am not sure where you want to add them exactly.

I haven’t checked the datasets, sorry, but if you want to add sequences to input fasta (the sequences being analyzed), then you could drag and drop multiple files in web, or append them to an existing fasta file in the existing dataset (see FASTA format), or use ?input-fasta in web (multiple occurrences allowed), or provide multiple positional arguments in CLI.
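If appending by hand feels error-prone, the merge can be scripted. This is only a sketch with hypothetical file names. It assumes plain-text FASTA files in which every record starts with a `>` header line, and it adds a missing final newline so that records from different files don't run together.

```python
from pathlib import Path


def concat_fasta(inputs, output):
    """Concatenate FASTA files into `output`, keeping records separated.

    `output` must not be one of `inputs`, because it is overwritten.
    """
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            text = Path(path).read_text(encoding="utf-8")
            if not text.strip():
                continue  # skip empty files
            if not text.lstrip().startswith(">"):
                raise ValueError(f"{path} does not start with a FASTA header ('>')")
            # Make sure the next file's header starts on its own line.
            out.write(text if text.endswith("\n") else text + "\n")


# Usage with hypothetical file names:
#   concat_fasta(["sequences.fasta", "lesson_extra.fasta"], "lesson_input.fasta")
```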

If you are talking about samples on the reference tree, then you’d have to rebuild the tree, as explained in the “Dataset creation guide”.

If I rewrite the lesson plan, will a future update erase my efforts?

Yes, likely. Regarding example input sequences: these are chosen more or less arbitrarily by the dataset authors, based on their own considerations, usually just to showcase the various features of the dataset. The examples are not significant for Nextclade’s main use case, because most users come with their own data to be analyzed. If we are talking about input sequences, then you can drag and drop any FASTA file, with any sequences you like.

Regarding reference trees: the subsampling step during tree construction is non-deterministic, so it’s hard to predict which sequences end up on a tree. But there are some toggles if you are willing to build your own tree. It’s quite involved technically.

For reproducibility, you’d want to “freeze” the dataset version used: either by using a concrete tag, or by using your custom dataset or at least custom input sequences (if that’s the only requirement).
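If the frozen material is a FASTA file kept alongside the lesson, a quick pre-class check that the expected sequences are still in it might look like the sketch below. The function names are made up, and it assumes the accession appears somewhere in each FASTA header line, which may not hold for every file. It does not apply to samples that exist only as tips on the reference tree.

```python
def fasta_names(path):
    """Return the header text (without '>') of every record in a FASTA file."""
    names = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                names.append(line[1:].strip())
    return names


def missing_ids(path, required_ids):
    """List the required IDs that appear in no header of the FASTA file."""
    names = fasta_names(path)
    # Substring match, because headers often carry more than the accession.
    return [rid for rid in required_ids if not any(rid in name for name in names)]


# Accessions listed in the original post of this thread.
LESSON_IDS = ["PP376762", "PP376826", "PP376478", "PP376678"]

# Usage with a hypothetical file name:
#   print(missing_ids("lesson_input.fasta", LESSON_IDS))
```

An empty list means every accession was found in some header.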

You’d also want to freeze the software version. We keep updating the software, fixing bugs and adding new features. This could cause unintended consequences, which are hard to predict on our side, though we try to avoid any obvious breakage for users. Currently you cannot select the version of the Nextclade Web software; it’s always the latest. You can host your own copy of Nextclade Web though, since the software is open-source. And you can download any version of Nextclade CLI if that’s what you use.

Let me know if it helps at all or how I can assist further.

P.S. We might need to consider adding some more convenient ways to select dataset versions in Web. So far the assumption was that most web users would always want the latest and greatest, with the ability to “freeze” the version for reproducibility being an “advanced” feature of the CLI.

trs · September 29, 2025, 7:06pm

For what it’s worth, @Achattwood, I took a quick peek at the last few versions of the Nextclade dataset (nextstrain/rsv/a) and first found the four sequences you listed in tag (version) 2024-11-27--02-51-00Z. So if you want to pin the dataset version, that’s likely the one you’ll want.

Achattwood · October 1, 2025, 7:29pm

Hello Ivan and Thomas. Thank you for taking the time to help with this! I have tried only the three suggestions that involve accessing the release 2024-11-27--02-51-00Z suggested by Thomas through the Web browser. I think the 3rd one worked…

Here’s what I did:

I tried replacing the latest tag in the URL you posted with the previous release tag. I imagine the below screenshot details the crash you expected to see.

This worked in that including the 2024-11-27--02-51-00Z tag in place of the latest version seemed to pull in sequences. The only issue I have here is that I can’t match the sequence names I provided to any of those in the screenshot below. Since Thomas took a quick peek and managed to find them in the dataset, I may be doing something wrong here.

Appending the following to the clades.nextstrain.org URL seemed to work!

/?dataset-url=nextclade_data/data/nextstrain/rsv/a/EPI_ISL_412866 at 9b77ab0ff20e510d0e8e02e8724a9c803d46920b · nextstrain/nextclade_data · GitHub

Attached is a screenshot of the tree that students will create, correctly labelled with comparison sequences.

Thanks again for your help. I will try to implement this within the lesson plan and be back in touch if I run into any further technical issues I can’t figure out. Feeling optimistic right now, though! – Alex

ivan-aksamentov · October 2, 2025, 11:05am

@Achattwood I believe I have fixed the error when using non-latest tags in URL params, and I also added a tag selector to the user interface: when a dataset is selected, there is now a little triangle near the “Updated at” line in the dataset info, which allows you to pick a tag among the existing tags for this particular dataset. The feature is currently in a testing phase on https://master.clades.nextstrain.org/. If all goes well, I will release it to the main site mid-next-week. Don’t hesitate to give it a try and let me know if you notice any defects.

Alternatively, the non-latest ?dataset-tag values should also work in the URL params now:

https://master.clades.nextstrain.org/?dataset-name=nextstrain/rsv/a/EPI_ISL_412866&dataset-tag=2024-11-27--02-51-00Z

(notice master. in the domain name: it will be needed until this is deployed to the release branch of the software. I don’t recommend using the “master” site long-term; it is automatically deployed from the master branch of the GitHub repo and could be unstable)

Achattwood · October 2, 2025, 5:18pm

@ivan-aksamentov it worked perfectly for me. no defects.
