Reduce number of references in RSV phylogenetic tree - Nextclade CLI

Hello,

I’m running Nextclade CLI on some RSV samples using the RSV Nextstrain datasets, and I notice that the phylogenetic tree it produces is very large. How would you suggest subsampling the dataset to reduce the number of reference sequences shown on the tree? Ideally, I only want a few representative sequences (one of each RSV clade) plus my samples to be shown.

Thanks!

Hi @jojo

Can you explain your use-case a little?

Currently, the only way to do this is to create your own dataset, either by building a new reference tree from scratch or by pruning the existing reference tree. The Snakemake workflows used to create these datasets are available on GitHub; you can find links in the dataset's readme file (it is also rendered on the dataset page in Nextclade Web).

As a simpler alternative, you could post-process Nextclade’s output files to your liking. But with manual interventions there is always a danger that you create a picture of what you’d be comfortable seeing, not of what is really there.
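As a rough illustration of such post-processing, here is a minimal Python sketch that prunes the output tree JSON so that it keeps all of your samples plus the first reference tip encountered for each clade. It rests on assumptions you should check against your own file: that placed samples carry `node_attrs["Node type"]["value"] == "New"` and that clades are stored under `node_attrs["clade_membership"]["value"]`. The helper names are made up for this example. It only changes what is displayed; the placement itself has already been done against the full reference tree.

```python
import json


def is_new(node):
    # Assumption: Nextclade marks placed query samples with "Node type" = "New".
    return node.get("node_attrs", {}).get("Node type", {}).get("value") == "New"


def make_keep_tip():
    """Build a predicate that keeps every new sample and one reference tip per clade."""
    seen_clades = set()

    def keep_tip(node):
        if is_new(node):
            return True
        # Assumption: the clade is stored under node_attrs.clade_membership.value.
        clade = node.get("node_attrs", {}).get("clade_membership", {}).get("value")
        if clade is None or clade in seen_clades:
            return False
        seen_clades.add(clade)
        return True

    return keep_tip


def prune_tips(node, keep_tip):
    """Return a pruned copy of `node`, or None if no tip below it is kept."""
    children = node.get("children")
    if not children:
        return dict(node) if keep_tip(node) else None
    kept = [c for c in (prune_tips(child, keep_tip) for child in children) if c is not None]
    if not kept:
        return None  # an internal node with no surviving tips is dropped
    pruned = dict(node)
    pruned["children"] = kept
    return pruned


def prune_file(in_path, out_path):
    # Assumption: the file is an Auspice v2 JSON with the tree under the "tree" key.
    with open(in_path) as f:
        auspice = json.load(f)
    auspice["tree"] = prune_tips(auspice["tree"], make_keep_tip())
    with open(out_path, "w") as f:
        json.dump(auspice, f)
```

You would call something like `prune_file("nextclade.auspice.json", "pruned.auspice.json")` (file names are placeholders) and load the result in Auspice. Which reference tip survives for each clade is arbitrary (simply the first one in traversal order), so treat the result as a picture for presentation, not as an analysis. The function is recursive, so a very deep tree may need a higher recursion limit.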

In the meantime, if you are browsing the resulting tree in the Auspice viewer (embedded in Nextclade Web or standalone), you can filter by “Node type: New”. This hides the reference nodes and keeps only the branches and your samples, so that they are more visible. This is the default when navigating to the tree page in Nextclade Web.

Also, you can zoom into a particular clade by clicking on a branch.

Lastly, if you just need to know the clades of your samples (e.g. for medical diagnostics), you can open the output TSV file in Excel and sort the rows by the clade column.
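If you would rather not go through Excel, the same sorting can be done with a few lines of Python. This is only a sketch: it assumes the TSV has `seqName` and `clade` columns, and the function name is made up for this example.

```python
import csv
import io


def rows_sorted_by_clade(tsv_text):
    """Parse Nextclade TSV text and return its rows sorted by clade, then by name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Rows with a missing clade sort first because they are treated as "".
    return sorted(reader, key=lambda r: (r.get("clade") or "", r.get("seqName") or ""))
```

For example, read the file with `open("nextclade.tsv").read()` (the file name is a placeholder), pass the text to `rows_sorted_by_clade`, and print `row["seqName"]` and `row["clade"]` for each row.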

I only want a few representative sequences (one of each RSV clade)

This might not be a fantastic idea: the sequences are inevitably a little fuzzy, both your input sequences and the sequences on the reference tree. So for accurate placement it makes sense to have a decent amount of diversity here and there to help resolve ambiguities. This is particularly important for low-quality and partial sequences (which occur very often in RSV sequencing), as well as in the presence of contamination and recombination.

In fact, the trend seems to be the opposite: to increase tree sizes. For example, our colleagues at UCSC who build UShER can work with SARS-CoV-2 trees with millions of samples. In Nextclade we limit the number of reference samples for technical reasons, but we’d like to increase these limits in the future.