Consensus on how to deal with segmented viruses?

eam · September 9, 2025, 6:04pm

I’m attempting to create a new pathogen repo using the steps laid out on the Nextstrain website. I’ve successfully run the default ingest workflow for my pathogen with a few minor modifications (adding in some columns). However, my pathogen is a segmented virus with 10 segments, and ideally I would like to view phylogenetic trees for each segment. While I could do this manually by subsetting the metadata and sequences using the segments column from the NCBI Datasets download, it would be great to have this done automatically within the ingest workflow. There are also many NCBI entries that don’t have anything listed in the segments column, which means an alignment to a reference might be necessary to determine the segment.

It looks like there has been some discussion on what to do with segmented viruses (mainly here and here), with a few methods suggested (the Lassa, avian flu, and oropouche ingest workflows all seem to use slightly different methods). However, I’m unclear whether a consensus has been reached. Has there been any further development of a way to deal with segments individually?

Thanks so much!

joverlee · September 9, 2025, 7:15pm

Hi @eam,

Sorry the consensus isn’t clearly stated in the pathogen-repo-guide issue. The implementation in oropouche (added in this PR) is the latest iteration on the ingest workflow for segmented viruses.

At a high level, the oropouche ingest workflow does the following:

1. download metadata and sequences from NCBI
2. run through the usual curation pipeline
3. align sequences to segment references via nextclade run to separate sequences by segment
4. merge metadata and Nextclade outputs
5. transform metadata from one row per accession to one row per strain, where a single strain is linked to multiple segment sequences
6. output 1 metadata.tsv + N segment FASTAs
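
As a rough illustration of the "one row per accession to one row per strain" step, here is a minimal sketch. It is not the code from the oropouche repo: the function name, the `accession_<segment>` column layout, and the example records are all hypothetical, and it does not reproduce how the real script chooses between several accessions for the same segment.

```python
from collections import defaultdict


def group_by_strain(records, segments):
    """Collapse one-row-per-accession records into one row per strain.

    Assumes each record has 'accession', 'strain' and 'segment' keys
    (hypothetical column names, not taken from the oropouche workflow).
    """
    strains = defaultdict(dict)
    for rec in records:
        # If a strain has two accessions for the same segment, the later
        # record silently wins here; a real workflow needs an explicit rule.
        strains[rec["strain"]][rec["segment"]] = rec["accession"]

    rows = []
    for name in sorted(strains):
        row = {"strain": name}
        for segment in segments:
            # Empty string when the strain has no sequence for this segment.
            row[f"accession_{segment}"] = strains[name].get(segment, "")
        rows.append(row)
    return rows


# Placeholder records, not real accessions.
records = [
    {"accession": "A1", "strain": "strainX", "segment": "L"},
    {"accession": "A2", "strain": "strainX", "segment": "M"},
    {"accession": "A3", "strain": "strainX", "segment": "S"},
    {"accession": "B1", "strain": "strainY", "segment": "L"},
]
rows = group_by_strain(records, ["L", "M", "S"])
```

Each resulting row carries the strain name plus one accession column per segment, so a single metadata.tsv can point at all N segment FASTAs.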

Happy to answer any questions you have about the implementation details.

Jover

eam · September 12, 2025, 4:06pm

Thank you so much for your help! I’ve been able to successfully run the ingest workflow using the oropouche GitHub repo. I did run into trouble initially because I didn’t have anything in segment_resolutions.yaml, but once I added a few lines in there it ran beautifully. I do have a few related follow-up questions that I could use some help with:

1. Is it possible to include multiple sequences in each segment reference file used in nextclade run? There’s a huge amount of sequence variation within some of the segments and I’m noticing that quite a few seqs aren’t aligning to a ref and thus aren’t included in the final dataset. If I could include a few seq variations in each ref segment file, I would likely keep more sequences in the final dataset.
2. One of the segments (S1) has several hypervariable regions and many of the sequences in NCBI are just for one of those regions. Specifically, the full segment is ~1600bp, but a region that’s commonly sequenced is only 366bp. It would be great to construct trees of both the complete 1600bp segment and just the hypervariable regions. What would be the best way to do this? I could add additional “segments” to the Snakefile (e.g. segments = ['S1', 'S1_region1', 'S1_region2']) and then create reference seqs for each of them? Perhaps I would need a length filtering step in there too (nextclade run --min-length?). Any thoughts you have on this would be extremely helpful.
3. Lastly, just to make things more complicated for myself, I have several hundred viral genomes we sequenced in-house that I would (sometimes) like to add into the workflows. However, I’m not sure where I should add them in. Perhaps it makes the most sense to add them into the phylogenetic workflow following the steps laid out here?

Many, many thanks again for the help with this! My hope is to eventually have this pathogen added to nextstrain.org/Nextclade via Nextstrain Groups.

joverlee · September 12, 2025, 6:46pm

Glad to hear you were able to get things running based on the oropouche repo! Always happy to answer questions and improve the tools and documentation:

1. Based on Nextclade docs, the reference file can only be a FASTA file with exactly 1 sequence. You can explore the options for nextclade run to configure the alignment parameters for your case. I see there’s a --alignment-preset high-diversity option that can potentially be useful for sequence variation.
2. The team has done something similar for measles, where we have a full genome analysis and a 450bp region of the N gene (“N450”) analysis. The measles phylogenetic workflow has separate alignment and filter rules for the N450 analysis to handle the shorter sequence. As you’ve suggested, the alignment step uses nextclade run --min-length and a shorter reference sequence for the 450bp region.
3. The generic pattern for including private sequences is being added to the pathogen-repo-guide in this PR. The short answer is yes, we are adding in data at the beginning of the phylogenetic workflow. This should work well as long as your in-house genomes have unique IDs and metadata needed for your phylogenetic analysis.
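
The unique-ID requirement for in-house genomes can be checked before the data enter the phylogenetic workflow. The following is only a sketch of such a check, not the pattern from the pathogen-repo-guide PR; the function name and the example IDs are hypothetical.

```python
def merge_private(public_ids, private_ids):
    """Combine public and in-house sequence IDs, refusing any duplicates.

    Hypothetical helper: raises ValueError if an in-house ID repeats or
    collides with a public ID, since downstream steps key on the ID.
    """
    seen = set(public_ids)
    clashes = set()
    for seq_id in private_ids:
        if seq_id in seen:
            clashes.add(seq_id)
        seen.add(seq_id)
    if clashes:
        raise ValueError(f"Non-unique sequence IDs: {sorted(clashes)}")
    return list(public_ids) + list(private_ids)


# Placeholder IDs for illustration only.
public_ids = ["public_0001", "public_0002"]
private_ids = ["inhouse_0001", "inhouse_0002"]
combined = merge_private(public_ids, private_ids)
```

Failing loudly on a clash is a deliberate choice in this sketch: silently keeping one of two records with the same ID would make it hard to tell which sequence ended up in the tree.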

eam · September 18, 2025, 5:30pm

Thanks so much for the help!

RE Q1: I did try running nextclade with --alignment-preset high-diversity, and it does increase the number of alignments. However, I also get quite a few instances where an accession aligns to more than one segment, which means the accession is completely dropped (correct?). I tried to reduce spurious alignments by including --min-seed-cover 0.1, but that didn’t seem to have any effect. I took a closer look at a few cases, and this appears to happen when the accession is only a small part of a segment. In instances like these, would it make sense to filter by something like alignment score so as not to lose the accession altogether?

RE Q2: Would there be any reason to do this length filtering step in the ingest workflow (e.g. when doing the nextclade alignments, have reference sequences for each region of interest)? Or does it make sense to just filter to the specific regions of interest in the phylogenetic workflow?

Thank you so much again!

Edited to add: I also have a related issue with one segment (S1) that is made up of three genes. Sometimes an accession is the entire S1 sequence, but more often S1 accessions are just one of (or part of) the three genes. During group_segments.py, a single accession is chosen to represent all of S1. This could end up being the entire S1 sequence (ideal situation), or it could end up being just a single gene sequence with the other two gene accessions being dropped (not ideal since we now don’t have the full S1 sequence). This makes me think that I need to include reference sequences for not just the full S1 segment, but also for each S1 gene. Then each accession could be classified as ‘S1_full’, ‘S1_g1’, ‘S1_g2’, or ‘S1_g3’. However, going back to my original question about multi-mapping segments with Nextclade, I’m not sure how I could get each accession to align to just one of the four S1 refs. Hopefully I’m making some sense here!

joverlee · September 22, 2025, 6:10pm

> accession aligns to more than one segment, which means the accession is completely dropped (correct?).

Correct, within the group_segments.py script, if an accession aligns to multiple segments then it is dropped.

> I took a closer look at a few cases, and this appears to happen when the accession is only a small part of a segment. In instances like these would it make sense to filter by something like alignment score so as not to lose the accession altogether?

You can consider comparing alignments by coverage and pick the segment that has a higher coverage.
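
A coverage comparison along these lines could be sketched as below. This is not the logic in group_segments.py: the function name, the tuple layout, and the `min_margin` tie-break threshold are assumptions made for illustration, and the coverage values are invented. It assumes a coverage value per (accession, segment) alignment is available from the Nextclade output.

```python
from collections import defaultdict


def pick_segment_by_coverage(hits, min_margin=0.05):
    """Assign each accession to the segment it covers best.

    hits: iterable of (accession, segment, coverage) tuples, one per
    alignment. min_margin is a hypothetical safeguard: if the two best
    coverages are closer than this, the accession is still dropped.
    """
    by_accession = defaultdict(list)
    for accession, segment, coverage in hits:
        by_accession[accession].append((float(coverage), segment))

    assigned = {}
    for accession, candidates in by_accession.items():
        candidates.sort(reverse=True)  # highest coverage first
        if len(candidates) > 1 and candidates[0][0] - candidates[1][0] < min_margin:
            continue  # too close to call, treat as ambiguous
        assigned[accession] = candidates[0][1]
    return assigned


# Invented values for illustration.
hits = [
    ("acc1", "S1", 0.23),
    ("acc1", "M2", 0.04),
    ("acc2", "L1", 0.98),
    ("acc3", "S1", 0.10),
    ("acc3", "S2", 0.09),
]
assigned = pick_segment_by_coverage(hits)
```

With these values, acc1 and acc2 are kept and acc3 is dropped because its two alignments are nearly tied. Setting `min_margin=0` would always keep the highest-coverage segment.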

> Would there be any reason to do this length filtering step in the ingest workflow (e.g. when doing the nextclade alignments, have reference sequences for each region of interest)? Or does it make sense to just filter to the specific regions of interest in the phylogenetic workflow?

This really depends on whether you want to use the same ingest workflow for multiple downstream analyses. If you want to start with a single set of clean data for multiple analyses, then it makes more sense to filter to specific regions of interest in the phylogenetic workflow.
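
One possible way to filter to a region of interest downstream is to keep only sequences whose alignment spans that region. The sketch below assumes the alignment table has `seqName`, `alignmentStart` and `alignmentEnd` columns and that the region coordinates use the same convention as those columns; the coordinates and rows shown are made up, and this is not how the measles workflow implements its N450 filter.

```python
def spans_region(alignment_start, alignment_end, region_start, region_end):
    """True if the aligned range fully contains the region of interest."""
    return int(alignment_start) <= region_start and int(alignment_end) >= region_end


# Hypothetical coordinates for a 366bp region within a ~1600bp segment.
REGION = (400, 766)

# Invented rows standing in for an alignment summary table.
rows = [
    {"seqName": "acc1", "alignmentStart": "0", "alignmentEnd": "1600"},
    {"seqName": "acc2", "alignmentStart": "390", "alignmentEnd": "780"},
    {"seqName": "acc3", "alignmentStart": "500", "alignmentEnd": "766"},
]

kept = [
    row["seqName"]
    for row in rows
    if spans_region(row["alignmentStart"], row["alignmentEnd"], *REGION)
]
```

Here the full-length acc1 and the region-only acc2 are kept for the region analysis, while acc3 is excluded because it starts inside the region.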
