I’m attempting to create a new pathogen repo using the steps laid out on the Nextstrain website. I’ve successfully run the default ingest workflow for my pathogen with a few minor modifications (adding in some columns). However, my pathogen is a segmented virus with 10 segments, and ideally I would like to view phylogenetic trees for each segment. While I could do this manually by subsetting the metadata and sequences using the segments column from the NCBI Datasets download, it would be great to have this done automatically within the ingest workflow. There are also many NCBI entries that don’t have anything listed in the segments column, which means an alignment to a reference might be necessary to determine the segment.
It looks like there has been some discussion on what to do with segmented viruses (mainly here and here), with a few methods suggested (the Lassa, avian flu, and oropouche ingest workflows all seem to use slightly different methods). However, I’m unclear whether a consensus has been reached. Has there been any further development of a way to deal with segments individually?
Sorry the consensus isn’t clearly stated in the pathogen-repo-guide issue. The implementation in oropouche (added in this PR) is the latest iteration on the ingest workflow for segmented viruses.
At a high level, the oropouche ingest workflow does the following:
download metadata and sequences from NCBI
run through the usual curation pipeline
align sequences to segment references via nextclade run to separate sequences by segment
merge metadata and Nextclade outputs
transform metadata from one row per accession to one row per strain, where a single strain is linked to multiple segment sequences
output one metadata.tsv and N segment FASTAs (one per segment)
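The metadata transformation step (one row per accession to one row per strain) can be sketched roughly as follows. This is only an illustration of the idea, not the oropouche implementation: the input keys `accession`, `strain`, and `segment`, the `accession_<segment>` output columns, and the example records are all assumptions made for this sketch.

```python
from collections import defaultdict


def collapse_to_strains(records, segments):
    """Collapse accession-level records into one record per strain.

    `records` is an iterable of dicts with the (assumed) keys
    "accession", "strain", and "segment".
    """
    by_strain = defaultdict(dict)
    for record in records:
        segment = record["segment"]
        if segment not in segments:
            # Skip sequences that were not assigned to a known segment.
            continue
        # Keep the first accession seen for each strain and segment.
        by_strain[record["strain"]].setdefault(segment, record["accession"])

    rows = []
    for strain, accessions in sorted(by_strain.items()):
        row = {"strain": strain}
        for segment in segments:
            # An empty string marks a segment with no sequence for this strain.
            row[f"accession_{segment}"] = accessions.get(segment, "")
        rows.append(row)
    return rows


# Hypothetical accession-level records, e.g. after merging Nextclade output.
example = [
    {"accession": "A1", "strain": "strainX", "segment": "L"},
    {"accession": "A2", "strain": "strainX", "segment": "M"},
    {"accession": "A3", "strain": "strainY", "segment": "L"},
    {"accession": "A4", "strain": "strainY", "segment": ""},
]
strain_rows = collapse_to_strains(example, segments=["L", "M", "S"])
```

Each resulting row links a single strain to the accessions of its segment sequences, which is what lets one metadata.tsv serve all N segment FASTAs.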
Happy to answer any questions you have about the implementation details.
Thank you so much for your help! I’ve been able to successfully run the ingest workflow using the oropouche GitHub repo. I did run into trouble initially because I didn’t have anything in segment_resolutions.yaml, but once I added a few lines in there it ran beautifully. I do have a few related follow-up questions that I could use some help with:
Is it possible to include multiple sequences in each segment reference file used in nextclade run? There’s a huge amount of sequence variation within some of the segments and I’m noticing that quite a few seqs aren’t aligning to a ref and thus aren’t included in the final dataset. If I could include a few seq variations in each ref segment file, I would likely keep more sequences in the final dataset.
One of the segments (S1) has several hypervariable regions and many of the sequences in NCBI are just for one of those regions. Specifically, the full segment is ~1600bp, but a region that’s commonly sequenced is only 366bp. It would be great to construct trees of both the complete 1600bp segment and just the hypervariable regions. What would be the best way to do this? I could add additional “segments” to the Snakefile (e.g. segments = ['S1', 'S1_region1', 'S1_region2']) and then create reference seqs for each of them? Perhaps I would need a length filtering step in there too (nextclade run --min-length?). Any thoughts you have on this would be extremely helpful.
Lastly, just to make things more complicated for myself, I have several hundred viral genomes we sequenced in-house that I would (sometimes) like to add into the workflows. However, I’m not sure where I should add them in. Perhaps it makes the most sense to add them into the phylogenetic workflow following the steps laid out here?
Many, many thanks again for the help with this! My hope is to eventually have this pathogen added to nextstrain.org/Nextclade via Nextstrain Groups.
Glad to hear you were able to get things running based on the oropouche repo! Always happy to answer questions and improve the tools and documentation:
According to the Nextclade docs, the reference file can only be a FASTA file with exactly one sequence. You can explore the options for nextclade run to tune the alignment parameters for your case. I see there’s an --alignment-preset high-diversity option that may help with highly divergent sequences.
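As a rough sketch of what that could look like, the snippet below only builds the command line for one segment and prints it, without running anything. The file paths and the minimum length of 300 are hypothetical, and the `--input-ref` and `--output-tsv` flag names are my assumption of the current Nextclade CLI, so please check them against `nextclade run --help` for your installed version.

```python
import shlex


def nextclade_command(reference_fasta, sequences_fasta, output_tsv, min_length=None):
    """Build a `nextclade run` invocation for a single segment reference."""
    command = [
        "nextclade", "run",
        "--input-ref", reference_fasta,          # exactly one reference sequence
        "--alignment-preset", "high-diversity",  # preset mentioned above
        "--output-tsv", output_tsv,
    ]
    if min_length is not None:
        # Optional length cutoff, e.g. when targeting a short sequenced region.
        command += ["--min-length", str(min_length)]
    command.append(sequences_fasta)
    return command


# Hypothetical paths and cutoff; adjust to your own repo layout.
command = nextclade_command(
    "references/S1.fasta", "sequences.fasta", "nextclade_S1.tsv", min_length=300
)
print(shlex.join(command))
```

Building the argument list per segment keeps the preset and any length cutoff in one place if you end up with one rule per segment in the Snakefile.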
The generic pattern for including private sequences is being added to the pathogen-repo-guide in this PR. The short answer is yes, we are adding in data at the beginning of the phylogenetic workflow. This should work well as long as your in-house genomes have unique IDs and metadata needed for your phylogenetic analysis.
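Before merging, it may be worth checking the unique-ID requirement explicitly. Here is a minimal sketch, assuming you can load the strain IDs of the public and in-house datasets into plain lists; the function name and the example IDs are made up for illustration and are not part of the pathogen-repo-guide.

```python
def find_id_conflicts(public_ids, private_ids):
    """Return (IDs shared with the public data, IDs repeated in the private data)."""
    public_seen = set(public_ids)
    private_seen = set()
    shared = set()
    repeated = set()
    for strain_id in private_ids:
        if strain_id in public_seen:
            # Same ID already exists in the public (NCBI-derived) dataset.
            shared.add(strain_id)
        if strain_id in private_seen:
            # Same ID appears more than once in the in-house dataset.
            repeated.add(strain_id)
        private_seen.add(strain_id)
    return sorted(shared), sorted(repeated)


# Hypothetical IDs for illustration only.
shared_ids, repeated_ids = find_id_conflicts(
    public_ids=["strainX", "strainY"],
    private_ids=["lab001", "lab002", "lab002", "strainY"],
)
```

If either list comes back non-empty, renaming the in-house IDs (for example with a lab-specific prefix) before they enter the phylogenetic workflow should avoid collisions.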