Providing Multiple Sequence Alignments with Nextstrain Trees


I have been looking at and using Nextstrain trees for a few years, primarily to follow the evolution of the variants of interest in SARS-CoV-2 since 2020. I am new to creating my own Nextstrain trees and to setting up a “groups” page for my group (HIV Databases at LANL).

I know that for many data sets the sequences are obtained from GISAID, and there are restrictions on redistributing those sequences in public, such as by sharing the multiple sequence alignment that was used to generate the tree. However, for many other data sets and trees I am looking at on the Nextstrain site, the data all came from GenBank, and it should be possible to share the alignment. The issue I am interested in is that when I view the tree and the entropy/events plot below it and find a site or small region of interest, it is very difficult to locate exactly where that site or region is in the genomes. In the example screenshot I am uploading here, I can see that nucleotide 1530 has high entropy and a high number of events; it is just into the E2 region of the polyprotein gene. But locating that column in my multiple sequence alignment of HCV genomes is not easy, as my alignment is not exactly the same as the one used for this tree.

Hi Brian,

We share alignments for a number of analyses. But this involves a few extra steps when setting up the build, and it doesn’t happen for all builds.

The builds you are looking at were set up by Katie Kistler in Trevor’s group.


Yes, I was tempted to write to Katie and Trevor, but my question here is not really about that particular alignment and tree. It is more about providing tools to make it easier to locate sites or regions of interest in any Nextstrain tree. In a Nextstrain tree, I can click on any tip to get information about that sequence, including a list of its mutations. For one example here in the HCV1b tree:

KX767018 from the GenBank record with that accession number
E2 Changes (23):N1G, H3Y, V4T, A9E, R14H, G15R, F16L, T17V, F20L, I31V, P78S, D80G, K81E, H95N, S97P, S117R, V141A, I225L, I291V, I322V, V326F, F329L, A330V
Reversions to root (1):P24P

So it seems each sequence is compared to a “reference sequence” to get this list of where each sequence has changes from that reference. I suspect the “reference” is the computed common ancestor for the tree. GenBank does not allow us to create a GenBank entry/record for any “theoretical” sequence such as the consensus of a group or the computed common ancestor. GenBank does provide “reference sequences” for most species of virus, and the reference for HCV genotype 1 is ( Hepatitis C virus genotype 1, complete genome - Nucleotide - NCBI ) complete with annotations of the mature proteins such as the E2.

So it would be fairly straightforward to provide a reference sequence, or better yet an alignment of a few reference sequences plus the computed ancestral sequence, which would then make it easy to see which base in the genome is highlighted in my figure above as base 1530 in the genome.
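To illustrate the lookup described above: once the reference sequence is present as a row in your own alignment, finding the column for a given reference position only requires counting ungapped characters in that row. The following is a minimal sketch, not part of any Nextstrain tool; the function name `ref_pos_to_column` and the toy reference row are hypothetical.

```python
def ref_pos_to_column(aligned_ref: str, ref_pos: int, gap_chars: str = "-.") -> int:
    """Return the 1-based alignment column that holds the ref_pos-th
    ungapped base of the (gapped) reference row."""
    if ref_pos < 1:
        raise ValueError("ref_pos is 1-based and must be >= 1")
    seen = 0  # number of ungapped reference bases counted so far
    for column, char in enumerate(aligned_ref, start=1):
        if char not in gap_chars:
            seen += 1
            if seen == ref_pos:
                return column
    raise ValueError(f"reference row has only {seen} ungapped bases")


# Toy reference row (hypothetical); real input would be the reference row
# taken from your own FASTA alignment.
toy_ref_row = "AT--GC-A"
column_of_base_3 = ref_pos_to_column(toy_ref_row, 3)  # third base (G) sits in column 5
```

With a real alignment, the same call with position 1530 would give the column to inspect, assuming the reference row uses the same coordinates as the tree.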

Pretty much all core Nextstrain builds use a reference alignment: either an MSA in which insertions relative to the reference are stripped, or a pairwise alignment to the reference. We do this to provide a fixed coordinate system for mutation numbering.
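As a rough sketch of what stripping insertions relative to the reference means: every alignment column where the reference row has a gap is dropped, so the remaining columns are numbered exactly like the ungapped reference. This is only an illustration on a toy alignment, not the code used in the builds; `strip_insertions` and the sequence names are hypothetical.

```python
def strip_insertions(alignment: dict[str, str], ref_name: str,
                     gap_chars: str = "-.") -> dict[str, str]:
    """Drop every column in which the reference row has a gap, so that
    column i of the result corresponds to reference position i + 1."""
    ref_row = alignment[ref_name]
    if any(len(row) != len(ref_row) for row in alignment.values()):
        raise ValueError("all alignment rows must have the same length")
    # Indices of columns where the reference has a real base.
    keep = [i for i, char in enumerate(ref_row) if char not in gap_chars]
    return {name: "".join(row[i] for i in keep) for name, row in alignment.items()}


# Toy alignment (hypothetical data): seq1 carries insertions relative to ref.
toy_alignment = {
    "ref":  "AT--GC-A",
    "seq1": "ATTTGCCA",
    "seq2": "A---GC-G",
}
stripped = strip_insertions(toy_alignment, "ref")
# stripped["ref"] is "ATGCA"; seq1's inserted bases are gone, seq2's deletion remains.
```

Deletions relative to the reference survive as gaps, while inserted bases are lost, which is the trade-off for getting a fixed coordinate system.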

Auspice reports differences to the root node as well as positions at which somewhere along the path a second mutation restored the root state. In most cases, the sequence we compare to here is the reconstructed sequence of the root node (again in reference coordinates). Nextclade, however, aligns to a concrete reference sequence and reports mutations relative to this reference.
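The distinction between changes and reversions can be sketched as follows, assuming the mutations along the path from root to tip are available as labels like `N1G` in root-to-tip order. This is not Auspice's actual implementation; the function `summarize_path` and the intermediate state `S` at position 24 are made up for illustration.

```python
import re

# One mutation label: ancestral state, 1-based position, derived state.
MUTATION = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def summarize_path(mutations: list[str]) -> tuple[list[str], list[str]]:
    """Collapse root-to-tip mutations into net changes relative to the root
    and positions that mutated but ended up back at the root state."""
    root_state: dict[int, str] = {}  # state at the root, per mutated position
    tip_state: dict[int, str] = {}   # state at the tip, per mutated position
    for label in mutations:
        match = MUTATION.match(label)
        if match is None:
            raise ValueError(f"cannot parse mutation label: {label!r}")
        old, pos, new = match.group(1), int(match.group(2)), match.group(3)
        root_state.setdefault(pos, old)  # first mutation seen reveals the root state
        tip_state[pos] = new             # later mutations overwrite earlier ones
    changes, reversions = [], []
    for pos in sorted(tip_state):
        label = f"{root_state[pos]}{pos}{tip_state[pos]}"
        (reversions if tip_state[pos] == root_state[pos] else changes).append(label)
    return changes, reversions


# Hypothetical path: position 24 mutates away from P and later returns to P.
changes, reversions = summarize_path(["P24S", "N1G", "S24P"])
# changes is ["N1G"], reversions is ["P24P"]
```

This also shows why a label such as `P24P` can appear in the list: the tip matches the root at that site even though two mutations occurred along the way.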

In the tree view of Nextclade, there is a hidden node upstream of the root that is the reference, and mutations are relative to the reference (though this can inflate the number of ‘reversions’).

We could surface the reference sequence used more prominently, but it can be found in the GitHub repositories associated with the analysis.

Thanks!! That is very helpful! I am finding that the numbering of sites is much closer to the “reference” than I would have expected for many multiple sequence alignments, so the stripping of gaps or pairwise alignment to a reference explains that.

We (the HIV Databases group) are not putting our alignments in the GitHub repository. Maybe we should be. We were hoping to make them very easy to find by making them available as “Nextstrain” in the pull-down menu of alignment types on our pre-built alignments page.

For HIV-1, we don’t like to strip gaps or insertions because insertions and deletions are just too common in HIV-1. So then the numbering of sites/locations is much more of a pain. We use the HIV-1 subtype B HXB2 (GenBank K03455) as our reference, but it is not a very good reference in some respects; it was only chosen because it was the first clone to be fully sequenced.

Thanks also for explaining Nextclade vs Nextstrain. I had not noticed that difference.