Nextstrain phylogenetic tree in the format used by UCSC

Is there a downloadable version of the SARS-CoV-2 Nextstrain phylogenetic tree (from GISAID data) in the same (or a similar) format as the one used by UCSC (kent/src/hg/utils/otto/sarscov2phylo/pango.clade-mutations.tsv in the ucscGenomeBrowser/kent repository on GitHub)?
I found the newick file, but I am looking for one that already incorporates mutations that differentiate the intermediate nodes of the tree (either nucleotide or even amino acid mutations). Thank you.

Hello – sharing a tree that includes GISAID data is usually not possible due to GISAID restrictions (the permission of the submitter of each GISAID sequence that appears in the tree would be required). I’m guessing that is why the Nextstrain team has not replied yet.

However, the nextclade dataset for SARS-CoV-2 includes an Auspice v2 JSON tree, which I believe is constructed mostly from lineage-consensus pseudo-sequences. Auspice v2 JSON trees have mutations annotated on the nodes. The nextclade tree should have very similar mutations to pango.clade-mutations.tsv:

https://raw.githubusercontent.com/nextstrain/nextclade_data/master/data/nextstrain/sars-cov-2/wuhan-hu-1/orfs/tree.json

The format is completely different (a JSON tree vs. TSV) but again, the set of mutations along the path to the node for each lineage should mostly agree, with some exceptions:

  • the nextclade tree represents deletions as one “-” per base, while indels are completely ignored by UShER so they’re missing from the UCSC file
  • the UShER tree masks Problematic Sites, so those never appear in the UCSC file
  • the UShER tree has branch-specific masking of many sites that are prone to sequencing artefacts. Reversions on those sites are removed from the UShER tree, which unfortunately causes some recombinants to appear (in the UShER tree and UCSC file) to have mutations that they do not have.
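
For readers who want the "mutations along the path to the node" out of an Auspice v2 JSON tree, here is a minimal Python sketch. It assumes the usual Auspice v2 layout, where each node has a `name`, an optional `children` list, and its own branch mutations under `branch_attrs.mutations` (keyed by `nuc` or by gene name). The function name `mutation_paths` and the toy tree are made up for illustration; this is not code from nextclade or UShER.

```python
from typing import Dict, List


def mutation_paths(root: dict, gene: str = "nuc") -> Dict[str, List[str]]:
    """Map each node name to the mutations accumulated from the root to it.

    Uses an explicit stack rather than recursion so deep trees do not hit
    Python's recursion limit. Reversions are NOT collapsed: a mutation and
    its later reversion both stay in the path.
    """
    paths: Dict[str, List[str]] = {}
    stack = [(root, [])]
    while stack:
        node, inherited = stack.pop()
        own = node.get("branch_attrs", {}).get("mutations", {}).get(gene, [])
        path = inherited + list(own)
        paths[node.get("name", "")] = path
        for child in node.get("children", []):
            stack.append((child, path))
    return paths


# Toy tree in the assumed Auspice v2 node layout (illustrative only).
toy_tree = {
    "name": "root",
    "branch_attrs": {"mutations": {}},
    "children": [
        {
            "name": "B",
            "branch_attrs": {"mutations": {"nuc": ["C241T"]}},
            "children": [
                {
                    "name": "B.1",
                    "branch_attrs": {"mutations": {"nuc": ["C3037T"]}},
                }
            ],
        }
    ],
}

paths = mutation_paths(toy_tree)
```

With a downloaded tree.json, the root would be loaded with `json.load` and passed as `data["tree"]`, assuming that key holds a single root node. Deletions would then show up as one "-" entry per base, as described in the list above.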

@corneliusroemer also maintains a GitHub repository with the consensus sequence for each Pango lineage (corneliusroemer/pango-sequences); a set of mutations could be obtained directly from those by alignment to the reference.
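
As a rough illustration of "obtained by alignment to the reference", the sketch below compares a consensus sequence that has already been aligned to the reference, column by column. It assumes the alignment has no insertions relative to the reference (so column i is reference position i), skips ambiguous N calls, and reports each deleted base as "-". The function name `call_mutations` is hypothetical, and the alignment step itself is not shown.

```python
from typing import List


def call_mutations(ref: str, aligned: str) -> List[str]:
    """List differences between a reference and a sequence aligned to it.

    Assumes both strings have the same length and that the alignment has
    no insertions relative to the reference, so positions are 1-based
    reference coordinates. N in the query is treated as missing data.
    """
    if len(ref) != len(aligned):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    for pos, (r, q) in enumerate(zip(ref.upper(), aligned.upper()), start=1):
        if q == r or q == "N":
            continue
        mutations.append(f"{r}{pos}{q}")
    return mutations


# Toy sequences, not real SARS-CoV-2 data.
example = call_mutations("ACGTAC", "ATG-NC")
```

Here `example` contains one substitution and one single-base deletion; the N at position 5 is ignored.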

Thank you very much for this explanation. Indeed, I had not considered that the GISAID data may not be publicly exposed.
The Nextstrain JSON file is interesting and seems to align well with your UCSC tree (thank you for enumerating the specific differences between the two approaches!). My concern is with the “updating policy” of these two valuable structures: how often do you update the UCSC tree?
As for the repo with the consensus sequences, it is also very interesting; however, I prefer not to depend too much on the specific lineages but, rather, to be able to observe what happens on the intermediate nodes as well.

[Sorry that I didn’t see your reply until just now, I assumed the system would email me but apparently I don’t have it configured.]

how often do you update the UCSC tree?

I maintain a daily-updated tree of pretty much all available SARS-CoV-2 sequences (that I can’t share because of GISAID restrictions), and a shareable version without GISAID sequences (only INSDC, COG-UK and China National Center for Bioinformation / GenBase) that can be downloaded here:

https://hgdownload.gi.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/

The files are very large. The 2024-09-10 build has 8,169,264 sequences. But if you want intermediate nodes… there sure are a lot! :)

Unfortunately I can’t update the git version of pango.clade-mutations.tsv because its size has exceeded our local git repo’s file size limits. A link to the ‘live’ version (i.e. what I’m using in the daily build) is here:

https://hgwdev.gi.ucsc.edu/~angie/pango.clade-mutations.tsv

Warning: I am considering changing the format of pango.clade-mutations.tsv to reduce its size, by using parent lineage names instead of repeating their paths before the mutations that distinguish the child lineages.

Thank you very much for this information and for your great service to the community!

Dear Angie,
we are looking for an automatic way to retrieve the “last modified” date for the file you expose at https://hgwdev.gi.ucsc.edu/~angie/pango.clade-mutations.tsv.

Would it be possible to add this info as a header of the file or, alternatively, can we trust that the date shown in the file “public-latest.version.txt”, available in the /goldenPath/wuhCor1/UShER_SARS-CoV-2 download directory, is the same as that of pango.clade-mutations.tsv?

Thank you!

Yes, this curl command shows the last-modified date of pango.clade-mutations.tsv:

curl -SsI https://hgwdev.gi.ucsc.edu/~angie/pango.clade-mutations.tsv | grep Last-Mod
Last-Modified: Tue, 05 Nov 2024 03:07:15 GMT
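
For an automated check, the same lookup can be sketched in Python with only the standard library: send a HEAD request and parse the Last-Modified header into a datetime. The function names are illustrative, and the sketch assumes the server keeps sending that header as in the curl output above.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

URL = "https://hgwdev.gi.ucsc.edu/~angie/pango.clade-mutations.tsv"


def parse_last_modified(header_value: str) -> datetime:
    """Parse an HTTP date such as 'Tue, 05 Nov 2024 03:07:15 GMT'."""
    return parsedate_to_datetime(header_value)


def fetch_last_modified(url: str = URL, timeout: float = 30.0) -> datetime:
    """Send a HEAD request (no body download) and return Last-Modified."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Last-Modified")
    if value is None:
        raise RuntimeError("server did not send a Last-Modified header")
    return parse_last_modified(value)
```

Calling `fetch_last_modified()` returns a timezone-aware datetime that can be compared against a stored timestamp to decide whether to re-download the file.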

That URL is a symlink to the live file that I edit, so my latest edits can be more recent than the public tree files, but usually I edit the file only when new lineages are annotated or I find something that needs to be fixed. However, I made some extra edits yesterday to make it easier to automatically collapse mutation paths in the future, possibly in the next couple of weeks (so that e.g. KP.3.1.1 will be only “KP.3.1 > C12616T > A13121T” instead of the full path of all mutations).
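
If the collapsed format is adopted, a consumer that still needs full paths could expand them again. The sketch below is only a guess at how that might look: it assumes each lineage maps to a " > "-separated path whose first element may be a parent lineage name, as in the KP.3.1.1 example. The final file layout is not settled, the function name `expand_path` is hypothetical, and the mutations given for KP.3.1 are placeholders, not its real path.

```python
from typing import Dict, List


def expand_path(lineage: str, collapsed: Dict[str, str],
                sep: str = " > ") -> List[str]:
    """Expand a collapsed path into the full list of mutations.

    If the first element of a lineage's path is itself a known lineage
    name, it is replaced (recursively) by that parent's expanded path.
    """
    parts = collapsed[lineage].split(sep)
    head = parts[0]
    if head != lineage and head in collapsed:
        return expand_path(head, collapsed, sep) + parts[1:]
    return parts


# Placeholder data: only the KP.3.1.1 entry comes from the example above;
# the KP.3.1 mutations are invented stand-ins for its real path.
collapsed_paths = {
    "KP.3.1": "A1C > G2T",
    "KP.3.1.1": "KP.3.1 > C12616T > A13121T",
}

full = expand_path("KP.3.1.1", collapsed_paths)
```

Checking the head element against the set of known lineage names avoids having to guess whether a token is a mutation or a lineage from its spelling alone.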

Great, thank you very much for providing the direct command for the date and for informing us about the “format change” with a clear example.
I have one more question: will you ever consider incorporating deletions in the tree you maintain (even if they are currently ignored by UShER)?

Sorry, at this point there are no plans to annotate deletions on the tree. Also no plans to support deletions in UShER. Yatish Turakhia’s group at UC San Diego is developing graph-based tools that will be reference-free and will support indels and rearrangements, but at this point they don’t scale to SARS-CoV-2 volumes of genomes.

Cornelius Roemer has suggested back-annotating deletions onto the UShER tree via metadata, similar to Nextstrain’s augur pipeline, which (IIRC) can add back masked mutations after a tree is constructed. I think that would take a bit longer for 16+ million sequences than for the 5,000 typical for augur (and would work out rather messy given the variety of genome assembly pipelines used for SARS-CoV-2), but I haven’t tried it. :) I’m happy to share the full tree with any registered GISAID user, and the public tree is always there, if anyone has time to try it. There are also Martin Hunt et al.’s viridian assemblies of SRA SARS-CoV-2 sequences (https://www.biorxiv.org/content/10.1101/2024.04.29.591666v1) for a set of millions of sequences with a uniform assembly pipeline that should detect deletions more consistently than the set of all consensus assemblies.