Nextstrain phylogenetic tree in the format used by UCSC

Is there a downloadable version of the SARS-CoV-2 Nextstrain phylogenetic tree (built from GISAID data) in the same format as, or a format similar to, the one used by UCSC (kent/src/hg/utils/otto/sarscov2phylo/pango.clade-mutations.tsv in the ucscGenomeBrowser/kent repository on GitHub)?
I found the Newick file, but I am looking for one that already incorporates the mutations that differentiate the intermediate nodes of the tree (either nucleotide or even amino acid mutations). Thank you.

Hello – sharing a tree that includes GISAID data is usually not possible due to GISAID restrictions (the permission of the submitter of each GISAID sequence that appears in the tree would be required). I’m guessing that is why the Nextstrain team has not replied yet.

However, the nextclade dataset for SARS-CoV-2 includes an Auspice v2 JSON tree, which I believe is constructed mostly from lineage-consensus pseudo-sequences. Auspice v2 JSON trees have mutations annotated on the nodes. The nextclade tree should have very similar mutations to pango.clade-mutations.tsv:

https://raw.githubusercontent.com/nextstrain/nextclade_data/master/data/nextstrain/sars-cov-2/wuhan-hu-1/orfs/tree.json

The format is completely different (a JSON tree vs. a TSV), but again, the set of mutations along the path to the node for each lineage should mostly agree, with some exceptions:

  • the nextclade tree represents deletions as one “-” per base, while indels are completely ignored by UShER so they’re missing from the UCSC file
  • the UShER tree masks Problematic Sites, so those never appear in the UCSC file
  • the UShER tree has branch-specific masking of many sites that are prone to sequencing artefacts. Reversions on those sites are removed from the UShER tree, which unfortunately causes some recombinants to appear (in the UShER tree and UCSC file) to have mutations that they do not have.
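
For comparing the two, the mutations along the path to each node (internal nodes included) can be collected by walking the JSON tree. The following is only a minimal sketch, assuming the usual Auspice v2 layout in which the tree sits under the top-level "tree" key, each node keeps its own branch mutations in branch_attrs.mutations (keyed by "nuc" or by gene name), and child nodes are listed under "children". The function name path_mutations and the toy tree are made up for illustration. Note that it simply concatenates mutations in root-to-node order and does not collapse reversions or repeated changes at the same site.

```python
def path_mutations(root, key="nuc"):
    """Yield (node_name, mutations_from_root) for every node in an Auspice v2 tree.

    `key` selects which entry of branch_attrs.mutations to read:
    "nuc" for nucleotide changes, or a gene name for amino acid changes.
    An explicit stack is used instead of recursion so deep trees are safe.
    """
    stack = [(root, [])]
    while stack:
        node, inherited = stack.pop()
        own = node.get("branch_attrs", {}).get("mutations", {}).get(key, [])
        path = inherited + list(own)  # new list, so siblings do not share state
        yield node.get("name", ""), path
        # Push children in reverse so they are visited in their original order.
        for child in reversed(node.get("children", [])):
            stack.append((child, path))


# Toy tree in the assumed Auspice v2 shape (not real data).
toy = {
    "tree": {
        "name": "root",
        "branch_attrs": {"mutations": {}},
        "children": [
            {
                "name": "NODE_1",
                "branch_attrs": {"mutations": {"nuc": ["C241T", "C3037T"]}},
                "children": [
                    {
                        "name": "B.1",
                        "branch_attrs": {"mutations": {"nuc": ["A23403G"]}},
                    }
                ],
            }
        ],
    }
}

result = dict(path_mutations(toy["tree"]))
for name, muts in result.items():
    print(name, ",".join(muts), sep="\t")
```

For the real file, the same function could be applied to the "tree" entry of the object returned by json.load on the downloaded tree.json.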

@corneliusroemer also maintains a GitHub repository with the consensus sequence for each Pango lineage (corneliusroemer/pango-sequences); a set of mutations could be obtained directly from those by alignment to the reference.
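
As a rough illustration of that last step, substitutions can be listed by comparing an aligned consensus sequence to the reference position by position. This is a sketch under strong assumptions: both sequences are already aligned in reference coordinates and have the same length (no gap columns in the reference), and the function name substitutions is hypothetical. Positions where either sequence has a gap or an ambiguous base are skipped, so deletions are not reported.

```python
def substitutions(ref, seq):
    """List substitutions such as 'A5T' (1-based) between two aligned sequences.

    Assumes `ref` and `seq` have equal length and share reference coordinates.
    Only unambiguous A/C/G/T differences are reported; gaps and N are ignored.
    """
    if len(ref) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    muts = []
    for pos, (r, s) in enumerate(zip(ref.upper(), seq.upper()), start=1):
        if r != s and r in "ACGT" and s in "ACGT":
            muts.append(f"{r}{pos}{s}")
    return muts


# Toy sequences, not real genomes.
example = substitutions("ACGTACGT", "ACGTTCGT")
print(example)
```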

Thank you very much for this explanation. Indeed, I had not considered that the GISAID data may not be publicly exposed.
Otherwise, the Nextstrain JSON file is interesting and seems to align well with your UCSC tree (thank you for enumerating the specific differences between the two approaches!). My concern is with the update policy of these two valuable resources: how often do you update the UCSC tree?
As for the repo with the consensus sequences, it is also very interesting. However, I would prefer not to depend too much on the specific lineages, but rather to be able to observe what happens at the intermediate nodes as well.