Nextstrain files/datasets in Google Cloud (GCP)?

AngieHinrichs · November 22, 2024, 12:47am

Hi! I’m interested in pulling down your MPXV Clade I open data tree(s) and metadata for use in the UShER web interface with attribution, if that’s OK with you. I.e. people could upload sequences to have them placed in your tree, or search for names/IDs, and get back subtrees to view using nextstrain.org/fetch/.

I found the “Data files” page in the Nextstrain documentation, which says gs://nextstrain-data has subdirectories files/datasets/..., but when I run gsutil ls -l gs://nextstrain-data/files/ it shows only ncov and workflows:

                                 gs://nextstrain-data/files/ncov/
                                 gs://nextstrain-data/files/workflows/

– no datasets. Is files/datasets/ planned for future work? I see gs://nextstrain-data/files/workflows/mpox/ has sequences and metadata that could be used to build trees, just wondering if the trees are also available as easy downloads or if I’m overlooking something.

james · November 22, 2024, 12:59am

We didn’t end up adding the “build files” (to use the language of the linked docs page). From memory we didn’t think there would be demand, but also it’s always easier to not do something!

AngieHinrichs wrote: “just wondering if the trees are also available”

Yes, either via S3 (and, I presume, Google Cloud, though I haven’t tested that):

s3://nextstrain-data/mpox_clade-I.json
s3://nextstrain-data/mpox_clade-I_root-sequence.json

or via the RESTful API

curl -H 'Accept:application/json' https://nextstrain.org/mpox/clade-I
curl -H 'Accept:application/vnd.nextstrain.dataset.root-sequence+json' https://nextstrain.org/mpox/clade-I

or via the Nextstrain CLI

nextstrain remote download nextstrain.org/mpox/clade-I

The latter two are nicest as you can tag on @YYYY-MM-DD to the URLs and get the snapshot of the tree that was live on that day.
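For anyone scripting this, the curl calls above might translate to Python roughly as follows. This is only a sketch using the standard library: the helper names (build_request, fetch_json) are made up for illustration, and it assumes the @YYYY-MM-DD suffix is simply appended to the dataset path as described above.

```python
import json
import urllib.request

# Media types taken from the curl examples above.
DATASET_JSON = "application/json"
ROOT_SEQUENCE_JSON = "application/vnd.nextstrain.dataset.root-sequence+json"


def build_request(dataset_path, snapshot=None, media_type=DATASET_JSON):
    """Build a GET request for a nextstrain.org dataset (hypothetical helper).

    dataset_path: e.g. "mpox/clade-I"
    snapshot:     optional "YYYY-MM-DD" string, appended as "@YYYY-MM-DD"
                  (assumed to work the same way as in the curl URLs)
    media_type:   value sent in the Accept header
    """
    url = "https://nextstrain.org/" + dataset_path.strip("/")
    if snapshot:
        url += "@" + snapshot
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_json(request):
    """Send the request and parse the response body as JSON."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling fetch_json(build_request("mpox/clade-I")) would fetch the current tree, and passing media_type=ROOT_SEQUENCE_JSON would request the root sequence instead. A snapshot argument such as "2024-11-01" is just a placeholder date, not one known to have a snapshot.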

james · November 22, 2024, 1:01am

AngieHinrichs wrote: “I’m interested in pulling down your MPXV Clade I open data tree(s) and metadata for use in the UShER web interface with attribution, if that’s OK with you.”

I’ll raise this with the wider group and get back to you. I don’t see an issue myself – all data’s via NCBI after all.

AngieHinrichs · November 22, 2024, 7:36pm

Thanks so much for your answers James! Especially for including the tree’s root sequence, that is important for UShER.

As I look at the clade I tree more closely, I think for UShER purposes it would be better if I make my own separate trees for Ia and Ib because they’re so diverged from each other. So this question is not at all urgent. But in the long term I would like to make a workflow where one can add a new pathogen to usher.bio starting with one of Nextstrain’s many excellent trees, or even perhaps an Augur build.
