Nextstrain files/datasets in Google Cloud (GCP)?

Hi! I’m interested in pulling down your MPXV Clade I open data tree(s) and metadata for use in the UShER web interface with attribution, if that’s OK with you. I.e. people could upload sequences to have them placed in your tree, or search for names/IDs, and get back subtrees to view using nextstrain.org/fetch/.

I found the documentation page “Data files” (Nextstrain documentation), which says gs://nextstrain-data has subdirectories files/datasets/..., but when I run gsutil ls -l gs://nextstrain-data/files/ it shows only ncov and workflows:

                                 gs://nextstrain-data/files/ncov/
                                 gs://nextstrain-data/files/workflows/

– no datasets. Is files/datasets/ planned for future work? I see gs://nextstrain-data/files/workflows/mpox/ has sequences and metadata that could be used to build trees, just wondering if the trees are also available as easy downloads or if I’m overlooking something.

We didn’t end up adding the “build files” (to use the language of the linked docs page). From memory we didn’t think there would be demand, but also it’s always easier to not do something!

“just wondering if the trees are also available”

Yes, either via S3 (and, I presume, Google Cloud, though I haven’t tested it):

  • s3://nextstrain-data/mpox_clade-I.json
  • s3://nextstrain-data/mpox_clade-I_root-sequence.json

or via the RESTful API

  • curl -H 'Accept:application/json' https://nextstrain.org/mpox/clade-I
  • curl -H 'Accept:application/vnd.nextstrain.dataset.root-sequence+json' https://nextstrain.org/mpox/clade-I

or via the Nextstrain CLI

  • nextstrain remote download nextstrain.org/mpox/clade-I

The latter two are nicest, as you can append @YYYY-MM-DD to the URL and get the snapshot of the tree that was live on that day.
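For anyone scripting the RESTful API route, here is a minimal Python sketch of the two curl calls above, with optional snapshot pinning. The helper names (dataset_url, fetch_json) are hypothetical, and the sketch assumes the server returns uncompressed JSON when given these Accept headers; only the URLs and media types come from the curl examples.

```python
import json
import re
import urllib.request
from typing import Optional

BASE_URL = "https://nextstrain.org"

# Media types copied from the curl examples in this thread.
MAIN_JSON = "application/json"
ROOT_SEQUENCE_JSON = "application/vnd.nextstrain.dataset.root-sequence+json"


def dataset_url(path: str, snapshot: Optional[str] = None) -> str:
    """Build a dataset URL, optionally pinned to a YYYY-MM-DD snapshot.

    Hypothetical helper: it only appends "@<date>" to the dataset path,
    following the @YYYY-MM-DD convention described above.
    """
    url = f"{BASE_URL}/{path.strip('/')}"
    if snapshot is not None:
        # Check the shape of the date only; whether a snapshot exists
        # for that day is up to the server.
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", snapshot):
            raise ValueError(f"expected YYYY-MM-DD, got {snapshot!r}")
        url += f"@{snapshot}"
    return url


def fetch_json(url: str, accept: str) -> dict:
    """GET the URL with the given Accept header and parse the JSON body."""
    request = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With these in place, fetch_json(dataset_url("mpox/clade-I"), MAIN_JSON) would request the current tree, and passing ROOT_SEQUENCE_JSON instead would request the root sequence. Supplying a second argument to dataset_url, in YYYY-MM-DD form, pins both requests to the same day's snapshot.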


“I’m interested in pulling down your MPXV Clade I open data tree(s) and metadata for use in the UShER web interface with attribution, if that’s OK with you.”

I’ll raise this with the wider group and get back to you. I don’t see an issue myself – all data’s via NCBI after all.


Thanks so much for your answers James! Especially for including the tree’s root sequence, that is important for UShER.

As I look at the clade I tree more closely, I think for UShER purposes it would be better if I make my own separate trees for Ia and Ib because they’re so diverged from each other. So this question is not at all urgent. But in the long term I would like to make a workflow where one can add a new pathogen to usher.bio starting with one of Nextstrain’s many excellent trees, or even perhaps an Augur build. 🙂