S3 or URL links for workflow data

adalisan · June 28, 2024, 8:41pm

I am trying to download some intermediate workflow data files from data.nextstrain.org.
According to the documentation in reference/data-files.html, these should be available from S3 buckets or URLs.
I have not been able to find the right URLs for the data from the documentation or from my guesses based on build parameters. For example, I tried
data.nextstrain.org/nextstrain-public/files/workflows/seasonal-flu/h3n2_ha_sequences.fasta.zst
or
…org/nextstrain/files/workflows/seasonal-flu/h3n2_ha_sequences.fasta.zst
or
…org/nextstrain/files/workflows/seasonal-flu/h3n2/ha/h3n2_ha_sequences.fasta.zst
or
…org/files/workflows/flu/seasonal/h3n2/ha/2y/metadata.tsv.gz
I am actually looking for seasonal-flu-related data that may be either inputs to the pipeline or outputs of specific Snakemake rules. Please let me know if those files are publicly available, or whether they are not yet published, as mentioned in the documentation:
“The publishing of our SARS-CoV-2 (ncov) workflow’s data files led us to the goal of doing the same for our other pathogen workflows too. This work is still in-progress and not all of the examples given below exist yet.”

joverlee · June 28, 2024, 8:59pm

Hi @adalisan,

Any pathogen workflows that use GISAID data will not have any files available through data.nextstrain.org because we cannot share them publicly.

Workflows using GISAID data will usually list the data source at the top of the page:
[Screenshot 2024-06-28 at 1.57.51 PM]

Best,
Jover

adalisan · June 28, 2024, 9:28pm

Thank you @joverlee.
Is that also true for intermediate files that are not direct derivatives of the GISAID data? For example, the LBI measure for clades, which seems to be based on phylogenetic trees.
How about public builds? Are there no public nextstrain builds for seasonal flu?

adalisan · June 28, 2024, 9:41pm

I realized the URLs I provided as examples point to data files containing metadata and sequences from GISAID (I used those because the documentation referred to those specific files). I am actually interested in intermediate or output files from Nextstrain, such as phylogenetic trees or titer data. Being able to download those intermediate files via URLs (if they are uploaded to S3) would be very useful for our work.

jlhudd · June 28, 2024, 10:43pm

Hi @adalisan, we cannot share most intermediate, derived files based on data from GISAID. The only data we can provide for each build are the time and divergence trees, acknowledgements of authors, and the summary of genetic diversity per site that you find in the “diversity” or “entropy” panel of Auspice.

To get these files, you can navigate to one of our seasonal flu trees (e.g., 2-year H3N2 HA tree), scroll to the bottom of the right-hand pane, click the “Download Data” link, and select the files you want to download from the modal window that appears. When the tree’s branch length display is set to “time”, the download window shows the time tree Newick. When the branch lengths are set to “divergence”, the download window shows the divergence tree Newick.
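If you already have a dataset's tree as Auspice-style JSON rather than the downloaded Newick, a divergence Newick string can be approximated with a short script. This is only a rough sketch, not how Auspice itself produces the file. It assumes each node is a dict with a `name`, an optional `children` list, and a cumulative divergence stored at `node_attrs.div`; the function names `to_newick` and `tree_to_newick` are made up for this example.

```python
def to_newick(node, parent_div=0.0):
    """Serialize one node and its descendants to Newick (no trailing semicolon).

    Assumes node_attrs.div holds cumulative divergence from the root, so the
    branch length is the difference from the parent's value.
    """
    div = node.get("node_attrs", {}).get("div", parent_div)
    children = node.get("children", [])
    subtree = ""
    if children:
        subtree = "(" + ",".join(to_newick(child, div) for child in children) + ")"
    # Names are written as-is; names containing Newick punctuation would need quoting.
    return f"{subtree}{node.get('name', '')}:{div - parent_div:g}"


def tree_to_newick(root):
    """Return the complete Newick string for the tree rooted at `root`."""
    return to_newick(root) + ";"
```

The function is recursive, so very deep trees may need a higher recursion limit or an iterative rewrite. A time tree could be built the same way from a per-node date attribute instead of `div`, if the dataset carries one.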

Related to the GISAID data sharing limitations, we also cannot share raw titer data since these are part of restricted data use agreements with our collaborators who actually produce the measurements.

If you have a GISAID account, you could try out our quickstart guide for running a seasonal-flu build with GISAID data that you download from that account. After you’ve downloaded these data from GISAID, you can run the flu workflow to get intermediate annotations like LBI, etc. Let us know if you have questions about getting that workflow running, if you end up pursuing this path.

adalisan · June 30, 2024, 12:26am

@jlhudd Thank you for your prompt and clear answer. Would it be possible to get direct URLs to the files you provide, without browsing to the particular link you mentioned, assuming we accept the relevant terms of use?
I do have a GISAID account, and have considered setting up the seasonal-flu workflow depending on our team’s needs. Thank you for developing this essential software.

jlhudd · July 3, 2024, 6:25pm

@adalisan Unfortunately, we don’t have direct URLs for the files provided through that download window, since the files are generated by the data visualization tool, Auspice. You may be interested in the REST API to the nextstrain.org server, though, which is a programmatic way to interact with the datasets you see through Auspice on nextstrain.org. You could use this interface to parse out the tree structure or annotations on the nodes like LBI, etc.
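As a rough illustration of that approach, here is a minimal Python sketch that requests a dataset's JSON and collects one node annotation. Several things here are assumptions to check against the API documentation: the `Accept` media type, the dataset path `flu/seasonal/h3n2/ha/2y`, the attribute key `lbi`, and the JSON layout (a top-level `tree` whose nodes carry `name`, `node_attrs`, and `children`). The helper names are made up for this example.

```python
import json
import urllib.request

# Assumed media type for a dataset's main JSON; confirm it in the API docs.
DATASET_MEDIA_TYPE = "application/vnd.nextstrain.dataset.main+json"


def fetch_dataset(path, base="https://nextstrain.org"):
    """Request a dataset's JSON from the server using an Accept header."""
    request = urllib.request.Request(
        f"{base}/{path.strip('/')}",
        headers={"Accept": DATASET_MEDIA_TYPE},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_nodes(root):
    """Yield every node under `root` in depth-first order, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.get("children", [])))


def node_attr_values(dataset, attr):
    """Map node name to the value of one node attribute, skipping nodes without it."""
    tree = dataset["tree"]
    roots = tree if isinstance(tree, list) else [tree]
    values = {}
    for root in roots:
        for node in iter_nodes(root):
            entry = node.get("node_attrs", {}).get(attr)
            if isinstance(entry, dict) and "value" in entry:
                values[node["name"]] = entry["value"]
    return values
```

With network access, something like `node_attr_values(fetch_dataset("flu/seasonal/h3n2/ha/2y"), "lbi")` would then return a dict of node names to LBI values, provided the assumptions above hold for that dataset.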
