Scatterplot S1 mutations from exported data

Hi there,

I am Duc and I work for a Swiss media outlet. I was a computational biologist a long time ago, so my apologies if my question below is silly!

I wish to create my own scatterplot of SARS-CoV-2 S1 mutations, something along the lines of this chart

This is also for a media story to illustrate covid’s evolution and mutations.

I can plot an S1 scatter from the Nextstrain interface (S1 mutations against time, coloured by clade), however I can’t wrangle the exported data into a data frame that I could use to recreate such a scatterplot. I can’t even find S1 mutation counts in the exported .nwk/.nexus data.

I use R & tidyverse and I tried a whole bunch of R phylogenetic packages, but without any success. Thanks in advance!

Thanks for writing @Duc. Generating a scatterplot with date on x, S1 mutations on y, and points coloured by clade or variant should be straightforward from the “metadata” export. Unfortunately, as we state in the page footer, we’re prohibited from sharing metadata for GISAID analyses. However, there should be enough open data from GenBank to address your question well. If you want the full pandemic timeline I’d work from:

You can then click on “Download Data” at the very bottom of the page and then click “Metadata TSV”. This will download a file called nextstrain_ncov_open_global_all-time_metadata.tsv, which includes the fields date, S1_mutations, and clade_membership (the last of which lists the variant name).
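As a rough sketch of the wrangling step, assuming the three column names above and ISO-formatted dates (YYYY-MM-DD), something like the following would group the points by clade and draw the scatter. The function names `load_s1_points` and `plot_s1` are made up for illustration, rows with incomplete dates or missing counts are simply skipped (an assumption on my part, you may prefer to handle them differently), and the plotting half assumes matplotlib is installed. The same three columns should work just as well with `read_tsv()` and ggplot2 in R.

```python
import csv
from collections import defaultdict
from datetime import date


def load_s1_points(path):
    """Read the metadata TSV and return {clade: [(date, s1_count), ...]}.

    Assumes columns named date, S1_mutations and clade_membership.
    Rows whose date is not a full YYYY-MM-DD value, or whose mutation
    count is missing, are skipped.
    """
    points = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                day = date.fromisoformat(row["date"])
                s1_count = int(float(row["S1_mutations"]))
            except (KeyError, TypeError, ValueError):
                continue  # incomplete date or missing/non-numeric count
            clade = row.get("clade_membership") or "unknown"
            points[clade].append((day, s1_count))
    return dict(points)


def plot_s1(points, out_path="s1_scatter.png"):
    """Scatter S1 mutations against date, one colour per clade."""
    import matplotlib.pyplot as plt  # imported here so loading works without it

    fig, ax = plt.subplots(figsize=(10, 6))
    for clade in sorted(points):
        days, counts = zip(*points[clade])
        ax.scatter(days, counts, s=8, label=clade)
    ax.set_xlabel("Date")
    ax.set_ylabel("S1 mutations")
    ax.legend(title="Clade", fontsize="small")
    fig.savefig(out_path, dpi=150)
```

Calling `plot_s1(load_s1_points("nextstrain_ncov_open_global_all-time_metadata.tsv"))` would then write the chart to `s1_scatter.png`.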

Please let us know if you have further questions here.

Thank you, that’s very helpful! I spent so much time trying to figure out the .nexus/.nwk format that I didn’t look closely enough at the metadata, duh. Cheers

Great! Happy to provide further advice if helpful. Good luck with the story.