This is also for a media story to illustrate COVID’s evolution and mutations.
I can plot an S1 scatterplot from the Nextstrain interface (S1 mutations against time, coloured by clade), but I can’t wrangle the exported data into a data frame that I could use to recreate that scatterplot. I can’t even find the S1 mutation counts in the exported .nwk/.nexus data.
I use R and the tidyverse, and I tried a whole bunch of R phylogenetic packages without any success. Thanks in advance.
Thanks for writing, @Duc. Generating a scatterplot with date on x and S1 mutations on y, colored by clade or variant, should be straightforward from the “metadata” export. Unfortunately, as we state in the page footer, we’re prohibited from sharing metadata for GISAID analyses. However, there should be sufficient open data from GenBank to address your question well. If you want the full pandemic timeline I’d work from:
You can then click on “Download Data” at the very bottom of the page and then click “Metadata TSV”. This will download a file called nextstrain_ncov_open_global_all-time_metadata.tsv, which includes the fields date, S1_mutations and clade_membership (the last of which lists the variant name).
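As a rough sketch of how that file could be turned into plot-ready points: the snippet below reads a metadata TSV, keeps rows that have a complete date and an S1 mutation count, and groups them by clade. It assumes the three column names mentioned above, ISO-formatted dates (YYYY-MM-DD), and numeric S1_mutations values; the sample rows, the function names and the output filename are made up for illustration. It is written in Python with the standard library, with matplotlib needed only for the optional plotting step. In R the same steps would map onto readr::read_tsv() and a ggplot2 geom_point() layer.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def load_points(tsv_file):
    """Group (date, S1 mutation count) pairs by clade from a metadata TSV."""
    points = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        try:
            day = date.fromisoformat(row["date"])
            # Assumption: the count is stored as a plain number such as "7".
            s1 = int(float(row["S1_mutations"]))
        except (KeyError, ValueError, TypeError):
            # Skip rows with ambiguous dates (e.g. "2021-XX-XX") or no count.
            continue
        clade = row.get("clade_membership") or "unknown"
        points[clade].append((day, s1))
    return points


def plot_points(points, out_path="s1_mutations.png"):
    """Optional: scatter of date vs. S1 mutations, one colour per clade."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    for clade, pts in sorted(points.items()):
        xs, ys = zip(*pts)
        ax.scatter(xs, ys, s=8, alpha=0.6, label=clade)
    ax.set_xlabel("Date")
    ax.set_ylabel("S1 mutations")
    ax.legend(title="Clade", fontsize="small")
    fig.autofmt_xdate()
    fig.savefig(out_path, dpi=150)


# Made-up rows standing in for the downloaded file (not real data).
SAMPLE = (
    "strain\tdate\tS1_mutations\tclade_membership\n"
    "demo/1\t2020-03-01\t0\t19A\n"
    "demo/2\t2021-06-15\t7\t21A (Delta)\n"
    "demo/3\t2021-XX-XX\t5\t21A (Delta)\n"
    "demo/4\t2022-01-10\t\t21K (Omicron)\n"
)

points = load_points(io.StringIO(SAMPLE))
for clade, pts in sorted(points.items()):
    print(clade, len(pts))
```

To run it on the real download, open the file instead of the sample, for example `with open("nextstrain_ncov_open_global_all-time_metadata.tsv", newline="") as fh: plot_points(load_points(fh))`. Dropping rows with incomplete dates is a simplification; for a published figure you may prefer to decide explicitly how to handle them.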
Please let us know if you have further questions here.
Thank you, that’s very helpful! I spent so much time trying to figure out the .nexus/.nwk formats that I didn’t look closely enough at the metadata, duh. Cheers