Spike protein sequences filtered for lineage

Hi everyone,

I am a biologist in industry R&D. I have been using GISAID data for analysis and Nextstrain/CoVariants as tools to follow the development of new lineages. These resources are just fantastic, and I am amazed by the collaborative effort that is going on to tackle the pandemic.

My question is less related to Nextstrain itself, but I hoped (actually I am quite convinced) that members of the group have the insight to help. I have minimal programming skills, but they are sufficient to analyze smaller datasets.

I need Spike protein sequences for the main lineages as a FASTA file.

I can download genomic sequence data from GISAID for each lineage, and I can download all Spike protein sequences (but not filter them by lineage). Do you know if the data I am looking for is already accessible, or if there is an “easy” way to get to it with minimal knowledge of R?

Really appreciate anyone’s help!


Hi Matthias,

Nextclade or Nextalign may be what you’re looking for: https://clades.nextstrain.org/

Nextalign (a subset of Nextclade) outputs aligned spike sequences.

Nextclade also calls amino acid mutations as a sparse list, e.g. S:501Y, S:614G, etc., so you automatically get only the differences from the reference.
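To illustrate what such a sparse list gives you, here is a minimal Python sketch that picks the spike entries out of a comma-separated mutation string and applies them to a reference protein. The function names, the comma-separated input format, and the toy reference are my own assumptions for illustration, not Nextclade's API. It handles substitutions only; deletions and insertions are ignored.

```python
import re

# Matches entries such as "S:N501Y" or the shorter "S:501Y":
# gene, optional reference residue, 1-based position, new residue.
MUTATION = re.compile(
    r"^(?P<gene>[^:]+):(?P<ref>[A-Z*]?)(?P<pos>\d+)(?P<alt>[A-Z*])$"
)


def spike_substitutions(mutation_list, gene="S"):
    """Return {position: new_residue} for one gene from a comma-separated list."""
    result = {}
    for entry in mutation_list.split(","):
        match = MUTATION.match(entry.strip())
        if match and match["gene"] == gene:
            result[int(match["pos"])] = match["alt"]
    return result


def apply_substitutions(reference, substitutions):
    """Apply 1-based substitutions to a reference protein sequence."""
    residues = list(reference)
    for pos, alt in substitutions.items():
        residues[pos - 1] = alt
    return "".join(residues)


# Hypothetical input string; entries for other genes are skipped.
print(spike_substitutions("ORF1a:T265I,S:N501Y,S:614G"))  # {501: 'Y', 614: 'G'}

# Toy 5-residue "reference", only to show the mechanics.
print(apply_substitutions("MFVFL", {2: "A"}))  # MAVFL
```

With the real spike reference sequence in place of the toy string, the same two steps would rebuild a full-length protein from the sparse list, as long as the sample carries no insertions or deletions.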

Nextstrain publishes metadata for many samples that are on GenBank, annotated with mutations, Pango lineage, etc. Most sequences are from the US, Germany, the UK, and Switzerland; many countries are missing, but the set still contains samples from the most common variants.

You can simply download it using wget; the file locations are listed in “Overview of remote nCoV files (intermediate build assets)” in the SARS-CoV-2 Workflow documentation.
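Once the metadata table is downloaded, grouping sample names by lineage takes only a few lines. The sketch below is in Python and assumes a tab-separated file with columns named `strain` and `pango_lineage`; check the header of the file you actually download, since I am assuming those names. The small table is made up so that the example runs on its own.

```python
import csv
import io


def strains_by_lineage(tsv_handle, lineages,
                       name_col="strain", lineage_col="pango_lineage"):
    """Map each requested lineage to the list of sample names assigned to it."""
    wanted = set(lineages)
    found = {lineage: [] for lineage in lineages}
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        if row[lineage_col] in wanted:
            found[row[lineage_col]].append(row[name_col])
    return found


# Tiny made-up table standing in for the real metadata file.
example = io.StringIO(
    "strain\tpango_lineage\n"
    "sample_1\tB.1.1.7\n"
    "sample_2\tB.1.351\n"
    "sample_3\tB.1.1.7\n"
)
print(strains_by_lineage(example, ["B.1.1.7", "B.1.351"]))
# {'B.1.1.7': ['sample_1', 'sample_3'], 'B.1.351': ['sample_2']}
```

For the real file you would pass an open file handle instead of the `io.StringIO` object, for example `open("metadata.tsv", newline="")`, where the file name is whatever you saved the download as.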

If you need full spike details, you could download one sequence per lineage you need and run it through Nextclade to get the spike protein.

It all depends a bit on what you’re trying to achieve. There are many ways: some simpler but more restricted, some more complicated but more general.
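As one example of the simpler route: if you already have a FASTA file with all spike proteins and a list of sample names belonging to a lineage, the filtering itself can be sketched in plain Python as below. This assumes each FASTA header is exactly the sample name, which is my simplification; real headers often carry extra fields, so you may need to split the header first. The sequences and names here are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, wanted_names):
    """Keep only the records whose header is in wanted_names."""
    wanted = set(wanted_names)
    return [(h, s) for h, s in read_fasta(lines) if h in wanted]


# Made-up records; sequences may span several lines.
example = """>sample_1
MFVFLVLLPL
VSSQ
>sample_2
MFVFFVLLPL
>sample_3
MFVFLVLLPL
""".splitlines()

for name, seq in filter_fasta(example, ["sample_1", "sample_3"]):
    print(f">{name}\n{seq}")
```

This prints the two selected records in FASTA format, which you can redirect to a file. For large files, iterating over an open file handle instead of a list of lines keeps memory use low.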