Spike protein sequences filtered for lineage

matthabjan · February 8, 2022, 3:08pm

Hi everyone,

I am a biologist in industry R&D. I have been using GISAID data for analysis and Nextstrain/CoVariants as a tool to follow the development of new lineages. These resources are just fantastic and I am amazed by the collaborative effort that is going on to tackle the pandemic.

My question is less related to Nextstrain itself, but I hoped (actually I am quite convinced) that members of the group have the insight to help. I have minimal programming skills, but they are sufficient to analyze smaller datasets.

I need Spike protein sequences for the main lineages as a FASTA file.

I can download genomic sequence data from GISAID for each lineage, and I can download all Spike protein sequences (but cannot filter them by lineage). Do you know if the data I am looking for is already accessible, or if there is an “easy” way to get to it with minimal knowledge of R?

Really appreciate anyone’s help!

Matthias

corneliusroemer · February 10, 2022, 8:12am

Hi Matthias,

Nextclade or Nextalign may be what you’re looking for: https://clades.nextstrain.org/

Nextalign (a subset of Nextclade) outputs aligned spike sequences.

Nextclade also calls amino acid mutations as a sparse list (say S:501Y, S:614G, etc.), so you automatically get only the differences to the reference.

Nextstrain publishes metadata for a lot of samples that are on GenBank, annotated with mutations, pango lineage, etc. Most sequences are from the US, Germany, the UK and Switzerland, and many countries are missing, but the set still contains samples from the most common variants.
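As a rough illustration of working with such a sparse list, the sketch below splits a comma-separated mutation string into per-gene lists and keeps only the spike entries. The example string and the function name `group_aa_substitutions` are made up for illustration, and it is an assumption that the list comes as `gene:mutation` entries separated by commas (as in what I believe is the `aaSubstitutions` column of the Nextclade TSV output).

```python
from collections import defaultdict


def group_aa_substitutions(cell):
    """Split a comma-separated 'gene:mutation' list into {gene: [mutation, ...]}."""
    by_gene = defaultdict(list)
    for entry in cell.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty cells or trailing commas
        gene, _, mutation = entry.partition(":")
        by_gene[gene].append(mutation)
    return dict(by_gene)


# Hypothetical example value, not taken from a real sample.
example = "ORF1a:T3255I,S:N501Y,S:D614G,N:R203K"
spike_mutations = group_aa_substitutions(example).get("S", [])
print(spike_mutations)  # ['N501Y', 'D614G']
```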

You can simply download it using wget from the "Overview of remote nCoV files (intermediate build assets)" page of the SARS-CoV-2 Workflow documentation.
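Once the metadata file is downloaded, filtering it by lineage needs only a few lines. This is a minimal sketch in Python rather than R. The column names `strain` and `pango_lineage` are assumptions about the metadata layout and should be checked against the header of the actual file. The tiny in-memory table stands in for the real download.

```python
import csv
import io


def strains_by_lineage(metadata_tsv, lineages,
                       lineage_column="pango_lineage", name_column="strain"):
    """Return {lineage: [strain names]} for the requested lineages.

    metadata_tsv is any open text handle over a tab-separated file.
    The default column names are assumptions; adjust them to the real header.
    """
    wanted = set(lineages)
    result = {lineage: [] for lineage in lineages}
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        lineage = row.get(lineage_column)
        if lineage in wanted:
            result[lineage].append(row[name_column])
    return result


# Made-up rows for illustration only.
sample = io.StringIO(
    "strain\tpango_lineage\n"
    "sampleA\tBA.1\n"
    "sampleB\tB.1.617.2\n"
    "sampleC\tBA.1\n"
)
print(strains_by_lineage(sample, ["BA.1", "B.1.617.2"]))
```

For a gzip-compressed download, the handle could be opened with `gzip.open(path, "rt")` from the standard library and passed in the same way.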

If you need full spike details, you could download one sequence per lineage you need and run each through Nextclade to get the spike protein.

It all depends a bit on what you’re trying to achieve. There are many ways: some simpler but more restricted, some more complicated but more general.
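Picking one sequence per lineage can be sketched as follows, assuming you already have a genome FASTA and a mapping from sequence name to lineage (for example built from a metadata table). The helper names `read_fasta` and `one_per_lineage` and the toy sequences are hypothetical. The sketch simply keeps the first sequence seen for each lineage, which may not be the most representative one.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def one_per_lineage(fasta_handle, lineage_of):
    """Keep the first sequence encountered for each lineage.

    lineage_of maps sequence name -> lineage; names not in the map are skipped.
    """
    chosen = {}
    for name, seq in read_fasta(fasta_handle):
        lineage = lineage_of.get(name)
        if lineage is not None and lineage not in chosen:
            chosen[lineage] = (name, seq)
    return chosen


# Toy data for illustration only.
fasta = io.StringIO(">sampleA\nACGT\nACGT\n>sampleB\nTTTT\n>sampleC\nGGGG\n")
lineages = {"sampleA": "BA.1", "sampleB": "B.1.617.2", "sampleC": "BA.1"}

for lineage, (name, seq) in one_per_lineage(fasta, lineages).items():
    print(f">{name}|{lineage}\n{seq}")
```

The printed records form a small FASTA file that could then be given to Nextclade.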
