Correspondence of SARS-CoV-2 annotations (Nextclade - Pangolin)

aechchiki · December 3, 2021, 10:04am

Is there an annotation file that maps between Nextclade and Pangolin variant nomenclature, so that I can annotate SARS-CoV-2 genomes from GISAID with both?

For the moment I could only find:

What I’m looking for is an annotation as follows:

  lineage    clade
  AY.43      21J
  AY.4       21J
  ...        ...

The idea is to generate a table as follows:

  genome                                  lineage    clade
  hCoV-19/Germany/BW-RKI-I-195742/2021    AY.43      21J
  ...                                     ...        ...

The way I proceed now is the following:

Parse Pangolin’s lineage designation file to find a target genome
Look up the target genome in GISAID (as a quality check)
Parse the metadata associated with the selected genome to verify the lineage (only the Pangolin lineage is reported)

These steps allow me to build the first part of the table:

  genome                                  lineage
  hCoV-19/Germany/BW-RKI-I-195742/2021    AY.43
  ...                                     ...

Then:

Paste the genome into Nextclade and wait for the analysis to complete
Record the calculated clade

With this information, I can then build the table:

  genome                                  lineage    clade
  hCoV-19/Germany/BW-RKI-I-195742/2021    AY.43      21J
  ...                                     ...        ...

The idea is that both lineages and clades can be used to “tag” a genome, depending on the aim: a given clade (e.g. 21J) includes many lineages (e.g. AY.43 and AY.4 in the example above). I would also like to avoid manual, error-prone procedures ;) Thanks for your input.
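
To make the target concrete, here is a minimal Python sketch of the final step, assuming a lineage-to-clade lookup already exists. The function name `annotate` and the small lookup dictionary are hypothetical and only reuse the example rows above; they are not part of Nextclade or Pangolin.

```python
import csv
import io

def annotate(genome_lineage_rows, lineage_to_clade):
    """Add a clade column to (genome, lineage) pairs via a lineage -> clade lookup."""
    annotated = []
    for genome, lineage in genome_lineage_rows:
        # Lineages missing from the lookup get an empty clade rather than a guess.
        clade = lineage_to_clade.get(lineage, "")
        annotated.append((genome, lineage, clade))
    return annotated

# Illustrative lookup only, built from the example rows in this post.
lineage_to_clade = {"AY.43": "21J", "AY.4": "21J"}
rows = [("hCoV-19/Germany/BW-RKI-I-195742/2021", "AY.43")]

table = annotate(rows, lineage_to_clade)

# Write the result as tab-separated text with a header row.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["genome", "lineage", "clade"])
writer.writerows(table)
print(out.getvalue(), end="")
```

A plain dictionary is enough here because each lineage in the example belongs to a single clade; a lineage absent from the lookup is left blank so it can be checked by hand.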

AngieHinrichs · December 8, 2021, 12:03am

Hi - one possible source is the metadata file for UCSC’s daily-updated tree of public SARS-CoV-2 sequences:

https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.metadata.tsv.gz

It’s a big file – almost 3 million lines – because there are about 3M sequences in GenBank & COG-UK at this point and that number will probably grow. The last two columns of the file (10th & 11th) are the Nextstrain clade and Pango lineage assigned to each sequence based on its placement in the UCSC/UShER tree and the tree’s annotation of where each clade/lineage starts.

This command will make a file relating each Nextstrain clade to at least one lineage (often to a lot of lineages, especially for Delta):

gunzip -c public-latest.metadata.tsv.gz | cut -f 10,11 | sort -u > cladeToLineages.tsv

Some Pango lineages have no public sequences (only GISAID sequences, which can’t be shared at a public URL), so those lineages will be missing from that file. Some older lineages are not monophyletic on the UCSC/UShER tree, so they are not annotated and will also be missing. (In total, about 200 of roughly 1,500 lineages are missing.)
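
For anyone who would rather do the same in Python, here is a rough sketch that inverts the relationship into lineage -> clades. It assumes the file has a header row and that the 10th and 11th columns are the clade and the lineage, in that order, as described above. The function names and the default column indexes are my own choices for illustration, not part of any UCSC tool.

```python
import csv
import gzip
from collections import defaultdict

def lineage_to_clades(lines, clade_col=9, lineage_col=10):
    """Map each Pango lineage to the set of Nextstrain clades seen with it.

    Column indexes are 0-based, so 9 and 10 are the 10th and 11th columns.
    """
    mapping = defaultdict(set)
    # QUOTE_NONE: treat quote characters in strain names as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # assumption: the first row is a header
    for row in reader:
        if len(row) <= max(clade_col, lineage_col):
            continue  # skip short or malformed rows
        clade, lineage = row[clade_col], row[lineage_col]
        if clade and lineage:
            mapping[lineage].add(clade)
    return dict(mapping)

def load(path="public-latest.metadata.tsv.gz"):
    """Read the downloaded metadata file and build the lineage -> clades map."""
    with gzip.open(path, "rt", newline="") as handle:
        return lineage_to_clades(handle)
```

Calling `load()` on the downloaded file returns a dictionary keyed by lineage. Any lineage whose set holds more than one clade is worth a manual look before using it to tag genomes.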

aechchiki · December 8, 2021, 10:40am

Thanks @AngieHinrichs this helps a lot.
