Correspondence of SARS-CoV-2 annotations (Nextclade - Pangolin)

Is there an annotation file that provides the correspondence between Nextclade clades and Pangolin lineages, so that SARS-CoV-2 genomes from GISAID can be annotated with both?

For the moment I could only find:

What I’m looking for is an annotation as follows:

  lineage    clade
  AY.43      21J
  AY.4       21J
  ...        ...

The idea is to generate a table as follows:

  genome                                  lineage    clade
  hCoV-19/Germany/BW-RKI-I-195742/2021    AY.43      21J
  ...                                     ...        ...

The way I proceed now is the following:

  • Parse the Pangolin lineage designation file to find a target genome
  • Look up the target genome in GISAID (as a quality check)
  • Parse the metadata associated with the selected genome to verify the lineage (only the Pangolin lineage is reported)

These steps allow me to build the first part of the table:

  genome                                  lineage
  hCoV-19/Germany/BW-RKI-I-195742/2021    AY.43
  ...                                     ...

Then:

  • Paste the genome into Nextclade and wait for the analysis to complete
  • Record the calculated clade

With this information, I can then build the table:

  genome                                  lineage    clade
  hCoV-19/Germany/BW-RKI-I-195742/2021    AY.43      21J
  ...                                     ...        ...

The idea behind this is that both lineages and clades can be used for “tagging” a genome, depending on the aim: a given clade (e.g. 21J) includes many lineages (e.g. AY.43 and AY.4 in the example above). Also, I don’t enjoy manual and error-prone procedures ;) Thanks for your input.
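If a two-column lineage/clade file like the one described above were available, the clade column could be filled in by a lookup rather than by hand. A minimal sketch, assuming both inputs are tab-separated and have no header line; the function name `join_clade` and the file names in the usage comment are made up for illustration:

```shell
#!/usr/bin/env bash
# join_clade LOOKUP GENOMES
#   LOOKUP : TSV with lineage<TAB>clade  (hypothetical file, no header)
#   GENOMES: TSV with genome<TAB>lineage (hypothetical file, no header)
# Prints genome<TAB>lineage<TAB>clade; "NA" marks lineages absent from LOOKUP.
join_clade() {
  awk -F '\t' -v OFS='\t' '
    # First file: remember the clade for each lineage.
    NR == FNR { clade[$1] = $2; next }
    # Second file: append the clade, or NA when the lineage is unknown.
    { print $1, $2, (($2 in clade) ? clade[$2] : "NA") }
  ' "$1" "$2"
}

# Usage (hypothetical file names):
# join_clade lineage_clade.tsv genome_lineage.tsv > genome_lineage_clade.tsv
```

Lineages without a match are kept and flagged with NA rather than dropped, so gaps in the lookup file stay visible.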

Hi - one possible source is the metadata file for UCSC’s daily-updated tree of public SARS-CoV-2 sequences:

https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.metadata.tsv.gz

It’s a big file – almost 3 million lines – because there are about 3M sequences in GenBank & COG-UK at this point and that number will probably grow. The last two columns of the file (10th & 11th) are the Nextstrain clade and Pango lineage assigned to each sequence based on its placement in the UCSC/UShER tree and the tree’s annotation of where each clade/lineage starts.

This command will make a file relating each Nextstrain clade to at least one lineage (often to a lot of lineages, especially for Delta):

gunzip -c public-latest.metadata.tsv.gz | cut -f 10,11 | sort -u > cladeToLineages.tsv
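That command lists every clade/lineage pair that occurs. For a lookup in the other direction (one clade per lineage, as in the table requested in the question), one option is to count the pairs and keep the most frequent clade for each lineage. A minimal sketch, assuming the column layout described above (column 10 = Nextstrain clade, column 11 = Pango lineage) and that the first line is a header; the function name `lineage_to_clade` and the output file name are made up for illustration:

```shell
#!/usr/bin/env bash
# Reads the metadata TSV on stdin and prints lineage<TAB>clade<TAB>count,
# keeping for each lineage the clade it is paired with most often.
# Assumptions: column 10 = clade, column 11 = lineage, line 1 = header.
lineage_to_clade() {
  awk -F '\t' '
    # Count each (lineage, clade) pair, skipping the header and empty values.
    NR > 1 && $10 != "" && $11 != "" { n[$11 "\t" $10]++ }
    END {
      # For every lineage keep the clade with the highest count
      # (ties are broken arbitrarily).
      for (k in n) {
        split(k, p, "\t")
        if (n[k] > best[p[1]]) { best[p[1]] = n[k]; clade[p[1]] = p[2] }
      }
      for (l in clade) print l "\t" clade[l] "\t" best[l]
    }' | LC_ALL=C sort
}

# Usage:
# gunzip -c public-latest.metadata.tsv.gz | lineage_to_clade > lineageToClade.tsv
```

The count column makes it easy to spot lineages whose clade assignment rests on only a handful of sequences.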

Some Pango lineages have no public sequences (only GISAID sequences, which can’t be shared at a public URL), so those lineages will be missing from that file. Some older lineages are not monophyletic on the UCSC/UShER tree, so those lineages are not annotated and will be missing as well. (In total, about 200 of roughly 1500 lineages are missing.)
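To check whether any lineages you care about are among the missing ones, the lineage column of cladeToLineages.tsv can be compared with a plain list of lineage names. A rough sketch; the function name `missing_lineages` and the one-lineage-per-line input file are hypothetical:

```shell
#!/usr/bin/env bash
# missing_lineages MAP WANTED
#   MAP   : clade<TAB>lineage, as written by the cut/sort command above
#   WANTED: one lineage name per line (hypothetical file)
# Prints every lineage in WANTED that never appears in column 2 of MAP.
missing_lineages() {
  awk -F '\t' '
    # First file: record every lineage that has a clade annotation.
    NR == FNR { seen[$2] = 1; next }
    # Second file: report lineages that were not recorded.
    !($1 in seen) { print $1 }
  ' "$1" "$2"
}

# Usage (my_lineages.txt is a hypothetical file name):
# missing_lineages cladeToLineages.tsv my_lineages.txt
```

Any lineage reported here would still need its clade assigned another way, for example by running Nextclade on a representative genome as described in the question.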

Thanks @AngieHinrichs, this helps a lot.