10 science questions on the SARS-CoV-2 data

Hey, guys:

I am still stuck on a list of fundamental questions about the SARS-CoV-2 genomic data, which I would like to resolve before I can fully enjoy the fancy tools of Nextstrain. I think clear answers to fundamental questions are key for data analysis software and tools, so I hope that this group won’t feel bothered by my list. Your kind response to any of these 10 questions would be greatly appreciated!

Thank you very much & best regards,



Jie Huang, MD MPH PhD

  • I guess we all download the virus genome data from GISAID, which uses WIV04 as the reference for alignment. However, the published annotation of this sequence (MN996528.1) ends with gene=“N”. So, where can I find the positions of ORF10 and the 3’UTR for this WIV04 reference?

  • From GISAID, I assume that the most commonly downloaded data are the 3 MSA files under “Alignment and proteins”. I found that the FULL file has 36,801 letters for each genome. Why is this so much more than the reference (N=29,891)? The Unmasked and Masked files both have 29,891 letters per genome, but “-” and “n” seem to be used differently in these two datasets. What do “-” and “n” mean in each of them? We all know about face masks these days, but what exactly is “masked” in the Masked file?

  • The FASTA file under the “Download packages” section has a total of 29,862 letters for each genome. Why is it 29 bases shorter than the reference? Is this the unaligned raw file that Nextstrain and other alignment software usually take as input? I am also a bit puzzled that I could not find the sample EPI_ISL_426900 in this file, even though it is the first sample in the MSA files mentioned above.

  • I think the first publicly posted SARS-CoV-2 genome is Wuhan-Hu-1 (NC_045512.2). This genome has 29,903 nucleotides, 12 more than WIV04 (N=29,891) mentioned above. Nextstrain assigned the clade “19A” to Wuhan-Hu-1 and “19B” to WIV04, and then the D614G mutation made “19A” become “20A”, correct? Just curious: since there is a “WIV04”, where could we find sequence data for “WIV01”?

  • Once we have the FASTA file of the reference genome, is there a simple and straightforward bioinformatics approach to identify which proteins are encoded by this genome, along with the start and end positions of each coding region? I thought that this would be very easy to do, based on the central dogma of molecular biology. However, I did not find an easy one-liner for it. Must we go through some complicated BLAST process in order to find out which genes are encoded by the SARS-CoV-2 genome?

  • Now, everybody is talking about the Delta variant. However, once I have a SARS-CoV-2 genome, must I run a phylogenetic analysis to get a “21A” from Nextstrain and a “B.1.617.2” from PANGO before I can call it a Delta variant? Does a Nextstrain clade of “21A” definitely mean “Delta”, and vice versa? Is there a one-to-one relationship between Delta and certain mutations? I had hoped to find this information in the GISAID metadata file that I downloaded from the “Download packages” section. But I am very suspicious of the quality of that metadata file. For example, the maximum value of “Sequence.length” in this file is 148,351, which does not seem to make sense.

  • It is said that HIV has 9 subtypes, HCV has 7 subtypes, and HPV has at least 9 subtypes (since there is a “9-valent vaccine”). Is the “Delta variant” or the Nextstrain “21H” clade of SARS-CoV-2 a subtype? How many subtypes does SARS-CoV-2 have?

  • The PANGO phylogenetic tree is “dynamic”. I think “dynamic” means that the subtype grouping will change over time as mutations spread. So, for exactly the same virus and sequencing data, the label could be B.1.17 this month and then change to something like B.2.17 next month, correct?

  • Do only mutations in the spike protein coding region matter for the transmission of the virus? If so, I guess we don’t even need to sequence the whole SARS-CoV-2 genome; targeted sequencing of the S protein coding region would be enough.

  • Sometimes, a single mutation is enough to make a Delta variant. Other times, a Delta variant might have 10 mutations. Is there a mathematical formula or prediction algorithm to calculate the cumulative effect of, for example, 10 mutations vs. 1 mutation?
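
To make the questions about sequence lengths, the “-” and “n” characters, and protein start/end positions more concrete, here is a minimal Python sketch of the kind of check I have in mind. It is only an illustration under stated assumptions, not how GISAID or Nextstrain process their files: the function names (`read_fasta`, `summarize`, `forward_orfs`) and the `toy_record` sequence are hypothetical, and the ORF scan is a naive forward-strand search from ATG to the first in-frame stop codon. It does not name any protein, and it cannot model features such as the ribosomal frameshift in ORF1ab, so a curated GenBank annotation is likely a more reliable source for real coordinates.

```python
from collections import Counter
from io import StringIO

STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(seq):
    """Count total letters, gap characters ('-') and ambiguous bases ('n'/'N')."""
    counts = Counter(seq.lower())
    return {"length": len(seq), "gaps": counts["-"], "n": counts["n"]}


def forward_orfs(seq, min_codons=30):
    """Naive ORF scan on the forward strand only (an assumption of this sketch).

    Returns sorted 1-based inclusive (start, end) pairs, where end includes
    the stop codon and min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


# Toy input standing in for a real download (hypothetical record).
demo = StringIO(">toy_record\nccATGAAACCCGGGTTTTAAnn--\n")
for name, seq in read_fasta(demo):
    ungapped = seq.replace("-", "")  # drop alignment gaps before scanning
    print(name, summarize(seq), forward_orfs(ungapped, min_codons=3))
```

Running `summarize` over every record of the FULL, Unmasked, Masked and “Download packages” files would show directly how the per-genome letter counts and the “-”/“n” counts differ between them. Comparing the `forward_orfs` output for the reference against the published annotation would show how far a naive scan gets.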