Updated example command needed for updated GISAID file

Hi, there:

I noticed that the latest version of the GISAID FASTA file that I downloaded uses the accession ID (for example, EPI_ISL_12345) as the virus name. So, for me to extract a certain list of genomes from this file, there is no need to use sanitize_sequences.py --strip-prefixes any more, I guess. Also, for the augur filter --metadata XXX.metadata.tsv --exclude-all --include XXX.ids command, must the IDs in my XXX.ids file match the IDs in the first column of the XXX.metadata.tsv file? Otherwise, how does augur know which column is the joining field since we did not specify it in the command?

Thanks!

Jie

how does augur know which column is the joining field since we did not specify it in the command

For metadata files, augur will use “strain” or “name” as the ID column, in that order (code here).
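
As a rough illustration of that lookup order, the sketch below picks the first of “strain” or “name” found in a metadata header and uses it to match a list of IDs. It is not augur's actual code; pick_id_column, select_rows and the sample rows are hypothetical names and data made up for this example.

```python
import csv
import io

# Candidate ID columns, tried in this order (per the reply above).
ID_COLUMN_CANDIDATES = ("strain", "name")


def pick_id_column(header, candidates=ID_COLUMN_CANDIDATES):
    """Return the first candidate column name present in the header."""
    for column in candidates:
        if column in header:
            return column
    raise ValueError(f"None of {candidates} found in header {header}")


def select_rows(metadata_tsv_text, wanted_ids):
    """Keep only the metadata rows whose ID column value is in wanted_ids."""
    reader = csv.DictReader(io.StringIO(metadata_tsv_text), delimiter="\t")
    id_column = pick_id_column(reader.fieldnames)
    return [row for row in reader if row[id_column] in wanted_ids]


# Made-up example metadata: "strain" is not the first column here.
EXAMPLE_TSV = (
    "date\tstrain\tname\n"
    "2020-01-01\tEPI_ISL_12345\tsample_a\n"
    "2020-02-01\tEPI_ISL_67890\tsample_b\n"
)
```

In this sketch the column is found by its name, so its position in the file does not matter.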

So, for me to extract a certain list of genomes from this file, there is no need to use sanitize_sequences.py --strip-prefixes any more, I guess.

In this case it seems you can skip this step (but equally, it won’t hurt - if the prefix isn’t there, nothing will be stripped).
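
To show why the step is harmless when the prefix is absent, here is a minimal sketch of prefix stripping. It is not the code of sanitize_sequences.py; the function name strip_prefixes and the prefix list are assumptions chosen for illustration.

```python
# Hypothetical prefixes for illustration; the real script's defaults may differ.
ASSUMED_PREFIXES = ("hCoV-19/", "SARS-CoV-2/")


def strip_prefixes(name, prefixes=ASSUMED_PREFIXES):
    """Remove the first matching prefix from a sequence name, if any."""
    for prefix in prefixes:
        if name.startswith(prefix):
            return name[len(prefix):]
    # No prefix matched: the name is returned unchanged.
    return name
```

With these assumed prefixes, a name such as hCoV-19/Wuhan/WIV04/2019 loses its prefix, while an accession ID such as EPI_ISL_12345 passes through untouched.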

Thank you very much, James!

I manually changed the column name of Accession ID to strain so that the downstream augur filter knows which column to pick. BTW, I found the --exclude-all --include VIP.list a bit counter-intuitive. Why do I need to use “--exclude-all” if I already said “--include”?
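
The manual column rename described above could also be scripted. The following is only a sketch using Python's standard csv module; rename_column is a hypothetical helper, and the column names are taken from the description above.

```python
import csv
import io


def rename_column(tsv_text, old_name, new_name):
    """Return the TSV text with one header column renamed; rows are unchanged."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    if old_name not in header:
        raise ValueError(f"Column {old_name!r} not found in header {header}")
    header[header.index(old_name)] = new_name

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(reader)  # copy the remaining data rows as they are
    return out.getvalue()
```

For example, rename_column(text, "Accession ID", "strain") would leave the data rows alone and change only the header.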

[Screenshot: GISAID download options, with one circled in green and one in red]
As shown above, I have been downloading GISAID files under the “Alignment and proteins” section, marked with the green circle. I think this saves a lot of time, and it is good for the international community to have the same aligned file from a trustworthy resource. I guess a lot of technically savvy people prefer to download the unaligned FASTA file, marked with the red circle. Just curious, does Nextstrain use a similar alignment method to GISAID? I think all genomes were aligned against the reference genome WIV04, one by one. So, where does this “multiple sequence alignment” come from? There is only one-to-one alignment, correct?

Previously I have been working with human genome data. I always start with the FASTQ file, and then convert it to a BAM file. We now only have the FASTA file from GISAID, no FASTQ file. Is it because the virus genome is small and very easy to sequence with high confidence, and therefore no quality measurement is needed?

Finally, is Nextstrain’s phylogenetic tree derivation based on a distance method, while MEGA is based on a maximum likelihood method, so that MEGA can only process tens of genomes, not millions of genomes?

Thank you very much!

Best regards,
Jie

Previously I have been working with human genome data. I always start with the FASTQ file, and then convert it to a BAM file. We now only have the FASTA file from GISAID, no FASTQ file. Is it because the virus genome is small and very easy to sequence with high confidence, and therefore no quality measurement is needed?

Yes, consensus sequences can usually be called with high confidence. The quality scores are used by the upstream pipeline to call the consensus genome.

Finally, is Nextstrain’s phylogenetic tree derivation based on a distance method, while MEGA is based on a maximum likelihood method, so that MEGA can only process tens of genomes, not millions of genomes?

Nextstrain uses maximum likelihood methods (as implemented in IQ-TREE). ML trees of thousands of genomes are perfectly feasible.

Dear Richard:

Thank you very much! I really appreciate your clarification and teaching.

Now I only have 4 questions left. Previously I had listed 10, but I figured some of those out myself.

  1. I think the first publicly posted SARS-CoV-2 genome is Wuhan-Hu-1 (Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, co - Nucleotide - NCBI). This genome has 29,903 nucleotides. Nextstrain assigned the clade “19A” to Wuhan-Hu-1 and “19B” to WIV04, and then the D614G mutation made “19A” become “20A”, correct? Just curious, since there is a “WIV04”, where could we find sequence data for “WIV01”? Also, why is WIV04 used as the reference genome instead of Wuhan-Hu-1?

  2. In the MSA file that I downloaded from GISAID, why does even the reference genome WIV04 have “N” at the beginning and the end? I thought that the reference should be complete. On Severe acute respiratory syndrome coronavirus 2 isolate WIV04, complet - Nucleotide - NCBI, the WIV04 annotation ends with gene=“N”. Where could I find the positions for ORF10 and the 3’UTR?

  3. I found that there are a lot of “N” and “-” characters in the “famous” RaTG13 genome data. Is that due to poor sequencing, or because the RaTG13 genome simply does not have 29,891 nucleotides and therefore some “-” are required to fill the gaps? When I use the translate() function from the Biostrings R package to translate the MSA FASTA file into amino acids, it does not work with “N” and “-”. I don’t know if there is a better approach for this.

  4. Previously, when I worked with human genome data, “alignment” meant aligning each sequencing read to the reference genome. Now, for deriving a phylogenetic tree from the SARS-CoV-2 genomes, I think “alignment” actually means two different things. First, it means getting the aligned FASTA file, so that all genomes have the same length, padded with gaps. This is already done if I download the MSA file from GISAID. Here, all genomes were aligned against the reference genome WIV04, one by one. So, even though the file name says “MSA”, that is not really an MSA. However, if I now use Nextstrain or MEGA to derive a phylogenetic tree from this MSA FASTA file, a “multiple sequence alignment” is needed in order to find the relationships among the genomes in the FASTA file, correct?
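
Regarding question 3, one possible workaround (sketched here in Python rather than R, and not a description of how Biostrings behaves) is to translate codon by codon and emit a placeholder for any codon that contains “N” or a partial gap. The function name translate_tolerant and the choice of “X” and “-” as placeholders are assumptions for illustration; the codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((a, b, c) for a in BASES for b in BASES for c in BASES),
        AMINO_ACIDS,
    )
}


def translate_tolerant(nucleotides):
    """Translate in frame 0; unknown codons become 'X', full-gap codons '-'."""
    seq = nucleotides.upper()
    protein = []
    # Step through complete codons only; trailing 1-2 bases are ignored.
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if codon == "---":
            protein.append("-")  # a whole-codon alignment gap
        else:
            # Codons with N, partial gaps, or other symbols map to 'X'.
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)
```

Note that this sketch assumes the input starts in the correct reading frame; gaps whose length is not a multiple of three in an aligned sequence will still produce “X” at the affected codons.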

I would deeply appreciate it if you could provide quick feedback again. I understand that everybody’s time is precious, and I only ask these questions because I could not find the answers myself after trying repeatedly.

Thank you very much & Best regards,
jie