Thank you very much, James!
I manually renamed the Accession ID column to strain so that the downstream augur filter step knows which column to use. By the way, I found --exclude-all --include VIP.list a bit counter-intuitive. Why do I need “--exclude-all” if I have already specified “--include”?
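In case it helps anyone else, the rename could also be scripted instead of done by hand. This is only a minimal Python sketch: the function name rename_header_column and the sample rows are made up for illustration, and it assumes a tab-separated metadata file whose first line is the header.

```python
def rename_header_column(tsv_text, old="Accession ID", new="strain"):
    """Return tsv_text with the header column `old` renamed to `new`.

    Only the first line (the header) is modified; data rows are left untouched.
    """
    header, sep, rest = tsv_text.partition("\n")
    # Preserve a Windows-style line ending on the header if there is one.
    cr = "\r" if header.endswith("\r") else ""
    columns = header.rstrip("\r").split("\t")
    if old not in columns:
        raise ValueError(f"column {old!r} not found in header: {columns}")
    columns = [new if name == old else name for name in columns]
    return "\t".join(columns) + cr + sep + rest


# Made-up sample rows, for illustration only.
sample = (
    "Virus name\tAccession ID\tCollection date\n"
    "hCoV-19/example/1/2020\tEPI_ISL_0000001\t2020-01-01\n"
)
renamed = rename_header_column(sample)
print(renamed.splitlines()[0])
```

The rewritten text can then be saved to a new file and given to augur filter as its metadata input.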
As shown above, I have been downloading the GISAID files under the “Alignment and proteins” section, marked with the green circle. I think this saves a lot of time, and it is good for the international community to work from the same aligned file, from a trustworthy source. I guess a lot of technically savvy users prefer to download the unaligned FASTA file, marked with the red circle. Just curious, does Nextstrain use an alignment method similar to GISAID's? I think all genomes were aligned against the reference genome WIV04, one by one. So where does this “multiple sequence alignment” come from? There are only one-to-one alignments, correct?
Previously I worked with human genome data, where I always started from FASTQ files and then converted them to BAM files. From GISAID we now only have FASTA files, no FASTQ files. Is that because viral genomes are small and very easy to sequence with high confidence, so that no quality scores are needed?
Finally, is Nextstrain’s phylogenetic tree inference based on a distance method, while MEGA is based on a maximum likelihood method, so that MEGA can only process tens of genomes rather than millions?
Thank you very much!