I have a few basic questions about Nextstrain, and I would deeply appreciate it if an expert could shed some light on them. Compared with traditional phylogenetic software such as MEGA, I feel that the biggest and coolest feature of Nextstrain is that it can handle a large number of genomes. For example, I just checked, and the “latest global analysis” shows “4026 of 4026 genomes sampled between Dec 2019 and May 2021”. I assume that no other phylogenetic software can do this. However, if I only want to identify the origin (virus 0 in patient 0) of SARS-CoV-2, do I really need the genome data of all 4026 viruses? I know that I could filter by country and exclude by date. But if I don't know the country and collection date of a sample and only have the FASTA file, can I filter by the number of mutations? For example, can I extract the top 10 SARS-CoV-2 genomes that have the fewest mutations compared with RaTG13?
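To make the last request concrete, here is a rough sketch of the kind of filtering I have in mind. It is not a Nextstrain feature as far as I know. It assumes the FASTA file is already a multiple sequence alignment that includes the reference, so that every sequence has the same length and positions line up; the function names (`read_fasta`, `count_differences`, `closest_to_reference`) and the toy record names are made up for illustration.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def count_differences(ref, seq):
    """Count aligned positions where both sequences have A/C/G/T and differ.

    Gaps ('-') and ambiguous calls such as 'N' are skipped, not counted.
    """
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(ref.upper(), seq.upper())
        if a in valid and b in valid and a != b
    )

def closest_to_reference(records, ref_name, k=10):
    """Return the k (difference_count, name) pairs closest to the reference.

    Assumes record names are unique; duplicates would overwrite each other.
    """
    records = dict(records)
    ref = records.pop(ref_name)
    ranked = sorted((count_differences(ref, s), name) for name, s in records.items())
    return ranked[:k]

# Toy aligned input; in practice this would be open("aligned.fasta").
demo = StringIO(""">ref
ACGTACGT
>s1
ACGTACGA
>s2
ACGAACTT
>s3
ACGTNCGT
""")
top = closest_to_reference(read_fasta(demo), "ref", k=2)
print(top)
```

With real data, `ref` would be the RaTG13 record and `k` would be 10. Note that a sequence with many `N` or gap characters looks artificially close under this count, so some minimum coverage filter would probably be needed as well.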
I also have a technical question about using the GISAID data. The following screenshots show the beginning and the end of the genome data for the same SARS-CoV-2 virus that I downloaded from GISAID. The “full” data has 35,563 characters. As we know, the SARS-CoV-2 genome has ~30,000 bases. So what are the extra 5,563 characters? Are they labels from the sequencing library? Toward the end of the first line, the unmasked file reads “actttga…”, so why does the masked file show “-----nnn” there instead? I assume “-” means missing while “n” means the base call failed. However, GISAID only provides FASTA files; there are no FASTQ files with quality scores. So how is this “masked” file generated?
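One check I can run myself is to tally what kinds of characters make up the 35,563. If the file turned out to be an alignment, gap characters could account for the extra length, but that is only my guess and I have not confirmed it. A minimal sketch, with a made-up helper name (`summarize_sequence`) and a toy string standing in for the real record:

```python
from collections import Counter

def summarize_sequence(seq):
    """Tally bases, gap characters, and unknown calls in one sequence string."""
    counts = Counter(seq)
    bases = sum(counts[c] for c in "acgtACGT")
    gaps = counts["-"]
    unknown = counts["n"] + counts["N"]
    # Anything else, for example IUPAC ambiguity codes such as 'r' or 'y'.
    other = len(seq) - bases - gaps - unknown
    return {
        "length": len(seq),
        "bases": bases,
        "gaps": gaps,
        "unknown": unknown,
        "other": other,
    }

# Toy string; the real input would be the sequence lines of one record, joined.
example = "-----nnnacttga--acgt"
summary = summarize_sequence(example)
print(summary)
```

If `bases` comes out near 30,000 and `gaps` near 5,563 for the real record, the extra characters would be padding rather than real sequence.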
Any kind response would be greatly appreciated!