I’ve recently begun testing out the nCoV workflow for my work purposes. Thanks for the great tools and workflow - very useful.
My focus is on researching SARS-CoV-2 from an Australian perspective. I followed the tutorial to define a new Australia-focused build. The analysis ran smoothly, and it was great to visualise the results using Auspice. However, there were fewer Australian genomes in the final tree than I expected.
Evidently I need to modify my build (subsampling schemes etc.) to get the output I want. In any case, I also took a look at defaults/exclude.txt to see whether any Australian genomes were being excluded by default, and why. I last pulled the GitHub repo on Friday 18th September, and as of that date 46 Australian genomes are excluded. However, if I’m interpreting the stated reasons correctly, I don’t think these genomes warrant exclusion by default.
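In case it helps anyone repeat the check, here is a minimal Python sketch of how the Australian entries could be pulled out of an exclusion list. It assumes one strain name per line with `#` starting a comment, which is my reading of the file rather than a documented format. The function name `australian_exclusions` and the second sample entry are made up for illustration.

```python
def australian_exclusions(lines, prefix="Australia/"):
    """Return strain names that start with `prefix`, skipping comments and blank lines."""
    strains = []
    for line in lines:
        # Drop anything after a '#' (assumed comment marker) and trim whitespace.
        name = line.split("#", 1)[0].strip()
        if name.startswith(prefix):
            strains.append(name)
    return strains


# Small in-memory sample; only the Australian strain name comes from this post,
# the other entry is a hypothetical placeholder.
sample = [
    "# future collection dates",
    "Australia/VIC1948/2020",
    "Hypothetical/XYZ01/2020  # made-up entry",
    "",
]
print(australian_exclusions(sample))
```

To run it against a local checkout, the same function can be given an open file handle, for example `australian_exclusions(open("defaults/exclude.txt"))`.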
For example, 42 genomes are excluded for having “future collection dates”. I checked through these samples, and their collection dates aren’t in the future. An example is “Australia/VIC1948/2020”, which has a collection date of 2020-06-09 (9th June 2020). Four more genomes are excluded for “collection dates from Jan 6 with divergence resembling a much more recent virus”. However, the collection dates for these samples are actually 2020-06-01 (1st June 2020). Based on this, I think the exclusions arose from a YYYY-DD-MM interpretation of the collection dates, rather than the correct YYYY-MM-DD interpretation: read that way, 2020-06-09 becomes 6th September 2020 and 2020-06-01 becomes 6th January 2020. If I’m wrong here, please do correct me.
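To illustrate the suspected mix-up, here is a minimal Python sketch that parses the two dates above both ways using the standard library. It only demonstrates the swap; I don't know how the dates were actually read when the exclusions were made. The helper name `parse_both_ways` is my own.

```python
from datetime import datetime


def parse_both_ways(date_string):
    """Parse a date string as YYYY-MM-DD (intended) and as YYYY-DD-MM (suspected misreading)."""
    intended = datetime.strptime(date_string, "%Y-%m-%d").date()
    swapped = datetime.strptime(date_string, "%Y-%d-%m").date()
    return intended, swapped


for date_string in ("2020-06-09", "2020-06-01"):
    intended, swapped = parse_both_ways(date_string)
    print(f"{date_string}: intended {intended.isoformat()}, swapped {swapped.isoformat()}")
```

Under the swapped reading, 2020-06-09 comes out as 2020-09-06 and 2020-06-01 comes out as 2020-01-06, which would match the "Jan 6" exclusion reason.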
I’ll remove these from the default exclusions in my local copy of defaults/exclude.txt. I just thought this feedback might be useful for updating the exclude.txt file in the GitHub repo.