Feedback on some default exclusions for nCoV workflow

cfoster · September 23, 2020, 1:21am

Hi all,

I’ve recently begun testing out the nCoV workflow for my work. Thanks for the great tools and workflow - very useful.

My focus is on researching SARS-CoV-2 from an Australian perspective. I followed the tutorial to define a new Australia-focused build. The analysis ran smoothly, and it was great to visualise the results using Auspice. However, there were fewer Australian genomes in the final tree than I expected.

Evidently I need to modify my build (subsampling schemes etc.) to get the output I want. In any case, I also took a look at defaults/exclude.txt to see if any Australian genomes were being excluded by default, and why. I last “pulled” the GitHub repo on Friday 18th September, and as of that date there are 46 Australian genomes excluded. However, if I’m interpreting the stated reasons for exclusion correctly, I don’t think they are worth excluding by default.
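For anyone wanting to repeat this check, here is a minimal Python sketch of how the exclude list could be scanned for one country. It assumes the file has one strain name per line and that `#` starts a comment, with the most recent full-line comment taken as the reason for the entries below it; that layout is my assumption, not something guaranteed by the workflow. The function name `excluded_strains` and the "EXAMPLE" strain names are made up for illustration.

```python
def excluded_strains(lines, prefix="Australia/"):
    """Yield (strain, reason) for excluded entries whose name starts with `prefix`.

    Assumed layout (not confirmed by the workflow docs): one strain per line,
    '#' starts a comment, and the latest full-line comment is the reason for
    the entries that follow it. An inline comment, if present, wins.
    """
    reason = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Full-line comment: remember it as the current reason.
            reason = line.lstrip("#").strip()
            continue
        strain, _, inline = line.partition("#")
        strain = strain.strip()
        if strain.startswith(prefix):
            yield strain, inline.strip() or reason


# Tiny in-memory stand-in for defaults/exclude.txt (the EXAMPLE names are invented).
sample = """\
# future collection dates
Australia/VIC1948/2020
USA/EXAMPLE1/2020
# some other reason
Australia/EXAMPLE2/2020 # inline note
""".splitlines()

hits = list(excluded_strains(sample))
for strain, reason in hits:
    print(f"{strain}\t{reason}")
```

To run it against a real checkout, replace `sample` with the lines read from the local copy of defaults/exclude.txt.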

For example, 42 genomes are excluded for having “future collection dates”. I checked through these samples, and their collection dates aren’t in the future. An example is “Australia/VIC1948/2020”, which has a collection date of 2020-06-09 (9th June 2020). Four genomes are also excluded for “collection dates from Jan 6 with divergence resembling a much more recent virus”. However, the collection dates for these samples are actually 2020-06-01 (1st June 2020). Based on this, I think the exclusions were caused by a YYYY-DD-MM interpretation of the collection dates, rather than the correct YYYY-MM-DD interpretation. If I’m wrong here, please do correct me.
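To make the suspected mix-up concrete, here is a small Python sketch that parses the two dates quoted above both ways. The helper name `parse_both_ways` is made up for illustration and is not part of the workflow; this only shows what a day/month swap would produce, not how the dates were actually handled upstream.

```python
from datetime import date, datetime


def parse_both_ways(raw):
    """Return (date read as YYYY-MM-DD, date read as YYYY-DD-MM or None).

    The second value is None when the string is not a valid date under the
    swapped reading (for example, when the "month" would be greater than 12).
    """
    ymd = datetime.strptime(raw, "%Y-%m-%d").date()
    try:
        ydm = datetime.strptime(raw, "%Y-%d-%m").date()
    except ValueError:
        ydm = None
    return ymd, ydm


# The two collection dates mentioned in this post.
for raw in ("2020-06-09", "2020-06-01"):
    ymd, ydm = parse_both_ways(raw)
    print(f"{raw}: YYYY-MM-DD -> {ymd:%d %b %Y}, YYYY-DD-MM -> {ydm:%d %b %Y}")
```

Under the swapped reading, 2020-06-01 becomes 6 January 2020, which lines up with the “Jan 6” wording in the exclusion reason, and 2020-06-09 becomes 6 September 2020, which would look like a future date to anyone checking before then.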

I’ll remove these from the default exclusions in my local copy of defaults/exclude.txt. I just thought this feedback might be useful for updating the exclude.txt file in the GitHub repo.

Cheers,
Charles

james · September 25, 2020, 1:08am

Thanks Charles – my guess is that the GISAID metadata has been updated since we flagged them as “future collection dates”, rather than this being caused by a YYYY-DD-MM interpretation.

We should now remove them from the exclude list – I’ll make a GitHub issue to track this.

Thanks!
james

Update: see https://github.com/nextstrain/ncov/issues/492
