This is copied from a Q&A via email, for general visibility in case anyone has similar questions:
Q: Can the nextclade generated QC fields (eg: QC_overall_status) be used as filters or query/exclude values in a subsampling scheme? Or are those values generated after subsampling? If so, is there a way to filter on those fields after subsampling?
A: In the default workflow, we only run Nextclade on the subsampled data (to produce both alignments and QC annotations), so you can’t use these values in a subsampling scheme. Instead, we filter on these fields after subsampling, through the diagnostic rule that flags low-quality data for exclusion.
After the subsampled data are aligned, there is a final filter rule that has access to the Nextclade QC metrics through the metadata. You could use the “exclude_where” parameter to apply filters based on the QC values. This parameter passes through to augur filter’s --exclude-where argument, which is not as full-featured as the --query argument, but you can write simple rules like this:
filter:
  exclude_where: division='USA' QC_overall_status='bad'
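To make the behavior of that rule concrete, here is a rough Python sketch of how I understand --exclude-where to treat multiple conditions: a record is dropped if it matches any one of them (OR, not AND). The helper names (`parse_condition`, `is_excluded`) and the toy records are made up for illustration; this is not augur's actual code, and it uses exact string matching as a simplification.

```python
def parse_condition(condition):
    """Split "column=value" or "column!=value" into (column, value, negate).

    Surrounding quotes are stripped, mimicking what the shell would do
    before the argument reaches augur (an assumption of this sketch).
    """
    if "!=" in condition:
        column, value = condition.split("!=", 1)
        return column, value.strip("'\""), True
    column, value = condition.split("=", 1)
    return column, value.strip("'\""), False


def is_excluded(record, conditions):
    """Return True if the record matches ANY of the conditions (OR semantics)."""
    for condition in conditions:
        column, value, negate = parse_condition(condition)
        matches = record.get(column) == value
        # For "=", exclude on a match; for "!=", exclude on a non-match.
        if matches != negate:
            return True
    return False


# Hypothetical metadata records, for illustration only.
records = [
    {"strain": "A", "division": "USA", "QC_overall_status": "good"},
    {"strain": "B", "division": "Washington", "QC_overall_status": "bad"},
    {"strain": "C", "division": "Washington", "QC_overall_status": "good"},
]
conditions = ["division='USA'", "QC_overall_status='bad'"]

kept = [r["strain"] for r in records if not is_excluded(r, conditions)]
print(kept)
```

In this toy example only strain C survives, because A matches the division condition and B matches the QC condition. If you need AND logic or numeric comparisons, that is where a replacement filter rule using --query would give you more control.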
You could also drop in a replacement “filter” rule with the ruleorder trick you’ve been using, to get more control over that final filter step.
Or, you could run Nextclade on the full input data and annotate your metadata with the resulting QC columns prior to running the workflow. Then you could reference those QC columns from subsampling rules just like any other metadata.
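As a minimal sketch of that annotation step, assuming you already have a Nextclade TSV for the full dataset: the snippet below left-joins selected QC columns onto the metadata by strain name, using only the standard library. The column names are assumptions you should check against your own files (`seqName` and `qc.overallStatus` in the Nextclade output, `strain` in the metadata), and `annotate_metadata` is a hypothetical helper, not part of the workflow.

```python
import csv
import io


def annotate_metadata(metadata_tsv, nextclade_tsv, qc_columns):
    """Left-join Nextclade QC columns onto metadata rows by strain name.

    metadata_tsv, nextclade_tsv: open text files (or any iterable of lines).
    qc_columns: maps a Nextclade column name to the metadata column to write.
    Returns the annotated metadata as a TSV string.
    """
    # Index the Nextclade rows by sequence name (assumed column: "seqName").
    nextclade_rows = csv.DictReader(nextclade_tsv, delimiter="\t")
    qc_by_strain = {row["seqName"]: row for row in nextclade_rows}

    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    new_columns = [c for c in qc_columns.values() if c not in reader.fieldnames]
    fieldnames = list(reader.fieldnames) + new_columns

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Strains without a Nextclade result get empty QC values.
        qc = qc_by_strain.get(row["strain"], {})
        for source, destination in qc_columns.items():
            row[destination] = qc.get(source, "")
        writer.writerow(row)
    return out.getvalue()


# Tiny in-memory example; real use would pass open file handles instead.
metadata = "strain\tdivision\nA\tUSA\nB\tWashington\n"
nextclade = "seqName\tqc.overallStatus\nA\tgood\n"

annotated = annotate_metadata(
    io.StringIO(metadata),
    io.StringIO(nextclade),
    {"qc.overallStatus": "QC_overall_status"},
)
print(annotated)
```

Writing the result out as your workflow's input metadata would then let a subsampling scheme refer to QC_overall_status like any other column.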
Do any of these solutions sound like they’d work for what you’re trying to do?