Sequences missing after a certain date

Hello, I am using the Nextstrain SARS-CoV-2 pipeline for a build that focuses on a single state. All the data and metadata were downloaded from GISAID. I realized that my build only contains sequences from before 2022-06, and I can't figure out why. I confirmed that there are at least 20 sequences per month after 2022-06 with the correct metadata. I also know that these sequences pass the initial filtering, but they seem to be filtered out in a later step. Can anyone give me some clues to figure this out? Thank you so much!

ncov_Wisconsin_introduction_only_wic_exp.json (1.5 MB)

builds:
  Wisconsin_introduction_only_wic_exp:
    subsampling_scheme: wisconsin # <- use the 'wisconsin' sampling scheme defined below
    region: North America
    country: USA
    division: Wisconsin

subsampling:
  wisconsin:
    # Focal samples for division
    division:
      group_by: "country year month"
      seq_per_group: 10
      #max_sequences: 500
      query: --query "division == '{division}'"
      #sampling_scheme: --no-probabilistic-sampling
      exclude: --exclude-where 'region!={region}' 'country!={country}' 'division!={division}'
  
files:
  auspice_config: "my-ncov-analyses/auspice-config-custom-data.json"

Hello! What you’ve shared here is quite limited so it’s very hard to help. I can’t see anything wrong based on what you’ve shared.

It looks like you’re using the ncov workflow or a fork thereof.

It’s generally better to write your own workflow and use ncov as inspiration as opposed to trying to modify ncov to do what you want to do. Ncov is very complicated and does a lot of things you probably don’t need.

My recommendation would thus be to start from scratch with a simple workflow and understand what each rule does and build it up from there.

A good general starting point for Nextstrain builds is the Zika tutorial and the docs in general.

If you want to look at a very simple workflow that is nonetheless used in production and works with SARS-CoV-2 and GISAID data, you can look at, for instance, my BA.2.86 workflow.

If you do want to figure out the root cause of the missing sequences, you should be able to dive into the logs of the Snakemake workflow. augur filter reports the reason why each sequence was not included.
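As a rough sketch of how those reasons could be tallied: augur filter can write a tab-delimited log via its --output-log option, with one row per dropped strain and a `filter` column naming the reason. The snippet below assumes that layout (`strain`, `filter`, `kwargs` columns); the strain names and filter values in the example are made up for illustration, and whether your ncov logs contain such a file depends on your setup.

```python
import csv
import io
from collections import Counter


def count_filter_reasons(log_text: str) -> Counter:
    """Count how many strains each filter removed, given the text of a filter log.

    Assumes a tab-delimited log with a header row that includes a 'filter' column.
    """
    reader = csv.DictReader(io.StringIO(log_text), delimiter="\t")
    return Counter(row["filter"] for row in reader)


# Hypothetical log contents, for illustration only.
example_log = (
    "strain\tfilter\tkwargs\n"
    "USA/WI-example-1/2022\tfilter_by_exclude\t[]\n"
    "USA/WI-example-2/2022\tfilter_by_exclude\t[]\n"
    "USA/WI-example-3/2022\tsubsampling\t[]\n"
)

# Print the reasons from most to least frequent.
for reason, n in count_filter_reasons(example_log).most_common():
    print(f"{reason}\t{n}")
```

With a real log you would pass the file's contents instead of `example_log`. If one reason dominates for the post-2022-06 strains, that points to the step to look at.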

Thank you! I realized all the sequences were excluded because they appear in the excluded_by_diagnostics.txt file. Is there a way that I can skip the ncov workflow?
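For anyone who wants to run the same check, here is a minimal sketch of cross-referencing the exclusion list against the metadata to see which collection months are affected. It assumes the exclusion file lists one strain name per line (lines starting with `#` are skipped) and that the metadata is a TSV with `strain` and `date` columns in YYYY-MM-DD format; the strain names and dates below are invented for illustration.

```python
import csv
import io
from collections import Counter


def excluded_per_month(exclusion_text: str, metadata_text: str) -> Counter:
    """Count excluded strains per collection month (YYYY-MM)."""
    # Collect strain names, ignoring blank lines and comment lines.
    excluded = {
        line.strip()
        for line in exclusion_text.splitlines()
        if line.strip() and not line.startswith("#")
    }
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    # Keep only the YYYY-MM prefix of each excluded strain's date.
    return Counter(row["date"][:7] for row in reader if row["strain"] in excluded)


# Hypothetical inputs, for illustration only.
example_exclusions = "USA/WI-example-2/2022\nUSA/WI-example-3/2022\n"
example_metadata = (
    "strain\tdate\n"
    "USA/WI-example-1/2022\t2022-05-10\n"
    "USA/WI-example-2/2022\t2022-07-02\n"
    "USA/WI-example-3/2022\t2022-07-19\n"
)

for month, n in sorted(excluded_per_month(example_exclusions, example_metadata).items()):
    print(f"{month}\t{n}")
```

If nearly every strain from a given month onward shows up in the counts, the diagnostics step is the likely cause of the gap, not the subsampling.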

Good debugging! What’s the reason given in the excluded_by_diagnostics.txt file for these sequences?

I’m not 100% sure I understand what you mean by “Is there a way that I can skip the ncov workflow” - do you mean skip that rule so your sequences don’t get excluded?

In general it’s probably better to figure out why the sequences get excluded and fix the root cause for the exclusion. But yes in theory you should be able to tweak the rule so everything gets passed through without excluding anything.

I’m sure you’ve seen this part of the docs: Troubleshoot common issues — SARS-CoV-2 Workflow documentation

I rechecked all the sequences that were excluded, and they all have a clock_deviation larger than 20. I am therefore planning to manually update that number in the main workflow so that they are kept in the tree. Is it common to modify the clock_deviation threshold? Thank you.

Yep, it can happen that the clock deviation is off. I don't know exactly how it's calculated here. Just increase the thresholds to something like 100 in both directions:

--clock-filter-recent 100
--clock-filter 100