Sequence missing after certain dates

wwei54 · January 16, 2024, 3:31pm

Hello, I am using the Nextstrain SARS-CoV-2 pipeline for a build that focuses on a single state. All the sequence data and metadata were downloaded from GISAID. I noticed that my build only contains sequences from before 2022-06, and I can’t figure out why. I confirmed that there are at least 20 sequences per month after 2022-06 with correct metadata. I also know these sequences pass the initial filtering, but they seem to be filtered out in a later step. Can anyone give me some clues to figure this out? Thank you so much!

ncov_Wisconsin_introduction_only_wic_exp.json (1.5 MB)

builds:
  Wisconsin_introduction_only_wic_exp:
    subsampling_scheme: wisconsin # <- use the 'wisconsin' sampling scheme defined below
    region: North America
    country: USA
    division: Wisconsin

subsampling:
  wisconsin:
    # Focal samples for division
    division:
      group_by: "country year month"
      seq_per_group: 10
      #max_sequences: 500
      query: --query "division == '{division}'"
      #sampling_scheme: --no-probabilistic-sampling
      exclude: --exclude-where 'region!={region}' 'country!={country}' 'division!={division}'
  
files:
  auspice_config: "my-ncov-analyses/auspice-config-custom-data.json"

corneliusroemer · January 16, 2024, 5:23pm

Hello! What you’ve shared here is quite limited so it’s very hard to help. I can’t see anything wrong based on what you’ve shared.

It looks like you’re using the ncov workflow or a fork thereof.

It’s generally better to write your own workflow and use ncov as inspiration as opposed to trying to modify ncov to do what you want to do. Ncov is very complicated and does a lot of things you probably don’t need.

My recommendation would thus be to start from scratch with a simple workflow and understand what each rule does and build it up from there.

A good general starting point for Nextstrain builds is the Zika tutorial and the docs in general.

If you want to look at a very simple but production workflow that uses SARS-CoV-2 and GISAID data, you can look at, for instance, my BA.2.86 workflow.

If you do want to figure out the root cause of the missing sequences, you should be able to dive into the logs of the snakemake workflow. augur filter outputs reasons why sequences aren’t included.
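
As a rough sketch of what reading those reasons could look like: the snippet below tallies how many strains were dropped per reason. It assumes a tab-delimited log like the one `augur filter --output-log` writes, with `strain`, `filter` and `kwargs` columns; the sample rows are made up, and whether your copy of the workflow actually writes such a log is something to check in its rules.

```python
import csv
from collections import Counter

def tally_filter_reasons(lines):
    """Count how many strains were dropped per filter reason."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["filter"] for row in reader)

# Made-up log rows for illustration only (assumed column layout).
SAMPLE_LOG = (
    "strain\tfilter\tkwargs\n"
    "s1\tfilter_by_query\t[]\n"
    "s2\tsubsample\t[]\n"
    "s3\tsubsample\t[]\n"
)

reasons = tally_filter_reasons(SAMPLE_LOG.splitlines())
for reason, n in reasons.most_common():
    print(reason, n)
```

If one reason accounts for nearly all of the missing sequences, that is the step to look at first.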

wwei54 · January 16, 2024, 6:41pm

Thank you! I realized all the sequences were excluded because they appear in the excluded_by_diagnostics.txt file. Is there a way that I can skip the ncov workflow?

corneliusroemer · January 16, 2024, 7:09pm

Good debugging! What’s the reason given in the excluded_by_diagnostics.txt file for these sequences?

I’m not 100% sure I understand what you mean by “Is there a way that I can skip the ncov workflow” - do you mean skip that rule so your sequences don’t get excluded?

In general it’s probably better to figure out why the sequences get excluded and fix the root cause for the exclusion. But yes in theory you should be able to tweak the rule so everything gets passed through without excluding anything.

I’m sure you’ve seen this part of the docs: Troubleshoot common issues — SARS-CoV-2 Workflow documentation

wwei54 · January 16, 2024, 8:46pm

I rechecked all the sequences that were excluded; they all have a clock_deviation larger than 20. I am therefore planning to manually raise the threshold in the main workflow so that they are kept in the tree. Is it common to modify the clock_deviation threshold? Thank you.

corneliusroemer · January 16, 2024, 11:16pm

Yep, it can happen that the clock deviation is off. I don’t know exactly how it’s calculated here. Just increase the thresholds to something like 100 in both directions:

--clock-filter-recent 100
--clock-filter 100

github.com/nextstrain/ncov/blob/f79bb86c0ba20f685a877e8fa7fbe803ddcf2d9f/scripts/diagnostic.py#L36-L37

    parser.add_argument("--clock-filter-recent", type=float, default=20, help="maximal allowed deviation from the molecular clock")
    parser.add_argument("--clock-filter", type=float, default=15, help="maximal allowed deviation from the molecular clock")
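
Before changing a threshold, it can help to preview which strains exceed a given value. The sketch below is only an illustration: it applies a plain absolute-value cutoff to a `clock_deviation` column in tab-separated metadata, which is an assumption and not necessarily how `diagnostic.py` combines its two thresholds. The function name and sample rows are made up.

```python
import csv

def strains_over_threshold(lines, threshold):
    """Return (strain, deviation) pairs whose |clock_deviation| exceeds threshold."""
    flagged = []
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            deviation = float(row["clock_deviation"])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric values are skipped, not flagged.
            continue
        if abs(deviation) > threshold:
            flagged.append((row.get("strain"), deviation))
    return flagged

# Made-up rows; real values come from your own metadata.
SAMPLE_METADATA = (
    "strain\tclock_deviation\n"
    "s1\t3.0\n"
    "s2\t24.5\n"
    "s3\t-31.0\n"
    "s4\t?\n"
)

at_20 = strains_over_threshold(SAMPLE_METADATA.splitlines(), 20)
at_100 = strains_over_threshold(SAMPLE_METADATA.splitlines(), 100)
print(len(at_20), "strains over 20;", len(at_100), "strains over 100")
```

Comparing the two lists shows how many sequences a looser threshold would let back in, which is a useful sanity check before editing the workflow.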
