Why do my sequences end up in excluded_by_diagnostics.txt?

jonr · September 28, 2022, 1:13pm

Hi,
I am running a build with BA.2.75 sequences. I combine three datasets, two from Gisiad and one that I made myself, but for reasons I can’t understand all of my sequences are filtered out. They are not removed due to some query, but are removed by diagnostic.py. I tried to tweak the parameters --clock-filter, --contamination and --snp-clusters 1 but nothings seems to change…

Here’s and example of one of my sequences from the log-file:
SWIFT252231133 filter_by_exclude "[[""exclude_file"", ""results/BA/excluded_by_diagnostics.txt""]]"

How can I better understand why my sequences gets filtered out by diagnostic.py? And can I skip this step altoghether?

Here is my build-file (the BA file contains my filtered out sequences):

use_nextalign: true
genes: ["ORF1a", "ORF1b", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF9b"]

inputs:
  - name: "BA"
    metadata: data/BA.2.75/2022-09-28_BA.tsv
    sequences: data/BA.2.75/2022-09-28_BA.fasta
  - name: "Europe"
    metadata: data/BA.2.75/hcov_europe.tsv
    sequences: data/BA.2.75/hcov_europe.fasta
  - name: "Global"
    metadata: data/BA.2.75/hcov_global.tsv
    sequences: data/BA.2.75/hcov_global.fasta

builds:
  BA:
    subsampling_scheme: BA_scheme

subsampling:
  BA_scheme:
    allFromBA:
      exclude: "--exclude-where 'BA!=yes'"
    europeanBackground:
      query: --query "(Europe=='yes') & (country!='Norway')" # Do not include Norwegian samples
      group_by: "month"
      seq_per_group: 10
    globalBackground:
      query: --query "(Global=='yes') & (region!='Europe')" # Do not include European samples
      group_by: "month"
      seq_per_group: 5

# remove S dropout sequences and sequences without division label in US
filter:
  exclude_where: "division='USA' purpose_of_sequencing='S dropout'"
  # Removing sequences with lower than 99% coverage
  min_length: 25000 # Default is 28500
  # I don't want Omicron sequences that are before autumn 2021.
  min_date: 2021.75

# Define frequencies parameters.
frequencies:
    recent_days_to_censor: 7


# Here, you can specify what type of auspice_config you want to use
# and what description you want. These will apply to all the above builds.
# If you want to specify specific files for each build - you can!
# See the 'example_advanced_customization' builds.yaml

# Change "columns" to "location" for augur traits when having only a single country
diagnostic: 
  skip_inputs_arg=_get_skipped_inputs_for_diagnostic: true
  clock_filter: 20 # 20 default
  snp_clusters: 1 # 1 default
  contamination: 5 # 5 default


traits:
  default:
    sampling_bias_correction: 2.5
    columns: ["country"]

# Change root to South African omicron / BEST
refine:
#  root: "SouthAfrica/NICD-N22102/2021"
  root: "best"

files:
  auspice_config: "my_profiles/my_auspice_config_BA.json"
  description: "my_profiles/omicron/my_description.md"
  colors: "my_profiles/omicron/colors_norwaydivisions.tsv"
  sites_to_mask: "my_profiles/omicron/sites_ignored_for_tree_topology.txt"

And here is the snakemake log (skipped some parts due to character limitation):

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	add_branch_labels
	1	adjust_metadata_regions
	3	align
	1	all
	1	ancestral
	1	annotate_metadata_with_index
	1	build_align
	1	build_description
	1	calculate_epiweeks
	1	clade_files
	1	clades
	1	combine_input_metadata
	1	combine_samples
	1	combine_sequences_for_subsampling
	1	diagnostic
	1	distances
	1	emerging_lineages
	1	export
	1	filter
	1	finalize
	1	include_hcov19_prefix
	1	index
	1	join_metadata_and_nextclade_qc
	1	logistic_growth
	1	mask
	1	mutational_fitness
	1	recency
	1	refine
	1	rename_emerging_lineages
	3	sanitize_metadata
	3	subsample
	1	tip_frequencies
	1	traits
	1	translate
	1	tree
	41

[Wed Sep 28 12:59:04 2022]
rule sanitize_metadata:
    input: data/BA.2.75/2022-09-28_BA.tsv
    output: results/sanitized_metadata_BA.tsv.xz
    log: logs/sanitize_metadata_BA.txt
    jobid: 39
    benchmark: benchmarks/sanitize_metadata_BA.txt
    wildcards: origin=BA
    resources: mem_mb=2000


        python3 scripts/sanitize_metadata.py             --metadata data/BA.2.75/2022-09-28_BA.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_BA.tsv.xz 2>&1 | tee logs/sanitize_metadata_BA.txt
        

[Wed Sep 28 12:59:04 2022]
Job 38: 
        Aligning sequences to defaults/reference_seq.fasta
            - gaps relative to reference are considered real
        


        python3 scripts/sanitize_sequences.py             --sequences data/BA.2.75/hcov_global.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_Global.txt             | nextalign run             --jobs=8             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --output-translations results/translations/seqs_Global.gene.{gene}.fasta             --output-fasta results/aligned_Global.fasta             --output-insertions results/insertions_Global.tsv > logs/align_Global.txt 2>&1;
        xz -2 -T 8 results/aligned_Global.fasta;
        xz -2 -T 8 results/translations/seqs_Global.gene.*.fasta
        

[Wed Sep 28 12:59:04 2022]
rule sanitize_metadata:
    input: data/BA.2.75/hcov_europe.tsv
    output: results/sanitized_metadata_Europe.tsv.xz
    log: logs/sanitize_metadata_Europe.txt
    jobid: 40
    benchmark: benchmarks/sanitize_metadata_Europe.txt
    wildcards: origin=Europe
    resources: mem_mb=2000


        python3 scripts/sanitize_metadata.py             --metadata data/BA.2.75/hcov_europe.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_Europe.tsv.xz 2>&1 | tee logs/sanitize_metadata_Europe.txt
        

[Wed Sep 28 12:59:04 2022]
rule clade_files:
    input: defaults/clades.tsv
    output: results/BA/clades.tsv
    jobid: 24
    benchmark: benchmarks/clade_files_BA.txt
    wildcards: build_name=BA


        cat defaults/clades.tsv > results/BA/clades.tsv
        

[Wed Sep 28 12:59:04 2022]
rule sanitize_metadata:
    input: data/BA.2.75/hcov_global.tsv
    output: results/sanitized_metadata_Global.tsv.xz
    log: logs/sanitize_metadata_Global.txt
    jobid: 41
    benchmark: benchmarks/sanitize_metadata_Global.txt
    wildcards: origin=Global
    resources: mem_mb=2000


        python3 scripts/sanitize_metadata.py             --metadata data/BA.2.75/hcov_global.tsv             --metadata-id-columns strain name 'Virus name'             --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/                          --output results/sanitized_metadata_Global.tsv.xz 2>&1 | tee logs/sanitize_metadata_Global.txt
        
[Wed Sep 28 12:59:04 2022]
Finished job 24.
1 of 41 steps (2%) done

[Wed Sep 28 12:59:04 2022]
Job 18: Templating build description for Auspice

[Wed Sep 28 12:59:04 2022]
Finished job 18.
2 of 41 steps (5%) done
[Wed Sep 28 12:59:05 2022]
Finished job 39.
3 of 41 steps (7%) done
[Wed Sep 28 12:59:05 2022]
Finished job 40.
4 of 41 steps (10%) done
[Wed Sep 28 12:59:05 2022]
Finished job 41.
5 of 41 steps (12%) done

[Wed Sep 28 12:59:05 2022]
Job 32: 
        Combining metadata files results/sanitized_metadata_BA.tsv.xz results/sanitized_metadata_Europe.tsv.xz results/sanitized_metadata_Global.tsv.xz -> results/combined_metadata.tsv.xz and adding columns to represent origin
        


        python3 scripts/combine_metadata.py --metadata results/sanitized_metadata_BA.tsv.xz results/sanitized_metadata_Europe.tsv.xz results/sanitized_metadata_Global.tsv.xz --origins BA Europe Global --output results/combined_metadata.tsv.xz 2>&1 | tee logs/combine_input_metadata.txt
        
[Wed Sep 28 12:59:07 2022]
Finished job 32.
6 of 41 steps (15%) done

[Wed Sep 28 12:59:07 2022]
Job 35: 
        Subsample all sequences by 'globalBackground' scheme for build 'BA' with the following parameters:

         - group by: --group-by month
         - sequences per group: --sequences-per-group 5
         - subsample max sequences: 
         - min-date: 
         - max-date: 
         - 
         - exclude: 
         - include: 
         - query: --query '(Global=='"'"'yes'"'"') & (region!='"'"'Europe'"'"')'
         - priority: 
        


        augur filter             --metadata results/combined_metadata.tsv.xz             --include defaults/include.txt             --exclude defaults/exclude.txt                                                                 --query '(Global=='"'"'yes'"'"') & (region!='"'"'Europe'"'"')'                                       --group-by month             --sequences-per-group 5                                       --output-strains results/BA/sample-globalBackground.txt 2>&1 | tee logs/subsample_BA_globalBackground.txt
        

[Wed Sep 28 12:59:07 2022]
Job 34: 
        Subsample all sequences by 'europeanBackground' scheme for build 'BA' with the following parameters:

         - group by: --group-by month
         - sequences per group: --sequences-per-group 10
         - subsample max sequences: 
         - min-date: 
         - max-date: 
         - 
         - exclude: 
         - include: 
         - query: --query '(Europe=='"'"'yes'"'"') & (country!='"'"'Norway'"'"')'
         - priority: 
        


        augur filter             --metadata results/combined_metadata.tsv.xz             --include defaults/include.txt             --exclude defaults/exclude.txt                                                                 --query '(Europe=='"'"'yes'"'"') & (country!='"'"'Norway'"'"')'                                       --group-by month             --sequences-per-group 10                                       --output-strains results/BA/sample-europeanBackground.txt 2>&1 | tee logs/subsample_BA_europeanBackground.txt
        

[Wed Sep 28 12:59:07 2022]
Job 33: 
        Subsample all sequences by 'allFromBA' scheme for build 'BA' with the following parameters:

         - group by: 
         - sequences per group: 
         - subsample max sequences: 
         - min-date: 
         - max-date: 
         - 
         - exclude: --exclude-where 'BA!=yes'
         - include: 
         - query: 
         - priority: 
        


        augur filter             --metadata results/combined_metadata.tsv.xz             --include defaults/include.txt             --exclude defaults/exclude.txt                                       --exclude-where 'BA!=yes'                                                                                                                     --output-strains results/BA/sample-allFromBA.txt 2>&1 | tee logs/subsample_BA_allFromBA.txt
        
[Wed Sep 28 12:59:09 2022]
Finished job 33.
7 of 41 steps (17%) done
[Wed Sep 28 12:59:09 2022]
Finished job 34.
8 of 41 steps (20%) done
[Wed Sep 28 12:59:09 2022]
Finished job 35.
9 of 41 steps (22%) done
[Wed Sep 28 12:59:10 2022]
Finished job 38.
10 of 41 steps (24%) done

[Wed Sep 28 12:59:10 2022]
Job 37: 
        Aligning sequences to defaults/reference_seq.fasta
            - gaps relative to reference are considered real
        


        python3 scripts/sanitize_sequences.py             --sequences data/BA.2.75/hcov_europe.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_Europe.txt             | nextalign run             --jobs=8             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --output-translations results/translations/seqs_Europe.gene.{gene}.fasta             --output-fasta results/aligned_Europe.fasta             --output-insertions results/insertions_Europe.tsv > logs/align_Europe.txt 2>&1;
        xz -2 -T 8 results/aligned_Europe.fasta;
        xz -2 -T 8 results/translations/seqs_Europe.gene.*.fasta
        
[Wed Sep 28 12:59:13 2022]
Finished job 37.
11 of 41 steps (27%) done

[Wed Sep 28 12:59:13 2022]
Job 36: 
        Aligning sequences to defaults/reference_seq.fasta
            - gaps relative to reference are considered real
        


        python3 scripts/sanitize_sequences.py             --sequences data/BA.2.75/2022-09-28_BA.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_BA.txt             | nextalign run             --jobs=8             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --output-translations results/translations/seqs_BA.gene.{gene}.fasta             --output-fasta results/aligned_BA.fasta             --output-insertions results/insertions_BA.tsv > logs/align_BA.txt 2>&1;
        xz -2 -T 8 results/aligned_BA.fasta;
        xz -2 -T 8 results/translations/seqs_BA.gene.*.fasta
        
[Wed Sep 28 12:59:14 2022]
Finished job 36.
12 of 41 steps (29%) done

[Wed Sep 28 12:59:14 2022]
Job 31: 
        Combine and deduplicate aligned FASTAs from multiple origins in preparation for subsampling.
        


        python3 scripts/sanitize_sequences.py                 --sequences results/aligned_BA.fasta.xz results/aligned_Europe.fasta.xz results/aligned_Global.fasta.xz                 --strip-prefixes hCoV-19/ SARS-CoV-2/                                  --output /dev/stdout                 | xz -c -2 > results/combined_sequences_for_subsampling.fasta.xz
        
[Wed Sep 28 12:59:16 2022]
Finished job 31.
13 of 41 steps (32%) done

[Wed Sep 28 12:59:16 2022]
Job 29: 
        Combine and deduplicate FASTAs
        


        augur filter             --sequences results/combined_sequences_for_subsampling.fasta.xz             --metadata results/combined_metadata.tsv.xz             --exclude-all             --include results/BA/sample-allFromBA.txt results/BA/sample-europeanBackground.txt results/BA/sample-globalBackground.txt             --output-sequences results/BA/BA_subsampled_sequences.fasta.xz             --output-metadata results/BA/BA_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_BA.txt
        
[Wed Sep 28 12:59:18 2022]
Finished job 29.
14 of 41 steps (34%) done

[Wed Sep 28 12:59:18 2022]
Job 23: 
        Running nextclade QC and aligning sequences
            - gaps relative to reference are considered real
        


        python3 scripts/sanitize_sequences.py             --sequences results/BA/BA_subsampled_sequences.fasta.xz             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_before_nextclade_BA.txt             | nextclade run             --jobs 8             --input-dataset data/sars-cov-2-nextclade-defaults.zip             --output-tsv results/BA/nextclade_qc.tsv             --output-fasta results/BA/aligned.fasta             --output-translations results/BA/translations/aligned.gene.{gene}.fasta             --output-insertions results/BA/insertions.tsv 2>&1 | tee logs/align_BA.txt
        
[Wed Sep 28 12:59:19 2022]
Finished job 23.
15 of 41 steps (37%) done

[Wed Sep 28 12:59:20 2022]
rule join_metadata_and_nextclade_qc:
    input: results/BA/BA_subsampled_metadata.tsv.xz, results/BA/nextclade_qc.tsv
    output: results/BA/metadata_with_nextclade_qc.tsv
    log: logs/join_metadata_and_nextclade_qc_BA.txt
    jobid: 27
    benchmark: benchmarks/join_metadata_and_nextclade_qc_BA.txt
    wildcards: build_name=BA


        python3 scripts/join-metadata-and-clades.py             results/BA/BA_subsampled_metadata.tsv.xz             results/BA/nextclade_qc.tsv             -o results/BA/metadata_with_nextclade_qc.tsv 2>&1 | tee logs/join_metadata_and_nextclade_qc_BA.txt
        

[Wed Sep 28 12:59:20 2022]
Job 25: 
        Mask bases in alignment results/BA/aligned.fasta
          - masking 100 from beginning
          - masking 200 from end
          - masking other sites: 21987 21846
        


        python3 scripts/mask-alignment.py             --alignment results/BA/aligned.fasta             --mask-from-beginning 100             --mask-from-end 200             --mask-sites 21987 21846             --mask-terminal-gaps             --output results/BA/masked.fasta 2> logs/mask_BA.txt
        
[Wed Sep 28 12:59:20 2022]
Finished job 27.
16 of 41 steps (39%) done

[Wed Sep 28 12:59:20 2022]
Job 26: Scanning metadata results/BA/metadata_with_nextclade_qc.tsv for problematic sequences. Removing sequences with >20 deviation from the clock and with more than 1.


        python3 scripts/diagnostic.py             --metadata results/BA/metadata_with_nextclade_qc.tsv             --clock-filter 20             --contamination 5             --snp-clusters 1                          --output-exclusion-list results/BA/excluded_by_diagnostics.txt 2>&1 | tee logs/diagnostics_BA.txt
        
[Wed Sep 28 12:59:21 2022]
Finished job 26.
17 of 41 steps (41%) done
[Wed Sep 28 12:59:21 2022]
Finished job 25.
18 of 41 steps (44%) done

[Wed Sep 28 12:59:21 2022]
Job 28: 
        Index sequence composition.
        


        augur index             --sequences results/BA/masked.fasta             --output results/BA/sequence_index.tsv 2>&1 | tee logs/index_sequences_BA.txt
        
[Wed Sep 28 12:59:22 2022]
Finished job 28.
19 of 41 steps (46%) done

[Wed Sep 28 12:59:22 2022]
rule annotate_metadata_with_index:
    input: results/BA/metadata_with_nextclade_qc.tsv, results/BA/sequence_index.tsv
    output: results/BA/metadata_with_index.tsv
    log: logs/annotate_metadata_with_index_BA.txt
    jobid: 22
    benchmark: benchmarks/annotate_metadata_with_index_BA.txt
    wildcards: build_name=BA


        python3 scripts/annotate_metadata_with_index.py             --metadata results/BA/metadata_with_nextclade_qc.tsv             --sequence-index results/BA/sequence_index.tsv             --output results/BA/metadata_with_index.tsv
        
[Wed Sep 28 12:59:23 2022]
Finished job 22.
20 of 41 steps (49%) done

[Wed Sep 28 12:59:23 2022]
Job 7: 
        Adjusting metadata for build 'BA'
        


        python3 scripts/adjust_regional_meta.py             --region global             --metadata results/BA/metadata_with_index.tsv             --output results/BA/metadata_adjusted.tsv.xz 2>&1 | tee logs/adjust_metadata_regions_BA.txt
        

[Wed Sep 28 12:59:23 2022]
Job 21: 
        Filtering alignment results/BA/masked.fasta -> results/BA/filtered.fasta
          - excluding strains in defaults/exclude.txt results/BA/excluded_by_diagnostics.txt
          - including strains in defaults/include.txt
          - min length: --query '(`BA` == '"'"'yes'"'"' & _length >= 25000) | (`Europe` == '"'"'yes'"'"' & _length >= 25000) | (`Global` == '"'"'yes'"'"' & _length >= 25000)'
        


        augur filter             --sequences results/BA/masked.fasta             --metadata results/BA/metadata_with_index.tsv             --include defaults/include.txt             --query '(`BA` == '"'"'yes'"'"' & _length >= 25000) | (`Europe` == '"'"'yes'"'"' & _length >= 25000) | (`Global` == '"'"'yes'"'"' & _length >= 25000)'             --max-date 2022-09-29             --min-date 2021.75             --exclude-ambiguous-dates-by any             --exclude defaults/exclude.txt results/BA/excluded_by_diagnostics.txt             --exclude-where division='USA' purpose_of_sequencing='S dropout'            --output results/BA/filtered.fasta             --output-log results/BA/filtered_log.tsv 2>&1 | tee logs/filtered_BA.txt;
        
[Wed Sep 28 12:59:24 2022]
Finished job 7.
21 of 41 steps (51%) done

[Wed Sep 28 12:59:24 2022]
rule calculate_epiweeks:
    input: results/BA/metadata_adjusted.tsv.xz
    output: results/BA/epiweeks.json
    log: logs/calculate_epiweeks_BA.txt
    jobid: 17
    benchmark: benchmarks/calculate_epiweeks_BA.txt
    wildcards: build_name=BA


        python3 scripts/calculate_epiweek.py             --metadata results/BA/metadata_adjusted.tsv.xz             --metadata-id-columns strain name 'Virus name'             --output-node-data results/BA/epiweeks.json 2>&1 | tee logs/calculate_epiweeks_BA.txt
        

[Wed Sep 28 12:59:24 2022]
Job 12: Use metadata on submission date to construct submission recency field


        python3 scripts/construct-recency-from-submission-date.py             --metadata results/BA/metadata_adjusted.tsv.xz             --output results/BA/recency.json 2>&1 | tee logs/recency_BA.txt
        
[Wed Sep 28 12:59:25 2022]
Finished job 21.
22 of 41 steps (54%) done

[Wed Sep 28 12:59:25 2022]
Job 20: Building tree


        augur tree             --alignment results/BA/filtered.fasta             --tree-builder-args '-ninit 10 -n 4'             --exclude-sites my_profiles/omicron/sites_ignored_for_tree_topology.txt             --output results/BA/tree_raw.nwk             --nthreads 8 2>&1 | tee logs/tree_BA.txt
        
[Wed Sep 28 12:59:25 2022]
Finished job 17.

joverlee · October 6, 2022, 6:05pm

Hi @jonr,

The diagnostics rule produces a log file that should contain more details on why your samples were excluded. This should be logs/diagnostics_BA.txt for your specific build.

If you want to skip the diagnostic step, you can specify this in your build file by adding skip_diagnostics to your filter config:

# remove S dropout sequences and sequences without division label in US
filter:
  exclude_where: "division='USA' purpose_of_sequencing='S dropout'"
  # Removing sequences with lower than 99% coverage
  min_length: 25000 # Default is 28500
  # I don't want Omicron sequences that are before autumn 2021.
  min_date: 2021.75
  # Skip diagnostics rule
  skip_diagnostics: True

jlhudd · October 17, 2022, 9:00pm

Unfortunately, the diagnostics script does not write out the reasons why specific strains appear in the list of excluded strains, so the log file won’t be helpful here. Disabling diagnostics, as @joverlee recommends, is the best approach to get around these excluded sequences. You can then inspect your alignment and tree manually for problematic sequences (or even run the diagnostic script manually) and then exclude those sequences with an exclude file.

jonr · October 18, 2022, 9:53am

Ok, I see. Thanks a lot both!

Topic		Replies	Views
Diagnostic.py or Mask-alignment.py filtered my important sequences Help and Getting Started	0	482	October 8, 2020
Exclusion of forced sequences after augur filter step - seasonalflu build General	4	508	January 30, 2023
Diagnosing error + filtering issues Help and Getting Started	14	1617	November 9, 2020
Sequence missing after certain dates General	5	225	January 16, 2024
Problems creating a SARS-CoV-2 BA.5 build Help and Getting Started	3	436	June 28, 2022

Why do my sequences end up in excluded_by_diagnostics.txt?

Related topics