Hi,
I am running a build with BA.2.75 sequences. I combine three datasets, two from Gisiad and one that I made myself, but for reasons I can’t understand all of my sequences are filtered out. They are not removed due to some query, but are removed by diagnostic.py. I tried to tweak the parameters --clock-filter
, --contamination and --snp-clusters 1
but nothings seems to change…
Here’s and example of one of my sequences from the log-file:
SWIFT252231133 filter_by_exclude "[[""exclude_file"", ""results/BA/excluded_by_diagnostics.txt""]]"
How can I better understand why my sequences gets filtered out by diagnostic.py? And can I skip this step altoghether?
Here is my build-file (the BA file contains my filtered out sequences):
use_nextalign: true
genes: ["ORF1a", "ORF1b", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF9b"]
inputs:
- name: "BA"
metadata: data/BA.2.75/2022-09-28_BA.tsv
sequences: data/BA.2.75/2022-09-28_BA.fasta
- name: "Europe"
metadata: data/BA.2.75/hcov_europe.tsv
sequences: data/BA.2.75/hcov_europe.fasta
- name: "Global"
metadata: data/BA.2.75/hcov_global.tsv
sequences: data/BA.2.75/hcov_global.fasta
builds:
BA:
subsampling_scheme: BA_scheme
subsampling:
BA_scheme:
allFromBA:
exclude: "--exclude-where 'BA!=yes'"
europeanBackground:
query: --query "(Europe=='yes') & (country!='Norway')" # Do not include Norwegian samples
group_by: "month"
seq_per_group: 10
globalBackground:
query: --query "(Global=='yes') & (region!='Europe')" # Do not include European samples
group_by: "month"
seq_per_group: 5
# remove S dropout sequences and sequences without division label in US
filter:
exclude_where: "division='USA' purpose_of_sequencing='S dropout'"
# Removing sequences with lower than 99% coverage
min_length: 25000 # Default is 28500
# I don't want Omicron sequences that are before autumn 2021.
min_date: 2021.75
# Define frequencies parameters.
frequencies:
recent_days_to_censor: 7
# Here, you can specify what type of auspice_config you want to use
# and what description you want. These will apply to all the above builds.
# If you want to specify specific files for each build - you can!
# See the 'example_advanced_customization' builds.yaml
# Change "columns" to "location" for augur traits when having only a single country
diagnostic:
skip_inputs_arg=_get_skipped_inputs_for_diagnostic: true
clock_filter: 20 # 20 default
snp_clusters: 1 # 1 default
contamination: 5 # 5 default
traits:
default:
sampling_bias_correction: 2.5
columns: ["country"]
# Change root to South African omicron / BEST
refine:
# root: "SouthAfrica/NICD-N22102/2021"
root: "best"
files:
auspice_config: "my_profiles/my_auspice_config_BA.json"
description: "my_profiles/omicron/my_description.md"
colors: "my_profiles/omicron/colors_norwaydivisions.tsv"
sites_to_mask: "my_profiles/omicron/sites_ignored_for_tree_topology.txt"
And here is the snakemake log (skipped some parts due to character limitation):
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 add_branch_labels
1 adjust_metadata_regions
3 align
1 all
1 ancestral
1 annotate_metadata_with_index
1 build_align
1 build_description
1 calculate_epiweeks
1 clade_files
1 clades
1 combine_input_metadata
1 combine_samples
1 combine_sequences_for_subsampling
1 diagnostic
1 distances
1 emerging_lineages
1 export
1 filter
1 finalize
1 include_hcov19_prefix
1 index
1 join_metadata_and_nextclade_qc
1 logistic_growth
1 mask
1 mutational_fitness
1 recency
1 refine
1 rename_emerging_lineages
3 sanitize_metadata
3 subsample
1 tip_frequencies
1 traits
1 translate
1 tree
41
[Wed Sep 28 12:59:04 2022]
rule sanitize_metadata:
input: data/BA.2.75/2022-09-28_BA.tsv
output: results/sanitized_metadata_BA.tsv.xz
log: logs/sanitize_metadata_BA.txt
jobid: 39
benchmark: benchmarks/sanitize_metadata_BA.txt
wildcards: origin=BA
resources: mem_mb=2000
python3 scripts/sanitize_metadata.py --metadata data/BA.2.75/2022-09-28_BA.tsv --metadata-id-columns strain name 'Virus name' --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession --parse-location-field Location --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content --strip-prefixes hCoV-19/ SARS-CoV-2/ --output results/sanitized_metadata_BA.tsv.xz 2>&1 | tee logs/sanitize_metadata_BA.txt
[Wed Sep 28 12:59:04 2022]
Job 38:
Aligning sequences to defaults/reference_seq.fasta
- gaps relative to reference are considered real
python3 scripts/sanitize_sequences.py --sequences data/BA.2.75/hcov_global.fasta --strip-prefixes hCoV-19/ SARS-CoV-2/ --output /dev/stdout 2> logs/sanitize_sequences_Global.txt | nextalign run --jobs=8 --reference defaults/reference_seq.fasta --genemap defaults/annotation.gff --output-translations results/translations/seqs_Global.gene.{gene}.fasta --output-fasta results/aligned_Global.fasta --output-insertions results/insertions_Global.tsv > logs/align_Global.txt 2>&1;
xz -2 -T 8 results/aligned_Global.fasta;
xz -2 -T 8 results/translations/seqs_Global.gene.*.fasta
[Wed Sep 28 12:59:04 2022]
rule sanitize_metadata:
input: data/BA.2.75/hcov_europe.tsv
output: results/sanitized_metadata_Europe.tsv.xz
log: logs/sanitize_metadata_Europe.txt
jobid: 40
benchmark: benchmarks/sanitize_metadata_Europe.txt
wildcards: origin=Europe
resources: mem_mb=2000
python3 scripts/sanitize_metadata.py --metadata data/BA.2.75/hcov_europe.tsv --metadata-id-columns strain name 'Virus name' --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession --parse-location-field Location --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content --strip-prefixes hCoV-19/ SARS-CoV-2/ --output results/sanitized_metadata_Europe.tsv.xz 2>&1 | tee logs/sanitize_metadata_Europe.txt
[Wed Sep 28 12:59:04 2022]
rule clade_files:
input: defaults/clades.tsv
output: results/BA/clades.tsv
jobid: 24
benchmark: benchmarks/clade_files_BA.txt
wildcards: build_name=BA
cat defaults/clades.tsv > results/BA/clades.tsv
[Wed Sep 28 12:59:04 2022]
rule sanitize_metadata:
input: data/BA.2.75/hcov_global.tsv
output: results/sanitized_metadata_Global.tsv.xz
log: logs/sanitize_metadata_Global.txt
jobid: 41
benchmark: benchmarks/sanitize_metadata_Global.txt
wildcards: origin=Global
resources: mem_mb=2000
python3 scripts/sanitize_metadata.py --metadata data/BA.2.75/hcov_global.tsv --metadata-id-columns strain name 'Virus name' --database-id-columns 'Accession ID' gisaid_epi_isl genbank_accession --parse-location-field Location --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' pangolin_lineage=pango_lineage Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aaSubstitutions' 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content --strip-prefixes hCoV-19/ SARS-CoV-2/ --output results/sanitized_metadata_Global.tsv.xz 2>&1 | tee logs/sanitize_metadata_Global.txt
[Wed Sep 28 12:59:04 2022]
Finished job 24.
1 of 41 steps (2%) done
[Wed Sep 28 12:59:04 2022]
Job 18: Templating build description for Auspice
[Wed Sep 28 12:59:04 2022]
Finished job 18.
2 of 41 steps (5%) done
[Wed Sep 28 12:59:05 2022]
Finished job 39.
3 of 41 steps (7%) done
[Wed Sep 28 12:59:05 2022]
Finished job 40.
4 of 41 steps (10%) done
[Wed Sep 28 12:59:05 2022]
Finished job 41.
5 of 41 steps (12%) done
[Wed Sep 28 12:59:05 2022]
Job 32:
Combining metadata files results/sanitized_metadata_BA.tsv.xz results/sanitized_metadata_Europe.tsv.xz results/sanitized_metadata_Global.tsv.xz -> results/combined_metadata.tsv.xz and adding columns to represent origin
python3 scripts/combine_metadata.py --metadata results/sanitized_metadata_BA.tsv.xz results/sanitized_metadata_Europe.tsv.xz results/sanitized_metadata_Global.tsv.xz --origins BA Europe Global --output results/combined_metadata.tsv.xz 2>&1 | tee logs/combine_input_metadata.txt
[Wed Sep 28 12:59:07 2022]
Finished job 32.
6 of 41 steps (15%) done
[Wed Sep 28 12:59:07 2022]
Job 35:
Subsample all sequences by 'globalBackground' scheme for build 'BA' with the following parameters:
- group by: --group-by month
- sequences per group: --sequences-per-group 5
- subsample max sequences:
- min-date:
- max-date:
-
- exclude:
- include:
- query: --query '(Global=='"'"'yes'"'"') & (region!='"'"'Europe'"'"')'
- priority:
augur filter --metadata results/combined_metadata.tsv.xz --include defaults/include.txt --exclude defaults/exclude.txt --query '(Global=='"'"'yes'"'"') & (region!='"'"'Europe'"'"')' --group-by month --sequences-per-group 5 --output-strains results/BA/sample-globalBackground.txt 2>&1 | tee logs/subsample_BA_globalBackground.txt
[Wed Sep 28 12:59:07 2022]
Job 34:
Subsample all sequences by 'europeanBackground' scheme for build 'BA' with the following parameters:
- group by: --group-by month
- sequences per group: --sequences-per-group 10
- subsample max sequences:
- min-date:
- max-date:
-
- exclude:
- include:
- query: --query '(Europe=='"'"'yes'"'"') & (country!='"'"'Norway'"'"')'
- priority:
augur filter --metadata results/combined_metadata.tsv.xz --include defaults/include.txt --exclude defaults/exclude.txt --query '(Europe=='"'"'yes'"'"') & (country!='"'"'Norway'"'"')' --group-by month --sequences-per-group 10 --output-strains results/BA/sample-europeanBackground.txt 2>&1 | tee logs/subsample_BA_europeanBackground.txt
[Wed Sep 28 12:59:07 2022]
Job 33:
Subsample all sequences by 'allFromBA' scheme for build 'BA' with the following parameters:
- group by:
- sequences per group:
- subsample max sequences:
- min-date:
- max-date:
-
- exclude: --exclude-where 'BA!=yes'
- include:
- query:
- priority:
augur filter --metadata results/combined_metadata.tsv.xz --include defaults/include.txt --exclude defaults/exclude.txt --exclude-where 'BA!=yes' --output-strains results/BA/sample-allFromBA.txt 2>&1 | tee logs/subsample_BA_allFromBA.txt
[Wed Sep 28 12:59:09 2022]
Finished job 33.
7 of 41 steps (17%) done
[Wed Sep 28 12:59:09 2022]
Finished job 34.
8 of 41 steps (20%) done
[Wed Sep 28 12:59:09 2022]
Finished job 35.
9 of 41 steps (22%) done
[Wed Sep 28 12:59:10 2022]
Finished job 38.
10 of 41 steps (24%) done
[Wed Sep 28 12:59:10 2022]
Job 37:
Aligning sequences to defaults/reference_seq.fasta
- gaps relative to reference are considered real
python3 scripts/sanitize_sequences.py --sequences data/BA.2.75/hcov_europe.fasta --strip-prefixes hCoV-19/ SARS-CoV-2/ --output /dev/stdout 2> logs/sanitize_sequences_Europe.txt | nextalign run --jobs=8 --reference defaults/reference_seq.fasta --genemap defaults/annotation.gff --output-translations results/translations/seqs_Europe.gene.{gene}.fasta --output-fasta results/aligned_Europe.fasta --output-insertions results/insertions_Europe.tsv > logs/align_Europe.txt 2>&1;
xz -2 -T 8 results/aligned_Europe.fasta;
xz -2 -T 8 results/translations/seqs_Europe.gene.*.fasta
[Wed Sep 28 12:59:13 2022]
Finished job 37.
11 of 41 steps (27%) done
[Wed Sep 28 12:59:13 2022]
Job 36:
Aligning sequences to defaults/reference_seq.fasta
- gaps relative to reference are considered real
python3 scripts/sanitize_sequences.py --sequences data/BA.2.75/2022-09-28_BA.fasta --strip-prefixes hCoV-19/ SARS-CoV-2/ --output /dev/stdout 2> logs/sanitize_sequences_BA.txt | nextalign run --jobs=8 --reference defaults/reference_seq.fasta --genemap defaults/annotation.gff --output-translations results/translations/seqs_BA.gene.{gene}.fasta --output-fasta results/aligned_BA.fasta --output-insertions results/insertions_BA.tsv > logs/align_BA.txt 2>&1;
xz -2 -T 8 results/aligned_BA.fasta;
xz -2 -T 8 results/translations/seqs_BA.gene.*.fasta
[Wed Sep 28 12:59:14 2022]
Finished job 36.
12 of 41 steps (29%) done
[Wed Sep 28 12:59:14 2022]
Job 31:
Combine and deduplicate aligned FASTAs from multiple origins in preparation for subsampling.
python3 scripts/sanitize_sequences.py --sequences results/aligned_BA.fasta.xz results/aligned_Europe.fasta.xz results/aligned_Global.fasta.xz --strip-prefixes hCoV-19/ SARS-CoV-2/ --output /dev/stdout | xz -c -2 > results/combined_sequences_for_subsampling.fasta.xz
[Wed Sep 28 12:59:16 2022]
Finished job 31.
13 of 41 steps (32%) done
[Wed Sep 28 12:59:16 2022]
Job 29:
Combine and deduplicate FASTAs
augur filter --sequences results/combined_sequences_for_subsampling.fasta.xz --metadata results/combined_metadata.tsv.xz --exclude-all --include results/BA/sample-allFromBA.txt results/BA/sample-europeanBackground.txt results/BA/sample-globalBackground.txt --output-sequences results/BA/BA_subsampled_sequences.fasta.xz --output-metadata results/BA/BA_subsampled_metadata.tsv.xz 2>&1 | tee logs/subsample_regions_BA.txt
[Wed Sep 28 12:59:18 2022]
Finished job 29.
14 of 41 steps (34%) done
[Wed Sep 28 12:59:18 2022]
Job 23:
Running nextclade QC and aligning sequences
- gaps relative to reference are considered real
python3 scripts/sanitize_sequences.py --sequences results/BA/BA_subsampled_sequences.fasta.xz --strip-prefixes hCoV-19/ SARS-CoV-2/ --output /dev/stdout 2> logs/sanitize_sequences_before_nextclade_BA.txt | nextclade run --jobs 8 --input-dataset data/sars-cov-2-nextclade-defaults.zip --output-tsv results/BA/nextclade_qc.tsv --output-fasta results/BA/aligned.fasta --output-translations results/BA/translations/aligned.gene.{gene}.fasta --output-insertions results/BA/insertions.tsv 2>&1 | tee logs/align_BA.txt
[Wed Sep 28 12:59:19 2022]
Finished job 23.
15 of 41 steps (37%) done
[Wed Sep 28 12:59:20 2022]
rule join_metadata_and_nextclade_qc:
input: results/BA/BA_subsampled_metadata.tsv.xz, results/BA/nextclade_qc.tsv
output: results/BA/metadata_with_nextclade_qc.tsv
log: logs/join_metadata_and_nextclade_qc_BA.txt
jobid: 27
benchmark: benchmarks/join_metadata_and_nextclade_qc_BA.txt
wildcards: build_name=BA
python3 scripts/join-metadata-and-clades.py results/BA/BA_subsampled_metadata.tsv.xz results/BA/nextclade_qc.tsv -o results/BA/metadata_with_nextclade_qc.tsv 2>&1 | tee logs/join_metadata_and_nextclade_qc_BA.txt
[Wed Sep 28 12:59:20 2022]
Job 25:
Mask bases in alignment results/BA/aligned.fasta
- masking 100 from beginning
- masking 200 from end
- masking other sites: 21987 21846
python3 scripts/mask-alignment.py --alignment results/BA/aligned.fasta --mask-from-beginning 100 --mask-from-end 200 --mask-sites 21987 21846 --mask-terminal-gaps --output results/BA/masked.fasta 2> logs/mask_BA.txt
[Wed Sep 28 12:59:20 2022]
Finished job 27.
16 of 41 steps (39%) done
[Wed Sep 28 12:59:20 2022]
Job 26: Scanning metadata results/BA/metadata_with_nextclade_qc.tsv for problematic sequences. Removing sequences with >20 deviation from the clock and with more than 1.
python3 scripts/diagnostic.py --metadata results/BA/metadata_with_nextclade_qc.tsv --clock-filter 20 --contamination 5 --snp-clusters 1 --output-exclusion-list results/BA/excluded_by_diagnostics.txt 2>&1 | tee logs/diagnostics_BA.txt
[Wed Sep 28 12:59:21 2022]
Finished job 26.
17 of 41 steps (41%) done
[Wed Sep 28 12:59:21 2022]
Finished job 25.
18 of 41 steps (44%) done
[Wed Sep 28 12:59:21 2022]
Job 28:
Index sequence composition.
augur index --sequences results/BA/masked.fasta --output results/BA/sequence_index.tsv 2>&1 | tee logs/index_sequences_BA.txt
[Wed Sep 28 12:59:22 2022]
Finished job 28.
19 of 41 steps (46%) done
[Wed Sep 28 12:59:22 2022]
rule annotate_metadata_with_index:
input: results/BA/metadata_with_nextclade_qc.tsv, results/BA/sequence_index.tsv
output: results/BA/metadata_with_index.tsv
log: logs/annotate_metadata_with_index_BA.txt
jobid: 22
benchmark: benchmarks/annotate_metadata_with_index_BA.txt
wildcards: build_name=BA
python3 scripts/annotate_metadata_with_index.py --metadata results/BA/metadata_with_nextclade_qc.tsv --sequence-index results/BA/sequence_index.tsv --output results/BA/metadata_with_index.tsv
[Wed Sep 28 12:59:23 2022]
Finished job 22.
20 of 41 steps (49%) done
[Wed Sep 28 12:59:23 2022]
Job 7:
Adjusting metadata for build 'BA'
python3 scripts/adjust_regional_meta.py --region global --metadata results/BA/metadata_with_index.tsv --output results/BA/metadata_adjusted.tsv.xz 2>&1 | tee logs/adjust_metadata_regions_BA.txt
[Wed Sep 28 12:59:23 2022]
Job 21:
Filtering alignment results/BA/masked.fasta -> results/BA/filtered.fasta
- excluding strains in defaults/exclude.txt results/BA/excluded_by_diagnostics.txt
- including strains in defaults/include.txt
- min length: --query '(`BA` == '"'"'yes'"'"' & _length >= 25000) | (`Europe` == '"'"'yes'"'"' & _length >= 25000) | (`Global` == '"'"'yes'"'"' & _length >= 25000)'
augur filter --sequences results/BA/masked.fasta --metadata results/BA/metadata_with_index.tsv --include defaults/include.txt --query '(`BA` == '"'"'yes'"'"' & _length >= 25000) | (`Europe` == '"'"'yes'"'"' & _length >= 25000) | (`Global` == '"'"'yes'"'"' & _length >= 25000)' --max-date 2022-09-29 --min-date 2021.75 --exclude-ambiguous-dates-by any --exclude defaults/exclude.txt results/BA/excluded_by_diagnostics.txt --exclude-where division='USA' purpose_of_sequencing='S dropout' --output results/BA/filtered.fasta --output-log results/BA/filtered_log.tsv 2>&1 | tee logs/filtered_BA.txt;
[Wed Sep 28 12:59:24 2022]
Finished job 7.
21 of 41 steps (51%) done
[Wed Sep 28 12:59:24 2022]
rule calculate_epiweeks:
input: results/BA/metadata_adjusted.tsv.xz
output: results/BA/epiweeks.json
log: logs/calculate_epiweeks_BA.txt
jobid: 17
benchmark: benchmarks/calculate_epiweeks_BA.txt
wildcards: build_name=BA
python3 scripts/calculate_epiweek.py --metadata results/BA/metadata_adjusted.tsv.xz --metadata-id-columns strain name 'Virus name' --output-node-data results/BA/epiweeks.json 2>&1 | tee logs/calculate_epiweeks_BA.txt
[Wed Sep 28 12:59:24 2022]
Job 12: Use metadata on submission date to construct submission recency field
python3 scripts/construct-recency-from-submission-date.py --metadata results/BA/metadata_adjusted.tsv.xz --output results/BA/recency.json 2>&1 | tee logs/recency_BA.txt
[Wed Sep 28 12:59:25 2022]
Finished job 21.
22 of 41 steps (54%) done
[Wed Sep 28 12:59:25 2022]
Job 20: Building tree
augur tree --alignment results/BA/filtered.fasta --tree-builder-args '-ninit 10 -n 4' --exclude-sites my_profiles/omicron/sites_ignored_for_tree_topology.txt --output results/BA/tree_raw.nwk --nthreads 8 2>&1 | tee logs/tree_BA.txt
[Wed Sep 28 12:59:25 2022]
Finished job 17.