Hi everyone,
I faced the following error:
Error in rule proximity_score:
jobid: 19
output: results/November_metadata/proximity_country.tsv
log: logs/subsampling_proximity_November_metadata_country.txt (check log file(s) for error message)
shell:
python3 scripts/get_distance_to_focal_set.py --reference defaults/reference_seq.fasta --alignment results/filtered_November-data.fasta.xz --focal-alignment results/November_metadata/sample-country.fasta --ignore-seqs Wuhan/Hu-1/2019 --chunk-size 200000 --output results/November_metadata/proximity_country.tsv 2>&1 | tee logs/subsampling_proximity_November_metadata_country.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/subsampling_proximity_November_metadata_country.txt:
Reading the alignments. 0
Done finding closest matches.
Reading the alignments. 200000
Done finding closest matches.
Reading the alignments. 400000
Done finding closest matches.
Reading the alignments. 600000
Done finding closest matches.
Reading the alignments. 800000
Done finding closest matches.
Reading the alignments. 1000000
Done finding closest matches.
Reading the alignments. 1200000
Done finding closest matches.
Reading the alignments. 1400000
Done finding closest matches.
Reading the alignments. 1600000
Done finding closest matches.
Reading the alignments. 1800000
Done finding closest matches.
Reading the alignments. 2000000
Done finding closest matches.
Reading the alignments. 2200000
Done finding closest matches.
Reading the alignments. 2400000
Done finding closest matches.
Reading the alignments. 2600000
Done finding closest matches.
Reading the alignments. 2800000
Done finding closest matches.
Reading the alignments. 3000000
Done finding closest matches.
Reading the alignments. 3200000
Done finding closest matches.
Traceback (most recent call last):
File "/lustre/scratch2/ws/1/dmse952c-p_cov2muta/ncov_covid/scripts/get_distance_to_focal_set.py", line 174, in <module>
context_seqs_dict = calculate_snp_matrix(seqs, consensus=ref, chunk_size=chunk_size)
File "/lustre/scratch2/ws/1/dmse952c-p_cov2muta/ncov_covid/scripts/get_distance_to_focal_set.py", line 73, in calculate_snp_matrix
raise ValueError('Fasta file appears to have sequences of different lengths!')
ValueError: Fasta file appears to have sequences of different lengths!
Removing output files of failed job proximity_score since they might be corrupted:
results/November_metadata/proximity_country.tsv
I’ve read this discussion, and, as far as I understood, to overcome this problem I have to change something in my builds.yaml file.
Here is my builds.yaml:
genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]
inputs:
- name: November-data
sequences: /projects/p_cov2muta/data/sequences.fasta
metadata: /projects/p_cov2muta/data/metadata_latin.tsv
builds:
November_metadata:
subsampling_scheme: custom_division
region: Europe
country: Germany
division: Saxony
traits:
default:
sampling_bias_correction: 2.5
columns: ["country","division"]
files:
auspice_config: "my_profiles/Saxony/auspice_config.json"
description: "my_profiles/example/my_description.md"
subsampling:
custom_division:
division:
group_by: "year month"
max_sequences: 10000
sampling_scheme: "--probabilistic-sampling"
exclude: "--exclude-where 'region!={region}' 'country!={country}' 'division!={division}'"
country:
group_by: "division year month"
max_sequences: 4400
sampling_scheme: "--probabilistic-sampling"
exclude: "--exclude-where 'country!={country}' 'division={division}'"
region:
group_by: "country year month"
max_sequences: 3000
sampling_scheme: "--probabilistic-sampling"
exclude: "--exclude-where 'country={country}' 'region!={region}'"
priorities:
type: "proximity"
focus: "country"
global:
group_by: "country year month"
max_sequences: 1100
sampling_scheme: "--probabilistic-sampling"
exclude: "--exclude-where 'region={region}'"
priorities:
type: "proximity"
focus: "country"
The sequences.fasta and metadata_latin.tsv (the ones that are written in builds.yaml) are files that were downloaded from GISAID.
Sanitize_metadata.py and sanitize_sequences.py steps were done as a part of the workflow and output files are stored in results/ folder.
Could you please help me, what should I change in builds.yaml to solve the problem?
Thank you in advance.
Best,
Dmitrii