Error in rule proximity_score: Fasta file appears to have sequences of different lengths!

qwerty123 · November 10, 2021, 11:30pm

Hi everyone,

I faced the following error:

Error in rule proximity_score:
    jobid: 19
    output: results/November_metadata/proximity_country.tsv
    log: logs/subsampling_proximity_November_metadata_country.txt (check log file(s) for error message)
    shell:
        
        python3 scripts/get_distance_to_focal_set.py             --reference defaults/reference_seq.fasta             --alignment results/filtered_November-data.fasta.xz             --focal-alignment results/November_metadata/sample-country.fasta             --ignore-seqs Wuhan/Hu-1/2019             --chunk-size 200000             --output results/November_metadata/proximity_country.tsv 2>&1 | tee logs/subsampling_proximity_November_metadata_country.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/subsampling_proximity_November_metadata_country.txt:
Reading the alignments. 0
Done finding closest matches.
Reading the alignments. 200000
Done finding closest matches.
Reading the alignments. 400000
Done finding closest matches.
Reading the alignments. 600000
Done finding closest matches.
Reading the alignments. 800000
Done finding closest matches.
Reading the alignments. 1000000
Done finding closest matches.
Reading the alignments. 1200000
Done finding closest matches.
Reading the alignments. 1400000
Done finding closest matches.
Reading the alignments. 1600000
Done finding closest matches.
Reading the alignments. 1800000
Done finding closest matches.
Reading the alignments. 2000000
Done finding closest matches.
Reading the alignments. 2200000
Done finding closest matches.
Reading the alignments. 2400000
Done finding closest matches.
Reading the alignments. 2600000
Done finding closest matches.
Reading the alignments. 2800000
Done finding closest matches.
Reading the alignments. 3000000
Done finding closest matches.
Reading the alignments. 3200000
Done finding closest matches.
Traceback (most recent call last):
  File "/lustre/scratch2/ws/1/dmse952c-p_cov2muta/ncov_covid/scripts/get_distance_to_focal_set.py", line 174, in <module>
    context_seqs_dict = calculate_snp_matrix(seqs, consensus=ref, chunk_size=chunk_size)
  File "/lustre/scratch2/ws/1/dmse952c-p_cov2muta/ncov_covid/scripts/get_distance_to_focal_set.py", line 73, in calculate_snp_matrix
    raise ValueError('Fasta file appears to have sequences of different lengths!')
ValueError: Fasta file appears to have sequences of different lengths!


Removing output files of failed job proximity_score since they might be corrupted:
results/November_metadata/proximity_country.tsv

I’ve read this discussion, and, as far as I understood, to overcome this problem I have to change something in my builds.yaml file.

Here is my builds.yaml:

genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]

inputs:
  - name: November-data
    sequences: /projects/p_cov2muta/data/sequences.fasta
    metadata: /projects/p_cov2muta/data/metadata_latin.tsv



builds:

  November_metadata:
    subsampling_scheme: custom_division
    region: Europe
    country: Germany
    division: Saxony

traits:
  default:
     sampling_bias_correction: 2.5
     columns: ["country","division"]

files:
  auspice_config: "my_profiles/Saxony/auspice_config.json"
  description: "my_profiles/example/my_description.md"

subsampling:
  custom_division:
    division:
      group_by: "year month"
      max_sequences: 10000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region!={region}' 'country!={country}' 'division!={division}'"
    country:
      group_by: "division year month"
      max_sequences: 4400
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country!={country}' 'division={division}'"
    region:
      group_by: "country year month"
      max_sequences: 3000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    global:
      group_by: "country year month"
      max_sequences: 1100
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

The sequences.fasta and metadata_latin.tsv (the ones that are written in builds.yaml) are files that were downloaded from GISAID.
Sanitize_metadata.py and sanitize_sequences.py steps were done as a part of the workflow and output files are stored in results/ folder.

Could you please help me, what should I change in builds.yaml to solve the problem?

Thank you in advance.

Best,
Dmitrii

james · November 11, 2021, 2:20am

Hmm. Nothing immediately stands out as incorrect in that config. Could you double check that the sequences in defaults/reference_seq.fasta, results/filtered_November-data.fasta.xz, results/November_metadata/sample-country.fasta are all the same length?

qwerty123 · November 11, 2021, 10:42am

Hi @james,

I’ve checked the files.
The sequence length in defaults/reference_seq.fasta: 29903
The sequences length in results/November_metadata/sample-country.fasta: 29903
The sequences length in results/filtered_November-data.fasta.xz: 29903 and 29509

There is only one sequence with the shorter length:
>USA/MN-CDC-IBX251965809732/2021: 29509

Should I just exclude this sequence from results/filtered_November-data.fasta.xz?

Best,
Dmitrii

rneher · November 11, 2021, 11:02am

Hi Dmitrii,

the sequences in results/filtered_November-data.fasta.xz, how have they been aligned?

best,
richard

qwerty123 · November 11, 2021, 7:56pm

Hi Richard,

I got results/filtered_November-data.fasta.xz like this:

Job 9: 
        Filtering alignment results/masked_November-data.fasta.xz -> results/filtered_November-data.fasta.xz
          - excluding strains in defaults/exclude.txt results/to-exclude_November-data.txt
          - including strains in defaults/include.txt
          - min length: 27000
        


        augur filter             --sequences results/masked_November-data.fasta.xz             --metadata results/sanitized_metadata_November-data.tsv.xz             --include defaults/include.txt             --max-date 2021-11-10             --min-date 2019.74             --exclude-ambiguous-dates-by any             --exclude defaults/exclude.txt results/to-exclude_November-data.txt             --exclude-where division='USA'            --min-length 27000             --output results/filtered_November-data.fasta 2>&1 | tee logs/filtered_November-data.txt;
        xz -2 results/filtered_November-data.fasta

Alignment was done in the following way:

Job 11: 
        Aligning sequences to defaults/reference_seq.fasta
            - gaps relative to reference are considered real
        


        python3 scripts/sanitize_sequences.py             --sequences /projects/p_cov2muta/data/sequences.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_November-data.txt             | nextalign             --jobs=28             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --genes ORF1a,ORF1b,S,ORF3a,M,N             --sequences /dev/stdin             --output-dir results/translations             --output-basename seqs_November-data             --output-fasta results/aligned_November-data.fasta             --output-insertions results/insertions_November-data.tsv > logs/align_November-data.txt 2>&1;
        xz -2 results/aligned_November-data.fasta;
        xz -2 results/translations/seqs_November-data*.fasta

As you can see everything was done according to nextstrain workflow.

Best,
Dmitrii

james · November 11, 2021, 8:05pm

Should I just exclude this sequence from results/filtered_November-data.fasta.xz ?

In the short term, yes. This should allow the pipeline to proceed. Longer term we need to work out what’s going on… could you tell us the nextalign version you are using (nextalign --version)?

qwerty123 · November 11, 2021, 8:43pm

The version of nextalign is 1.4.0

rneher · November 13, 2021, 8:57am

Could you check whether the results/aligned_November-data.fasta is all the same length?

qwerty123 · November 14, 2021, 1:34pm

There is only one sequence with different length in results/aligned_November-data.fasta:

USA/MN-CDC-IBX251965809732/2021
29509

And after the removal of this sequence from the files everything works fine.

Topic		Replies	Views
Error while generating the nextstrain for only Israel Sequences Help and Getting Started	6	481	November 3, 2021
Error in rule align- sequence length Help and Getting Started	9	619	July 12, 2021
Error in rule proximity_score Help and Getting Started	6	820	November 24, 2020
Error: Alignment must have at least 3 sequences Help and Getting Started	9	1042	November 7, 2024
Error in rule align: argument '--jobs' Help and Getting Started	7	458	February 28, 2023

Error in rule proximity_score: Fasta file appears to have sequences of different lengths!

Related topics