Error in rule proximity_score: Fasta file appears to have sequences of different lengths!

Hi everyone,

I ran into the following error:

Error in rule proximity_score:
    jobid: 19
    output: results/November_metadata/proximity_country.tsv
    log: logs/subsampling_proximity_November_metadata_country.txt (check log file(s) for error message)
    shell:
        
        python3 scripts/get_distance_to_focal_set.py \
            --reference defaults/reference_seq.fasta \
            --alignment results/filtered_November-data.fasta.xz \
            --focal-alignment results/November_metadata/sample-country.fasta \
            --ignore-seqs Wuhan/Hu-1/2019 \
            --chunk-size 200000 \
            --output results/November_metadata/proximity_country.tsv 2>&1 | tee logs/subsampling_proximity_November_metadata_country.txt
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile logs/subsampling_proximity_November_metadata_country.txt:
Reading the alignments. 0
Done finding closest matches.
Reading the alignments. 200000
Done finding closest matches.
Reading the alignments. 400000
Done finding closest matches.
Reading the alignments. 600000
Done finding closest matches.
Reading the alignments. 800000
Done finding closest matches.
Reading the alignments. 1000000
Done finding closest matches.
Reading the alignments. 1200000
Done finding closest matches.
Reading the alignments. 1400000
Done finding closest matches.
Reading the alignments. 1600000
Done finding closest matches.
Reading the alignments. 1800000
Done finding closest matches.
Reading the alignments. 2000000
Done finding closest matches.
Reading the alignments. 2200000
Done finding closest matches.
Reading the alignments. 2400000
Done finding closest matches.
Reading the alignments. 2600000
Done finding closest matches.
Reading the alignments. 2800000
Done finding closest matches.
Reading the alignments. 3000000
Done finding closest matches.
Reading the alignments. 3200000
Done finding closest matches.
Traceback (most recent call last):
  File "/lustre/scratch2/ws/1/dmse952c-p_cov2muta/ncov_covid/scripts/get_distance_to_focal_set.py", line 174, in <module>
    context_seqs_dict = calculate_snp_matrix(seqs, consensus=ref, chunk_size=chunk_size)
  File "/lustre/scratch2/ws/1/dmse952c-p_cov2muta/ncov_covid/scripts/get_distance_to_focal_set.py", line 73, in calculate_snp_matrix
    raise ValueError('Fasta file appears to have sequences of different lengths!')
ValueError: Fasta file appears to have sequences of different lengths!


Removing output files of failed job proximity_score since they might be corrupted:
results/November_metadata/proximity_country.tsv
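For context, the check that fails seems to be a simple length comparison along these lines (my simplified sketch, not the actual code in scripts/get_distance_to_focal_set.py):

```python
# Simplified sketch of the length check that raises this error;
# not the actual code in scripts/get_distance_to_focal_set.py.
def check_equal_lengths(seqs):
    """Raise if the aligned sequences do not all share one length."""
    lengths = {len(s) for s in seqs}
    if len(lengths) > 1:
        raise ValueError('Fasta file appears to have sequences of different lengths!')
    return lengths.pop()
```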

I’ve read this discussion, and as far as I understand, to overcome this problem I need to change something in my builds.yaml file.

Here is my builds.yaml:

genes: ["ORF1a", "ORF1b", "S", "ORF3a", "M", "N"]

inputs:
  - name: November-data
    sequences: /projects/p_cov2muta/data/sequences.fasta
    metadata: /projects/p_cov2muta/data/metadata_latin.tsv



builds:

  November_metadata:
    subsampling_scheme: custom_division
    region: Europe
    country: Germany
    division: Saxony

traits:
  default:
     sampling_bias_correction: 2.5
     columns: ["country","division"]

files:
  auspice_config: "my_profiles/Saxony/auspice_config.json"
  description: "my_profiles/example/my_description.md"

subsampling:
  custom_division:
    division:
      group_by: "year month"
      max_sequences: 10000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region!={region}' 'country!={country}' 'division!={division}'"
    country:
      group_by: "division year month"
      max_sequences: 4400
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country!={country}' 'division={division}'"
    region:
      group_by: "country year month"
      max_sequences: 3000
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    global:
      group_by: "country year month"
      max_sequences: 1100
      sampling_scheme: "--probabilistic-sampling"
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

The sequences.fasta and metadata_latin.tsv files (the ones referenced in builds.yaml) were downloaded from GISAID.
The sanitize_metadata.py and sanitize_sequences.py steps were run as part of the workflow, and their output files are stored in the results/ folder.

Could you please help me figure out what I should change in builds.yaml to solve this problem?

Thank you in advance.

Best,
Dmitrii

Hmm. Nothing immediately stands out as incorrect in that config. Could you double check that the sequences in defaults/reference_seq.fasta, results/filtered_November-data.fasta.xz, results/November_metadata/sample-country.fasta are all the same length?

Hi @james,

I’ve checked the files.
The sequence length in defaults/reference_seq.fasta: 29903
The sequence lengths in results/November_metadata/sample-country.fasta: 29903
The sequence lengths in results/filtered_November-data.fasta.xz: 29903 and 29509

There is only one sequence with the shorter length:
>USA/MN-CDC-IBX251965809732/2021: 29509
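For reference, the lengths can be listed with a small script along these lines (a sketch; it assumes a plain xz-compressed FASTA file):

```python
import lzma

def fasta_lengths(path):
    """Yield (name, length) pairs for each record in an xz-compressed FASTA."""
    with lzma.open(path, 'rt') as fh:
        name, length = None, 0
        for line in fh:
            line = line.rstrip('\n')
            if line.startswith('>'):
                if name is not None:
                    yield name, length
                name, length = line[1:], 0
            else:
                length += len(line)
        if name is not None:
            yield name, length
```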

Should I just exclude this sequence from results/filtered_November-data.fasta.xz?

Best,
Dmitrii

Hi Dmitrii,

How were the sequences in results/filtered_November-data.fasta.xz aligned?

best,
richard

Hi Richard,

I got results/filtered_November-data.fasta.xz like this:

Job 9: 
        Filtering alignment results/masked_November-data.fasta.xz -> results/filtered_November-data.fasta.xz
          - excluding strains in defaults/exclude.txt results/to-exclude_November-data.txt
          - including strains in defaults/include.txt
          - min length: 27000
        


        augur filter \
            --sequences results/masked_November-data.fasta.xz \
            --metadata results/sanitized_metadata_November-data.tsv.xz \
            --include defaults/include.txt \
            --max-date 2021-11-10 \
            --min-date 2019.74 \
            --exclude-ambiguous-dates-by any \
            --exclude defaults/exclude.txt results/to-exclude_November-data.txt \
            --exclude-where division='USA' \
            --min-length 27000 \
            --output results/filtered_November-data.fasta 2>&1 | tee logs/filtered_November-data.txt;
        xz -2 results/filtered_November-data.fasta

Alignment was done in the following way:

Job 11: 
        Aligning sequences to defaults/reference_seq.fasta
            - gaps relative to reference are considered real
        


        python3 scripts/sanitize_sequences.py \
            --sequences /projects/p_cov2muta/data/sequences.fasta \
            --strip-prefixes hCoV-19/ SARS-CoV-2/ \
            --output /dev/stdout 2> logs/sanitize_sequences_November-data.txt \
            | nextalign \
            --jobs=28 \
            --reference defaults/reference_seq.fasta \
            --genemap defaults/annotation.gff \
            --genes ORF1a,ORF1b,S,ORF3a,M,N \
            --sequences /dev/stdin \
            --output-dir results/translations \
            --output-basename seqs_November-data \
            --output-fasta results/aligned_November-data.fasta \
            --output-insertions results/insertions_November-data.tsv > logs/align_November-data.txt 2>&1;
        xz -2 results/aligned_November-data.fasta;
        xz -2 results/translations/seqs_November-data*.fasta

As you can see, everything was done according to the Nextstrain workflow.

Best,
Dmitrii

Should I just exclude this sequence from results/filtered_November-data.fasta.xz ?

In the short term, yes. This should allow the pipeline to proceed. Longer term, we need to work out what’s going on… could you tell us which nextalign version you are using (nextalign --version)?
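One way to drop the sequence, as a plain-Python sketch (within the ncov workflow, the cleaner fix is usually to add the strain name to defaults/exclude.txt, which the augur filter step shown above already reads via --exclude):

```python
import lzma

# Strain reported with the shorter length (29509 instead of 29903).
DROP = {'USA/MN-CDC-IBX251965809732/2021'}

def filter_fasta(in_path, out_path, drop):
    """Copy an xz-compressed FASTA, skipping records whose name is in `drop`."""
    with lzma.open(in_path, 'rt') as fin, lzma.open(out_path, 'wt') as fout:
        keep = True
        for line in fin:
            if line.startswith('>'):
                keep = line[1:].strip() not in drop
            if keep:
                fout.write(line)
```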

The version of nextalign is 1.4.0.

Could you check whether the sequences in results/aligned_November-data.fasta are all the same length?

There is only one sequence with a different length in results/aligned_November-data.fasta:

USA/MN-CDC-IBX251965809732/2021: 29509

After removing this sequence from the files, everything works fine.