Error while generating the Nextstrain build for only Israel sequences

Hey,
While generating the Nextstrain build with only Israel as the subsampling scheme, I am getting the following error:

Job 17:
determine priority for inclusion in as phylogenetic context by
genetic similiarity to sequences in focal set for build 'asia_israel'.

    python3 scripts/get_distance_to_focal_set.py \
        --reference defaults/reference_seq.fasta \
        --alignment results/filtered_israel-data.fasta.xz \
        --focal-alignment results/asia_israel/sample-country.fasta \
        --ignore-seqs Wuhan/Hu-1/2019 \
        --chunk-size 10000 \
        --output results/asia_israel/proximity_country.tsv 2>&1 | tee logs/subsampling_proximity_asia_israel_country.txt

Logfile logs/subsampling_proximity_asia_israel_country.txt:
Traceback (most recent call last):
  File "/home/vishwajeet/data/ncov/scripts/get_distance_to_focal_set.py", line 155, in <module>
    focal_seqs_dict = calculate_snp_matrix(focal_seqs, consensus = ref, ignore_seqs=args.ignore_seqs)
  File "/home/vishwajeet/data/ncov/scripts/get_distance_to_focal_set.py", line 73, in calculate_snp_matrix
    raise ValueError('Fasta file appears to have sequences of different lengths!')
ValueError: Fasta file appears to have sequences of different lengths!

I ran sanitize_sequences.py and sanitize_metadata.py before the run.

What should I do?

Hi @vrmarathe - can you share the builds.yaml file you are running the workflow with?

I would double check that results/filtered_israel-data.fasta.xz and results/asia_israel/sample-country.fasta both contain sequences and that those sequences are all the same length. (They should be, but it is worth double checking.)

Echoing James, I would check whether both input files are aligned to the reference (as opposed to raw sequences) and have length 29903.
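
One way to run that check is a small script that tallies sequence lengths per file. This is only a sketch using the Python standard library, not part of the ncov workflow; the function names (open_fasta, sequence_lengths, report) are made up for illustration, and 29903 is the reference length mentioned above.

```python
import gzip
import lzma
import sys
from collections import Counter

REFERENCE_LENGTH = 29903  # aligned genome length mentioned in this thread


def open_fasta(path):
    """Open a plain, .gz, or .xz FASTA file in text mode."""
    if path.endswith(".xz"):
        return lzma.open(path, "rt")
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def sequence_lengths(path):
    """Return a Counter mapping sequence length -> number of records."""
    lengths = Counter()
    current = None  # length of the record being read; None before the first header
    with open_fasta(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths[current] += 1
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths[current] += 1
    return lengths


def report(path):
    """Print a per-length tally and flag anything that is not reference length."""
    lengths = sequence_lengths(path)
    print(f"{path}: {sum(lengths.values())} sequences")
    for length, count in sorted(lengths.items()):
        flag = "" if length == REFERENCE_LENGTH else "  <-- not reference length"
        print(f"  length {length}: {count}{flag}")


if __name__ == "__main__":
    # Pass the FASTA paths to inspect on the command line.
    for fasta_path in sys.argv[1:]:
        report(fasta_path)
```

Saved under a hypothetical name such as check_lengths.py, it could be run as python3 check_lengths.py results/filtered_israel-data.fasta.xz results/asia_israel/sample-country.fasta. More than one length in the output, or a length other than 29903, would suggest the sequences are not aligned to the reference.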

Hi @james, I am using the following file. My input is msa_0908.fasta (the aligned file from GISAID), which I ran through the cleaning script from Nextstrain to produce data/sequences_gisaid.fasta.gz, the file given as aligned: below.

# Define input files.
inputs:
  - name: israel-data
    metadata: data/metadata_gisaid.tsv.gz
    aligned: data/sequences_gisaid.fasta.gz

builds:

  asia_israel:
    subsampling_scheme: country
    region: Asia
    country: Israel 
    # Here, USA is in North America
files:
  auspice_config: "my_profiles/example/my_auspice_config.json"
  description: "my_profiles/example/my_description.md"

What should I do to correct this?

Hi @vrmarathe - the way you have defined your inputs is for aligned sequences, but from the error you are getting I’m guessing they are unaligned sequences. I would try changing that section to:

inputs:
  - name: israel-data
    metadata: data/metadata_gisaid.tsv.gz
    sequences: data/sequences_gisaid.fasta.gz

Hi @james, I have successfully run Nextstrain for Israel. Thank you for the help.

Is there a way to skip the tree building for other countries? I am just looking to get the mutation data, such as amino acid and nucleotide mutations.

Sure, there are a bunch of ways you could get this information. You could stop this pipeline at the relevant step (alignment, masking, or filtering, depending on the exact data you want) and examine the alignments; we are using nextalign for this. Alternatively, you could use nextclade, which also uses nextalign internally.
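
As a rough illustration of "examining the alignments", here is a minimal sketch that lists nucleotide substitutions for one aligned sequence against the reference. It assumes both strings are already aligned and of equal length, and it skips gaps and N rather than reporting deletions. The function name nucleotide_mutations is hypothetical and not part of the workflow or of nextclade. Amino acid mutations would additionally need gene coordinates and translation, which nextclade handles for you.

```python
def nucleotide_mutations(reference, aligned):
    """List substitutions such as 'G3C' (1-based positions) between two aligned sequences.

    Assumes `aligned` is already aligned to `reference`, so both have the
    same length. Gaps ('-') and ambiguous bases ('N') are skipped.
    """
    if len(reference) != len(aligned):
        raise ValueError("Sequences must be aligned to the same length")
    mutations = []
    pairs = zip(reference.upper(), aligned.upper())
    for position, (ref_base, base) in enumerate(pairs, start=1):
        if base == ref_base or base in "-N":
            continue
        mutations.append(f"{ref_base}{position}{base}")
    return mutations


# Toy example with made-up sequences: one substitution at position 3,
# a gap at position 6 and an N at position 7 that are both ignored.
print(nucleotide_mutations("ACGTACGT", "ACCTA-NT"))
```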