Error in rule align- sequence length

Hello,
I am trying to perform a global subsampling for Colombia using the GISAID data. However, after running for 2.5-3.00 hrs I get the following error for. a huge number of sequences. Given below is an example. Could you please help me with this.

Error in rule align:
jobid: 11
output: results/aligned_data.fasta.xz, results/insertions_data.tsv, results/translations/seqs_data.gene.ORF1a.fasta.xz, results/translations/seqs_data.gene.ORF1b.fasta.xz, results/translations/seqs_data.gene.S.fasta.xz, results/translations/seqs_data.gene.ORF3a.fasta.xz, results/translations/seqs_data.gene.M.fasta.xz, results/translations/seqs_data.gene.N.fasta.xz
log: logs/align_data.txt (check log file(s) for error message)
shell:

    python3 scripts/sanitize_sequences.py             --sequences data/sequences.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_data.txt             | nextalign             --jobs=8             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --genes ORF1a,ORF1b,S,ORF3a,M,N             --sequences /dev/stdin             --output-dir results/translations             --output-basename seqs_data             --output-fasta results/aligned_data.fasta             --output-insertions results/insertions_data.tsv > logs/align_data.txt 2>&1;
    xz -2 results/aligned_data.fasta;
    xz -2 results/translations/seqs_data*.fasta
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/align_data.txt:
Warning: in sequence “Australia/VIC909/2020”: When processing gene “ORF1a”: When extracting gene “ORF1a”: Genes are expected to have length that is a multiple of 3, but the extracted Gene “ORF1a” after being stripped from insertions has length 13187. Before stripping insertions this gene had length 13203. The gene map contained the following information: start: 265, end: 13468, length: 13203. Note that this gene will not be included in the results of the sequence
Warning: in sequence “Australia/VIC909/2020”: When processing gene “ORF1b”: When extracting gene “ORF1b”: Genes are expected to have length that is a multiple of 3, but the extracted Gene “ORF1b” after being stripped from insertions has length 8081. Before stripping insertions this gene had length 8088. The gene map contained the following information: start: 13467, end: 21555, length: 8088. Note that this gene will not be included in the results of the sequence

Hi @atariq1,

these genes have frameshift mutations. The nucleotide sequences should be aligned correctly, but translations for frameshifted genes are omitted. later versions should have less noisy error messages.

I hope this helps,
richard

Thanks. When you say later version, you mean the later versions of nextstrain?

Best wishes,
Amna

What should I do to fix this issue? Every time I try to run nextstrain I get this error.

the warnings you see are unrelated to the error. Are there any other errors in the logs? Does the first part of the command finish successfully?

Other than these large number of warnings I receive the following error:

Error in rule align:
jobid: 11
output: results/aligned_data.fasta.xz, results/insertions_data.tsv, results/translations/seqs_data.gene.ORF1a.fasta.xz, results/translations/seqs_data.gene.ORF1b.fasta.xz, results/translations/seqs_data.gene.S.fasta.xz, results/translations/seqs_data.gene.ORF3a.fasta.xz, results/translations/seqs_data.gene.M.fasta.xz, results/translations/seqs_data.gene.N.fasta.xz
log: logs/align_data.txt (check log file(s) for error message)
shell:

    python3 scripts/sanitize_sequences.py             --sequences data/sequences.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_data.txt             | nextalign             --jobs=8             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --genes ORF1a,ORF1b,S,ORF3a,M,N             --sequences /dev/stdin             --output-dir results/translations             --output-basename seqs_data             --output-fasta results/aligned_data.fasta             --output-insertions results/insertions_data.tsv > logs/align_data.txt 2>&1;
    xz -2 results/aligned_data.fasta;
    xz -2 results/translations/seqs_data*.fasta
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/align_data.txt:

Could you please help me with this.

This is the complete picture of what I get in my terminal

Select jobs to execute…

[Sat Jul 3 00:42:25 2021]
Job 11:
Aligning sequences to defaults/reference_seq.fasta
- gaps relative to reference are considered real

    python3 scripts/sanitize_sequences.py             --sequences data/sequences.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_data.txt             | nextalign             --jobs=8             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --genes ORF1a,ORF1b,S,ORF3a,M,N             --sequences /dev/stdin             --output-dir results/translations             --output-basename seqs_data             --output-fasta results/aligned_data.fasta             --output-insertions results/insertions_data.tsv > logs/align_data.txt 2>&1;
    xz -2 results/aligned_data.fasta;
    xz -2 results/translations/seqs_data*.fasta

[Sat Jul 3 00:42:25 2021]
rule clade_files:
input: defaults/clades.tsv
output: results/global/clades.tsv
jobid: 48
benchmark: benchmarks/clade_files_global.txt
wildcards: build_name=global

    cat defaults/clades.tsv > results/global/clades.tsv

[Sat Jul 3 00:42:25 2021]
rule sanitize_metadata:
input: data/metadata.tsv
output: results/sanitized_metadata_data.tsv.xz
log: logs/sanitize_metadata_data.txt
jobid: 12
benchmark: benchmarks/sanitize_metadata_data.txt
wildcards: origin=data

    python3 scripts/sanitize_metadata.py             --metadata data/metadata.tsv             --parse-location-field Location             --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubtitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output results/sanitized_metadata_data.tsv.xz 2>&1 | tee logs/sanitize_metadata_data.txt

[Sat Jul 3 00:42:25 2021]
rule clade_files:
input: defaults/clades.tsv
output: results/north-america_usa/clades.tsv
jobid: 26
benchmark: benchmarks/clade_files_north-america_usa.txt
wildcards: build_name=north-america_usa

    cat defaults/clades.tsv > results/north-america_usa/clades.tsv

[Sat Jul 3 00:42:25 2021]
Finished job 48.
1 of 56 steps (2%) done
[Sat Jul 3 00:42:25 2021]
Finished job 26.
2 of 56 steps (4%) done
/bin/bash: line 1: 98709 Killed: 9 python3 scripts/sanitize_sequences.py --sequences data/sequences.fasta --strip-prefixes hCoV-19/ SARS-CoV-2/ --output /dev/stdout 2> logs/sanitize_sequences_data.txt
98711 Done | nextalign --jobs=8 --reference defaults/reference_seq.fasta --genemap defaults/annotation.gff --genes ORF1a,ORF1b,S,ORF3a,M,N --sequences /dev/stdin --output-dir results/translations --output-basename seqs_data --output-fasta results/aligned_data.fasta --output-insertions results/insertions_data.tsv > logs/align_data.txt 2>&1
[Sat Jul 3 00:51:43 2021]
Error in rule align:
jobid: 11
output: results/aligned_data.fasta.xz, results/insertions_data.tsv, results/translations/seqs_data.gene.ORF1a.fasta.xz, results/translations/seqs_data.gene.ORF1b.fasta.xz, results/translations/seqs_data.gene.S.fasta.xz, results/translations/seqs_data.gene.ORF3a.fasta.xz, results/translations/seqs_data.gene.M.fasta.xz, results/translations/seqs_data.gene.N.fasta.xz
log: logs/align_data.txt (check log file(s) for error message)
shell:

    python3 scripts/sanitize_sequences.py             --sequences data/sequences.fasta             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_data.txt             | nextalign             --jobs=8             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --genes ORF1a,ORF1b,S,ORF3a,M,N             --sequences /dev/stdin             --output-dir results/translations             --output-basename seqs_data             --output-fasta results/aligned_data.fasta             --output-insertions results/insertions_data.tsv > logs/align_data.txt 2>&1;
    xz -2 results/aligned_data.fasta;
    xz -2 results/translations/seqs_data*.fasta
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

/bin/bash: line 1: 98709 Killed:

your process was killed. possibly because it was using too much memory. This step actually runs three commands (sanitize, nextalign, and then xz to compress). Do you have a way of checking which one gets killed? What hardware are you running this on?

I am running it on Macbook pro with M1, 8GB RAM and 512 GB SSD. I can try to see which process gets killed and also look into the memory.

I do not think memory should be an issue. It seems like the sanitize metadata is the step where the process gets killed.

could you then try to only run the sanitize step (writing data to file instead of stdout or piping straight into xz or gzip)? Just to narrow down what is failing?