Error in rule proximity_score

Hi there,
I’m running Nextstrain with three builds (division, country, and global). The division and global builds seem to run fine, but I ran into the following error message with the country build:

Error in rule proximity_score:
jobid: 65
output: results/Australia/proximity_country.tsv
log: logs/subsampling_priorities_Australia_country.txt (check log file(s) for error message)
shell:

    /home/minion/opt/miniconda3/envs/nextstrain/bin/python scripts/priorities.py \
        --alignment results/masked.fasta \
        --metadata data/global_metadata.tsv \
        --reference defaults/reference_seq.gb \
        --focal-alignment results/Australia/sample-country.fasta \
        --output results/Australia/proximity_country.tsv 2>&1 | tee logs/subsampling_priorities_Australia_country.txt
    
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

If I then restart the analysis without the country build, it seems to run fine.

I think I’m OK with just the division build, but I’m curious why only the country build fails, given that it uses the same data and scripts.

Hi @RobynHall, sorry you’re having trouble here! Is there anything written in the logs/subsampling_priorities_Australia_country.txt file that it references? Unfortunately sometimes when it fails there is not.

If not, you might want to try running a command like
snakemake --profile <profile> results/Australia/proximity_country.tsv -np

This won’t run anything, but it’ll print the exact shell command that Snakemake would use to generate that file (proximity_country.tsv). You can then copy this command and run it yourself, leaving off the bit at the end where it sends output to a log file (everything from 2>&1 | tee onwards). This should let you watch any error come up in real time on the command line, without it getting lost somewhere on the way to the log.

If you could then let us know what the error is here, we can try to help!
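
As a rough sketch of the "leave off the bit at the end" step: the small helper below (a hypothetical name, not part of the ncov workflow) drops a trailing "2>&1 | tee <logfile>" from a printed command so the output goes straight to the terminal. The command string is abbreviated from the one in the error message above.

```python
def strip_log_pipe(command):
    """Drop a trailing '2>&1 | tee <logfile>' so output goes to the terminal.

    Hypothetical helper for illustration; it only handles this one pattern.
    """
    marker = " 2>&1 | tee "
    head, found, _ = command.partition(marker)
    return head.strip() if found else command.strip()


# Abbreviated copy of the command Snakemake printed (some options omitted).
printed = (
    "python scripts/priorities.py --alignment results/masked.fasta "
    "--focal-alignment results/Australia/sample-country.fasta "
    "--output results/Australia/proximity_country.tsv "
    "2>&1 | tee logs/subsampling_priorities_Australia_country.txt"
)

# The result can be pasted into the shell and run from the workflow directory.
print(strip_log_pipe(printed))
```

Doing the same thing by hand (deleting everything from 2>&1 onwards) works just as well.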

Thanks @emmahodcroft. There’s nothing in the log file, unfortunately! I will have a go at running the suggested command and get back to you.

PS: global and division builds did run fine all the way to completion and output is all good for those.

Here is the detailed error from the dry run:

InputFunctionException in line 331 of /home/minion/nextstrain/ncov/workflow/snakemake_rules/main_workflow.smk:
Error:
KeyError: 'Australia'
Wildcards:
build_name=Australia
subsample=country
Traceback:
File "/home/minion/nextstrain/ncov/workflow/snakemake_rules/main_workflow.smk", line 290, in get_priorities
File "/home/minion/nextstrain/ncov/workflow/snakemake_rules/main_workflow.smk", line 280, in _get_subsampling_settings
File "/home/minion/nextstrain/ncov/workflow/snakemake_rules/common.smk", line 4, in _get_subsampling_scheme_by_build_name

Cheers :)

Thanks Robyn! Hmm. This looks like it might be related to an earlier error where I think the issue was capitalization, though I’m not 100% sure why. In the builds.yaml, could you try replacing the country definition with ‘australia’? If that still doesn’t work, try renaming the run (the build name at the very top of that section) to ‘australia’ as well.

We’ve had this issue crop up before (if it is this) and we need to trace through to see what’s happening, as it should work (and in many cases does work) regardless of capitalization. Let me know if that helps, and we can continue to try and see why this crops up!
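
To illustrate the kind of mismatch being suspected here (a sketch only, not the workflow's actual code): Python dictionary lookups are case-sensitive, so a build stored under one capitalization and requested under another raises a KeyError like the one in the dry run. The `builds` dictionary and both function names below are made up for the example.

```python
# Hypothetical stand-in for a parsed builds.yaml; not the real ncov config.
builds = {
    "australia": {"subsampling_scheme": "country", "country": "Australia"},
}


def get_build(builds, build_name):
    """Exact-match lookup, which is what a plain dictionary does."""
    return builds[build_name]


def find_case_mismatch(builds, build_name):
    """Return the key that equals build_name when case is ignored, or None."""
    for key in builds:
        if key.lower() == build_name.lower():
            return key
    return None


try:
    get_build(builds, "Australia")
except KeyError as err:
    # The lookup fails even though a key differing only by case exists.
    print("KeyError:", err)
    print("Key ignoring case:", find_case_mismatch(builds, "Australia"))
```

If the build name in builds.yaml and the name the workflow looks up differ only by case, a check like this would show it.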

@jlhudd Just tagging you in case this is the same issue as previously (or seems to be). Happy to chat about how we maybe can fix this!

I get the same error message when I try to analyze region ‘Europe’.
The log file you mentioned shows the following problem:

logs/subsampling_priorities_europe_region.txt

Done reading the alignments.
Traceback (most recent call last):
  File "scripts/priorities.py", line 155, in <module>
    d = np.array(calculate_distance_matrix(context_seqs_dict['snps'], focal_seqs_dict['snps'], consensus = context_seqs_dict['consensus']))
  File "scripts/priorities.py", line 107, in calculate_distance_matrix
    d = d + (1*(sparse_matrix_A==103) * (sparse_matrix_B.transpose()==103))
  File "/home/ubuntu/miniconda3/envs/nextstrain/lib/python3.6/site-packages/scipy/sparse/base.py", line 480, in __mul__
    return self._mul_sparse_matrix(other)
  File "/home/ubuntu/miniconda3/envs/nextstrain/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 509, in _mul_sparse_matrix
    np.asarray(other.indices, dtype=idx_dtype))
RuntimeError: nnz of the result is too large

Any suggestion how to fix this?
Thanks!

Hi @AlexS - I think this might be a different error you’re getting, if I’m understanding? Did changing the capitalization of the ‘australia’ build help at all?
For this European build, we’ve run into this problem a couple of times, and unfortunately it seems to be a sign that either the run is too large or the sequences are too divergent. Reducing the number of sequences you subsample (so the run is smaller) should hopefully fix this!
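
A back-of-the-envelope way to see why run size matters, under two assumptions that are not confirmed in this thread: that the failing product in priorities.py yields a matrix with one row per context sequence and one column per focal sequence, and that the SciPy build in the traceback uses 32-bit indices for the result. In that case the number of stored entries can be at most context × focal, and trouble becomes possible once that product passes the 32-bit limit. The sequence counts below are invented.

```python
INT32_MAX = 2**31 - 1  # assumed index limit; depends on the SciPy build


def worst_case_entries(n_context, n_focal):
    """Upper bound on stored entries in a context-by-focal matrix."""
    return n_context * n_focal


def may_overflow(n_context, n_focal, limit=INT32_MAX):
    """True if the worst-case entry count exceeds the assumed index limit."""
    return worst_case_entries(n_context, n_focal) > limit


# Made-up sequence counts, for illustration only.
for n_context, n_focal in [(100_000, 5_000), (400_000, 10_000)]:
    print(
        n_context,
        n_focal,
        worst_case_entries(n_context, n_focal),
        may_overflow(n_context, n_focal),
    )
```

This is only an upper bound: more divergent sequences mean more nonzero entries, which would fit with the "too divergent" case above, while fewer sequences on either side shrink the bound directly.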