Diagnosing error + filtering issues

Hi all,

I’ve previously used the Nextstrain ncov pipeline successfully with a basic build. I’ve now attempted to make a more advanced build for my purposes. The analysis ran for ~2 hours (on a cluster), but then stopped abruptly. The filtering has also not occurred exactly as I intended.

1. The errors
The message in the PBS error file, and the *snakemake.log file, is as follows:

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

The error messages in the log file are as follows:

Error in rule subsample:
jobid: 87
output: results/nsw_vic/sample-country.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=australia' --group-by division year month --sequences-per-group 200 --output results/nsw_vic/sample-country.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 86
output: results/nsw_vic/sample-vic.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=victoria' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-vic.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 85
output: results/nsw_vic/sample-nsw.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Based on these errors, how can I go about fixing the analysis to run without error?

2. The filtering issue
Additionally, the filtering does not appear to be working as I intended. The data I am working with are the Nextstrain-formatted sequences and associated metadata from GISAID (accessed 07/10/2020). There are 136806 sequences in the raw fasta file from GISAID, then ncov/results/filtered.fasta has 131223 sequences, and ncov/results/aligned-filtered.fasta has 130628 sequences.

These numbers are far larger than I intended. For example, one of the subsampling schemes I defined in builds.yaml (following the advanced customisation guide) is as follows:

subsampling:
  australia:
    country:
      group_by: "division year month"
      seq_per_group: 500
      exclude: "--exclude-where 'country!={country}'"

    region:
      group_by: "country year month"
      seq_per_group: 100
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"

    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

I did a rough calculation, and this filtering scheme should have left me with ~3000 sequences based on the metadata. Or am I just focusing on the wrong files here when trying to see which samples survived filtering? I intend to use the filtered, masked alignment from the pipeline both with Nextstrain and elsewhere (e.g., BEAST).
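For reference, my rough calculation assumed the per-group cap works roughly like the sketch below. This is a toy illustration of the grouping logic, not augur's actual code, and the metadata rows are made up rather than real GISAID data:

```python
from collections import Counter

# Toy stand-in for metadata rows: (division, year, month).
# These records are invented for illustration only.
records = [
    ("Victoria", 2020, 7), ("Victoria", 2020, 7),
    ("Victoria", 2020, 7), ("Victoria", 2020, 7),
    ("Victoria", 2020, 8),
    ("New South Wales", 2020, 7), ("New South Wales", 2020, 7),
    ("New South Wales", 2020, 8),
]
seq_per_group = 2  # analogous to --sequences-per-group

# Each (division, year, month) combination keeps at most seq_per_group
# sequences, so the surviving total is the sum of the capped group sizes.
group_sizes = Counter(records)
kept = sum(min(size, seq_per_group) for size in group_sizes.values())
print(kept)  # 6 of the 8 records survive
```

So the expected output size is the number of occupied groups times the cap, reduced wherever a group holds fewer sequences than the cap.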

3. My guess?
From the errors, I think that something might be going wrong with my build designed to focus on two adjacent states in Australia: New South Wales and Victoria. I tried to use the advanced example of 'Lac-Leman' as a guide, and this is what I came up with:

nsw_vic:
  # Focal samples
  nsw:
    group_by: "year month"
    seq_per_group: 2000
    exclude: "--exclude-where 'division!=new south wales'"
  vic:
    group_by: "year month"
    seq_per_group: 2000
    exclude: "--exclude-where 'division!=victoria'"

  # Contextual samples from the country
  country:
    group_by: "division year month"
    seq_per_group: 200
    exclude: "--exclude-where 'country!=australia'"

  # Contextual samples from the division's region
  region:
    group_by: "country year month"
    seq_per_group: 10
    exclude: "--exclude-where 'region!=oceania'"
    priorities:
      type: "proximity"
      focus: "country"

  # Contextual samples from the rest of the world, excluding the current
  # division to avoid resampling.
  global:
    group_by: "country year month"
    seq_per_group: 5
    exclude: "--exclude-where 'region=oceania'"
    priorities:
      type: "proximity"
      focus: "country"

N.b.: I’ve got my subsampling schemes defined in the builds.yaml file as in the example - should they be elsewhere, e.g. in parameters.yaml?

Any clues as to what I’m doing wrong, or what is going wrong, would be appreciated!

Cheers,
Charles

Hi @cfoster! Thanks for reaching out.

Regarding the filtering numbers - we actually filter multiple times during the pipeline! The initial filter just removes sequences we never include - those too short, with too many gaps, without proper dates, etc. We then run QC and filter again, excluding those with too much/too little diversity and clusters of mutations that indicate sequencing/assembly problems. Finally, we do multiple filtering steps to accomplish the ‘subsampling’ step.

The files you are seeing in ncov/results (filtered.fasta and aligned-filtered.fasta) are from the first (basic date/length) and second (QC) filtering steps. The subsampled alignments will eventually appear in a subfolder, ncov/results/<name of your build>. So, the fact there are still so many sequences in the first two files is ok!

However, it’s hard to see where the error might be. Could you try running one of the commands that errored on the command line, to see if we get a more detailed error message?

For example, just run:

augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta

And see what comes up. This should hopefully give us a more detailed error message to start working with!

Hi @emmahodcroft,

Thanks for the reply, and apologies for my slow reply in return.

As you suggested, I tried re-running:

augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta

Running the command in isolation still resulted in an error:

augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta
invalid --exclude-where clause "south", should be of from property=value or property!=value
invalid --exclude-where clause "wales'", should be of from property=value or property!=value

130714 sequences were dropped during filtering
130717 of these were dropped because of ''division!=new'
0 of these were dropped because of subsampling criteria

3 sequences were added back because they were in defaults/include.txt

3 sequences have been written out to results/nsw_vic/sample-nsw.fasta

It seems here that the filtering step doesn’t like that there are spaces in the name of the state I’m interested in (“New South Wales”). Should I be specifying the state differently in the build to account for these spaces (see the OP for how I’ve put it in the build)?

In any case, I tried running the full build again, and got essentially the same errors. In two builds the error occurs when trying to filter at the country level, and in the other build it’s at the division level:

Error in rule subsample:
jobid: 73
output: results/new_south_wales/sample-division.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'region!=Oceania' 'country!=Australia' 'division!=New South Wales' --group-by year month --sequences-per-group 2000 --output results/new_south_wales/sample-division.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 87
output: results/nsw_vic/sample-country.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=australia' --group-by division year month --sequences-per-group 200 --output results/nsw_vic/sample-country.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 68
output: results/australia/sample-country.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=Australia' --group-by division year month --sequences-per-group 500 --output results/australia/sample-country.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I still can’t figure out how the errors are making the analysis crash. If possible, is there a way I could privately send through the log file(s) + the build to see what’s going on?

Thanks,
Charles

Hi Charles,
Thanks for writing back! And yes, running that command was really useful, thanks! It does indeed look like the issue is that the area you want to include is being split into three words instead of being processed as one. This is interesting, and I think it must be a slightly weird quirk in our system - if you supply values as wildcards this is not an issue, but supplying them literally seems to be a problem!

This then means that, of course, there are no sequences with 'division=new', so everything is dropped. I think these near-empty files then cause an error further down the pipeline.
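For what it's worth, the word-splitting can be reproduced with Python's shlex module, which mimics bash tokenisation. This is just an illustrative sketch; the typographic-quote variant stands in for the quoting being lost somewhere between the config and the shell:

```python
import shlex

# With ordinary ASCII single quotes, bash-style tokenisation keeps the
# clause together as a single argument:
good = shlex.split("--exclude-where 'division!=new south wales'")
print(good)  # ['--exclude-where', 'division!=new south wales']

# If the quoting is lost (or typographic quotes sneak in, which neither
# bash nor shlex treat as quote characters), the clause splits on every
# space -- producing exactly the stray "south" and "wales'" clauses that
# augur complained about:
bad = shlex.split("--exclude-where \u2018division!=new south wales\u2019")
print(bad)  # ['--exclude-where', '\u2018division!=new', 'south', 'wales\u2019']
```

That matches the log output nicely: the first surviving token after the flag becomes the only clause augur can parse ("division!=new"), and everything after the first space is rejected.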

You should be able to solve this by changing any multi-word --exclude-where values to lowercase and putting dashes (-) between the words - e.g. --exclude-where 'division!=new-south-wales'. I found that I did this myself in my own 'South-Central' builds, but am not sure where I got this from!
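Applied to the nsw_vic build from the original post, that would look something like the sketch below (only the exclude line changes; it's worth double-checking that the dashed form actually matches how the workflow handles division names in your metadata):

```yaml
nsw:
  group_by: "year month"
  seq_per_group: 2000
  exclude: "--exclude-where 'division!=new-south-wales'"
```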

Hopefully this will fix your issue and let your run continue, but I’m tagging in @jlhudd as he can probably offer some insight as to why this is the case!

Thanks for posting here, @cfoster, and thank you, @emmahodcroft, for calling my attention to this issue. There seem to be multiple separate issues with the workflow, based on the examples shared so far:

  1. The subsampling command-line output isn’t getting logged to a file where users can inspect it for specific error messages.

  2. All of the subsampling commands that @cfoster shared seem to fail, including those without spaces in their names.

  3. The --exclude-where argument doesn’t seem to support spaces in its arguments.

We have a better chance of debugging the last two issues if we can fix the first one. I will make a pull request to add proper logging support for the subsampling rules and post back here when that is merged into the workflow. Then we can try running the workflow again and see what specific errors emerge.

Hi again, @cfoster! We’ve resolved the logging issue for subsampling with the ncov workflow. When you have time, would you mind pulling the latest version of the ncov repository and re-running your analysis?

Assuming you still receive errors at the subsampling steps, you should now be able to inspect the contents of the subsampling log files (for example, in logs/subsample_australia_country.txt) and get more information about why the subsample rule is failing.

Hi @jlhudd,

Thanks for putting in the effort to help diagnose these issues and continuing to maintain such a great tool.

I went into my clone of the Nextstrain GitHub repo and pulled the new changes. I then deleted all previous results and re-ran my custom builds. As expected, the analysis died with the same errors as before, e.g.:

Error in rule subsample:
jobid: 73
output: results/new_south_wales/sample-division.fasta
log: logs/subsample_new_south_wales_division.txt (check log file(s) for error message)
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'region!=Oceania' 'country!=Australia' 'division!=New South Wales' --group-by year month --sequences-per-group 2000 --output results/new_south_wales/sample-division.fasta 2>&1 | tee logs/subsample_new_south_wales_division.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

However, the log file specified in that error (i.e., logs/subsample_new_south_wales_division.txt) is actually empty. Within logs/, there are six log files related to subsampling:

  • subsample_australia_country.txt
  • subsample_victoria_division.txt
  • subsample_new_south_wales_division.txt
  • subsample_nsw_vic_nsw.txt
  • subsample_nsw_vic_country.txt
  • subsample_nsw_vic_vic.txt

The first three of these logs relate to the subsampling steps that resulted in errors, and they’re all empty. The last three relate to successful steps, and they all contain output. For example, subsample_nsw_vic_vic.txt:

124351 sequences were dropped during filtering
121422 of these were dropped because of 'division!=victoria'
2932 of these were dropped because of subsampling criteria
3 sequences were added back because they were in defaults/include.txt
6366 sequences have been written out to results/nsw_vic/sample-vic.fasta

Unfortunately, I still can’t diagnose the error any further…

Cheers,
Charles