Diagnosing error + filtering issues

Hi all,

I’ve previously used the Nextstrain ncov pipeline successfully with a basic build. I’ve now attempted to make a more advanced build for my purposes. The analysis ran for ~2 hours (on a cluster), but then stopped abruptly. The filtering has also not occurred exactly as I intended.

1. The errors
The message in both the PBS error file and the *snakemake.log file is as follows:

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

The error messages in the log file are as follows:

Error in rule subsample:
jobid: 87
output: results/nsw_vic/sample-country.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=australia' --group-by division year month --sequences-per-group 200 --output results/nsw_vic/sample-country.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 86
output: results/nsw_vic/sample-vic.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=victoria' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-vic.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 85
output: results/nsw_vic/sample-nsw.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Based on these errors, how can I fix the analysis so that it runs without error?

2. The filtering issue
Additionally, the filtering does not appear to be working as I intended. The data I am working with are the nextstrain-formatted sequences and associated metadata from GISAID (accessed 07/10/2020). There are 136806 sequences in the raw fasta file from GISAID, then ncov/results/filtered.fasta has 131223 sequences, and ncov/results/aligned-filtered.fasta has 130628 sequences.

These numbers are far larger than I intended. For example, one of the subsampling schemes I defined in builds.yaml (following the advanced customisation guide) is as follows:

subsampling:
  australia:
    country:
      group_by: "division year month"
      seq_per_group: 500
      exclude: "--exclude-where 'country!={country}'"

    region:
      group_by: "country year month"
      seq_per_group: 100
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"

    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

I did a rough calculation, and this filtering scheme should have left me with ~3000 sequences based on the metadata. Or am I just focusing on the wrong files here when trying to see which samples survived filtering? I intend to use the filtered, masked alignment from the pipeline both with Nextstrain and elsewhere (e.g., BEAST).
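For reference, my rough calculation was along these lines: count how many distinct groups each filter would produce, then cap each group at its seq_per_group. For the country-level sample, something like this counts the distinct division × year-month groups among the Australian records (assuming the standard nextstrain-formatted metadata columns country, division, and date):

awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) h[$i]=i; next} $(h["country"])=="Australia" {print $(h["division"]), substr($(h["date"]), 1, 7)}' data/metadata_2020-10-06_12-59.tsv | sort -u | wc -l

Repeating this for the region and global samples, and noting that most groups contain far fewer sequences than the cap allows, is roughly how I arrived at ~3000.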

3. My guess?
From the errors, I think something might be going wrong with my build designed to focus on two adjacent states in Australia: New South Wales and Victoria. I tried to use the advanced 'Lac-Leman' example as a guide, and this is what I came up with:

nsw_vic:
  # focal samples
  nsw:
    group_by: "year month"
    seq_per_group: 2000
    exclude: "--exclude-where 'division!=new south wales'"
  vic:
    group_by: "year month"
    seq_per_group: 2000
    exclude: "--exclude-where 'division!=victoria'"

  # Contextual samples from the country
  country:
    group_by: "division year month"
    seq_per_group: 200
    exclude: "--exclude-where 'country!=australia'"

  # Contextual samples from division's region
  region:
    group_by: "country year month"
    seq_per_group: 10
    exclude: "--exclude-where 'region!=oceania'"
    priorities:
      type: "proximity"
      focus: "country"

  # Contextual samples from the rest of the world, excluding the current
  # division to avoid resampling.
  global:
    group_by: "country year month"
    seq_per_group: 5
    exclude: "--exclude-where 'region=oceania'"
    priorities:
      type: "proximity"
      focus: "country"

n.b.: I’ve got my subsampling schemes defined in the builds.yaml file as in the example - should they be elsewhere, e.g. in parameters.yaml?

Any clues as to what I’m doing wrong, or what is going wrong, would be appreciated!

Cheers,
Charles

Hi @cfoster! Thanks for reaching out.

Regarding the filtering numbers - we actually filter multiple times during the pipeline! The initial filter just removes sequences we never include - those too short, with too many gaps, without proper dates, etc. We then run QC and filter again, excluding those with too much/too little diversity and clusters of mutations that indicate sequencing/assembly problems. Finally, we do multiple filtering steps to accomplish the ‘subsampling’ step.

The files you are seeing in ncov/results (filtered.fasta and aligned-filtered.fasta) are from the first (basic date/length) and second (QC) filtering steps. The subsampled alignments will eventually appear in a subfolder, ncov/results/<name of your build>. So, the fact there are still so many sequences in the first two files is ok!
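If you want to watch the counts change across these stages yourself, counting FASTA headers at each step works (paths assume the default ncov layout):

grep -c '^>' results/filtered.fasta results/aligned-filtered.fasta results/masked.fasta

grep prints a per-file count, so you can see how many sequences each successive step keeps.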

However, it’s hard to see where the error might be. Could you try running one of the commands that errored directly on the command line, to see if we get a more detailed error message?

For example, just run:

augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta

And see what comes up. This should hopefully give us a more detailed error message to start working with!

Hi @emmahodcroft,

Thanks for the reply, and apologies for my slow response in return.

As you suggested, I tried re-running:

augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta

Running the command in isolation still resulted in an error:

augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta
invalid --exclude-where clause "south", should be of from property=value or property!=value
invalid --exclude-where clause "wales'", should be of from property=value or property!=value

130714 sequences were dropped during filtering
130717 of these were dropped because of 'division!=new'
0 of these were dropped because of subsampling criteria

3 sequences were added back because they were in defaults/include.txt

3 sequences have been written out to results/nsw_vic/sample-nsw.fasta

It seems here that the filtering step doesn’t like that there are spaces in the name of the state I’m interested in (“New South Wales”). Should I be specifying the state differently in the build to account for these spaces (see the OP for how I’ve put it in the build)?

In any case, I tried running the full build again, and got essentially the same errors. In two builds the error occurs when trying to filter at the country level, and in the other build it’s at the division level:

Error in rule subsample:
jobid: 73
output: results/new_south_wales/sample-division.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'region!=Oceania' 'country!=Australia' 'division!=New South Wales' --group-by year month --sequences-per-group 2000 --output results/new_south_wales/sample-division.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 87
output: results/nsw_vic/sample-country.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=australia' --group-by division year month --sequences-per-group 200 --output results/nsw_vic/sample-country.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule subsample:
jobid: 68
output: results/australia/sample-country.fasta
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=Australia' --group-by division year month --sequences-per-group 500 --output results/australia/sample-country.fasta 2>&1 | tee
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I still can’t figure out what is causing the errors to crash the analysis. If possible, is there a way I could privately send through the log file(s) + the build so you can see what’s going on?

Thanks,
Charles

Hi Charles,
Thanks for writing back! And yes, running that command in isolation was really useful, thanks! It does indeed look like the issue is that the area you want to include is being split into three words instead of being processed as one. This is interesting, and I think it must be a slightly weird quirk in our system - if you supply things as wildcards this is not an issue, but if they are supplied literally then this seems to be a problem!

This then means that, of course, there are no sequences with 'division=new', so everything is dropped. I think these near-empty files then cause an error further down the pipeline.

You should be able to solve this by changing any multi-word --exclude-where values to lowercase and putting dashes (-) between the words, e.g. --exclude-where 'division!=new-south-wales'. I found that I did this myself in my own 'South-Central' builds, though I'm not sure where I picked it up!
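For your nsw_vic scheme, the two focal entries would then look something like this (a sketch of just those entries from the config you posted, with only the exclude values changed):

  nsw:
    group_by: "year month"
    seq_per_group: 2000
    exclude: "--exclude-where 'division!=new-south-wales'"
  vic:
    group_by: "year month"
    seq_per_group: 2000
    exclude: "--exclude-where 'division!=victoria'"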

Hopefully this will fix your issue and let your run continue, but I’m tagging in @jlhudd as he can probably offer some insight as to why this is the case!

Thanks for posting here, @cfoster, and thank you, @emmahodcroft, for calling my attention to this issue. There seem to be multiple separate issues with the workflow, based on the examples shared so far:

  1. The subsampling command-line output isn’t getting logged to a file where users can inspect it for specific error messages.

  2. All of the subsampling commands that @cfoster shared seem to fail, including those without spaces in their names; for example, the country-level command with --exclude-where 'country!=Australia' should work as expected.

  3. The --exclude-where argument doesn’t seem to support spaces in its arguments.

We have a better chance of debugging the last two issues if we can fix the first one. I will make a pull request to add proper logging support for the subsampling rules and post back here when that is merged into the workflow. Then we can try running the workflow again and see what specific errors emerge.
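The change itself should be small: each subsample rule’s shell command will hand tee a destination file so the augur filter output is preserved, along the lines of the following (illustrated with your country-level command; the exact log path here is just an example):

augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=Australia' --group-by division year month --sequences-per-group 500 --output results/australia/sample-country.fasta 2>&1 | tee logs/subsample_australia_country.txt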

Hi again, @cfoster! We’ve resolved the logging issue for subsampling with the ncov workflow. When you have time, would you mind pulling the latest version of the ncov repository and re-running your analysis?

Assuming you still receive errors at the subsampling steps, you should now be able to inspect the contents of the subsampling log files (for example, in logs/subsample_australia_country.txt) and get more information about why the subsample rule is failing.

Hi @jlhudd,

Thanks for putting in the effort to help diagnose these issues and continuing to maintain such a great tool.

I went into my clone of the ncov GitHub repo and pulled the new changes. I then deleted all previous results and re-ran my custom builds. As expected, the analysis died with the same errors as before, e.g.:

Error in rule subsample:
jobid: 73
output: results/new_south_wales/sample-division.fasta
log: logs/subsample_new_south_wales_division.txt (check log file(s) for error message)
shell:
augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'region!=Oceania' 'country!=Australia' 'division!=New South Wales' --group-by year month --sequences-per-group 2000 --output results/new_south_wales/sample-division.fasta 2>&1 | tee logs/subsample_new_south_wales_division.txt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

However, the log file specified in that error (i.e., logs/subsample_new_south_wales_division.txt) is actually empty. Within logs/, there are six log files related to subsampling:

  • subsample_australia_country.txt
  • subsample_victoria_division.txt
  • subsample_new_south_wales_division.txt
  • subsample_nsw_vic_nsw.txt
  • subsample_nsw_vic_country.txt
  • subsample_nsw_vic_vic.txt

The first three of these logs relate to the subsampling steps that resulted in errors, and they’re all empty. The last three relate to the successful steps, and they all contain text. For example, subsample_nsw_vic_vic.txt:

124351 sequences were dropped during filtering
121422 of these were dropped because of 'division!=victoria'
2932 of these were dropped because of subsampling criteria
3 sequences were added back because they were in defaults/include.txt
6366 sequences have been written out to results/nsw_vic/sample-vic.fasta

Unfortunately, I still can’t diagnose the error any further…

Cheers,
Charles

Hi again @jlhudd and @emmahodcroft,

I know you both must be exceptionally busy maintaining Nextstrain, but here’s a friendly bump of this thread for when you might have a chance to check it out. Any idea what might be causing the errors in my previous post? Would it help if I sent through my builds.yaml file?

I’m about to receive a bunch of new sequences, so it would be great to have the Nextstrain pipeline working.

Thanks,
Charles

Thank you for bumping this thread, @cfoster! It has been an unusually busy time for both @emmahodcroft and me. Looking at your last post, I’m worried that the non-zero exit code and empty log files suggest that the augur process is getting killed by the operating system. This could happen if we’re using too much memory (e.g., trying to load all sequences into memory for a filter step).

One way you might test this is to copy the shell command as Snakemake prints it to the screen and run it manually from the shell. Depending on your operating system, you may get a message saying that your process has been killed. You can also inspect your resource usage while the command is running, using top (Linux/Mac) or “Activity Monitor” (Mac).
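On Linux, you can also capture a one-shot peak-memory reading by wrapping the command with GNU time (a sketch; /usr/bin/time -v is the GNU tool rather than the shell built-in, so the path may differ on your cluster):

/usr/bin/time -v augur filter \
  --sequences results/masked.fasta \
  --metadata data/metadata_2020-10-06_12-59.tsv \
  --include defaults/include.txt \
  --exclude-where 'division!=victoria' \
  --group-by year month \
  --sequences-per-group 2000 \
  --output results/nsw_vic/sample-vic.fasta

Then look for “Maximum resident set size” in the report printed at the end.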

I will try to benchmark this myself on my personal laptop tomorrow, but if you have time today to try it out, would you let us know what you find?

I believe the Snakemake top-level log files under .snakemake/log/ should contain the specific exit code (and possibly more error messages than the job-level log files). The specific exit code will provide more information about the failure.
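For example, something like this shows the tail of the most recent top-level log (the exact file naming may vary by Snakemake version):

tail -n 50 "$(ls -t .snakemake/log/*.snakemake.log | head -n 1)"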


I just ran the first augur filter step of the workflow on my laptop and can confirm a couple of things:

  1. The uncompressed input sequence file is ~5 GB.
  2. augur filter reads all sequences into memory at once.

This means augur filter will quickly consume most available memory on a personal computer. If you have any other programs using a moderate amount of memory (e.g., Chrome), the Python process associated with augur filter will get killed for exceeding available memory. This problem is exacerbated if you have all the GISAID sequences plus any additional local sequences that make the input to the filter even larger.
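If you want to watch this happen on your own machine, run the filter in one terminal and check the live resident memory of the augur process in another (Linux; assumes a single matching process):

ps -o rss= -o args= -p "$(pgrep -f 'augur filter')"   # RSS is reported in KB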

We clearly need to modify augur filter so it doesn’t use as much memory. I will look into this more today and make a proposal on GitHub that I can share here, too.

I didn’t get a chance (yet) to test out those steps, but thanks for trying it yourself. I’ve been running the pipeline on a cluster while requesting 12 CPUs and 16 GB of memory. The input sequences file ('sequences_2020-10-06_07-15.fasta') has 3.8 GB of sequences. Would the potential 'out of memory' issue still be applicable here? I could try either running it on the cluster while requesting much greater resources, or on my laptop, which has 64 GB of RAM.

Looking forward to seeing what changes are made to augur filter.

Thanks,
Charles

Memory could still be an issue on your cluster if your resources are shared and others can consume part of the 16 GB (for example, if there are soft limits on memory use for cluster jobs). But memory seems less likely to be the culprit in that environment than on an old MacBook.

We do have a patch fixing augur filter’s memory consumption that we hope to release by tomorrow, so we’ll soon be able to rule memory out either way. 🙂

@cfoster, we just released Augur v10.0.4 which includes the memory usage fix for augur filter. If you have a chance to upgrade to this latest version and test your workflow, would you let us know if it fixes the issue?
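Depending on how augur was installed, upgrading should look something like one of these (assuming a pip-based install of the nextstrain-augur package, or a conda install from bioconda):

python3 -m pip install --upgrade nextstrain-augur   # pip-based installs
# or, for conda installations:
conda update -c conda-forge -c bioconda augur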

Hi again @jlhudd,
Thanks for the update. I’ve cloned the augur GitHub repo and built the new version of augur using python3 -m pip install '.[full]'. I’ve submitted a Nextstrain job to our cluster, and I’ll check back in once I know whether it’s successful.