Hi all,
I’ve previously used the Nextstrain ncov pipeline successfully with a basic build. I’ve now attempted to make a more advanced build for my purposes. The analysis ran for ~2 hours (on a cluster), but then stopped abruptly. The filtering has also not occurred exactly as I intended.
1. The errors
The message in the PBS error file, and the *snakemake.log file, is as follows:
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The error messages in the log file are as follows:
Error in rule subsample:
    jobid: 87
    output: results/nsw_vic/sample-country.fasta
    shell:
        augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'country!=australia' --group-by division year month --sequences-per-group 200 --output results/nsw_vic/sample-country.fasta 2>&1 | tee
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Error in rule subsample:
    jobid: 86
    output: results/nsw_vic/sample-vic.fasta
    shell:
        augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=victoria' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-vic.fasta 2>&1 | tee
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Error in rule subsample:
    jobid: 85
    output: results/nsw_vic/sample-nsw.fasta
    shell:
        augur filter --sequences results/masked.fasta --metadata data/metadata_2020-10-06_12-59.tsv --include defaults/include.txt --exclude-where 'division!=new south wales' --group-by year month --sequences-per-group 2000 --output results/nsw_vic/sample-nsw.fasta 2>&1 | tee
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Based on these errors, how can I go about fixing the analysis to run without error?
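The excerpt above only says that each augur filter call exited with a non-zero code, not why. A minimal Python sketch for pulling the failed rules and their outputs out of the Snakemake log, so that each failing command can then be re-run by hand to see its own error message. It assumes only the block layout shown above ("Error in rule ...", "jobid: ...", "output: ..."); the function name failed_rules is made up for illustration.

```python
import re


def failed_rules(log_text):
    """Collect rule name, job id and output path for every
    'Error in rule ...' block found in a Snakemake log."""
    failures = []
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        header = re.match(r"Error in rule (\S+):$", line)
        if header:
            # A new error block starts here.
            current = {"rule": header.group(1), "jobid": None, "output": None}
            failures.append(current)
        elif current is not None and line.startswith("jobid:"):
            current["jobid"] = int(line.split(":", 1)[1])
        elif current is not None and line.startswith("output:"):
            current["output"] = line.split(":", 1)[1].strip()
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python failed_rules.py path/to/snakemake.log
    with open(sys.argv[1]) as handle:
        for failure in failed_rules(handle.read()):
            print(failure["rule"], failure["jobid"], failure["output"])
```

Running it on the log should list jobs 87, 86 and 85 with their three sample-*.fasta outputs, which narrows the problem to the subsample rule of the nsw_vic build.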
2. The filtering issue
Additionally, the filtering does not appear to be working as I intended. The data I am working with are the nextstrain-formatted sequences and associated metadata from GISAID (accessed 07/10/2020). There are 136806 sequences in the raw fasta file from GISAID, then ncov/results/filtered.fasta has 131223 sequences, and ncov/results/results/aligned-filtered.fasta has 130628 sequences.
These numbers are far larger than I intended. For example, one of the subsampling schemes I defined in builds.yaml (following the advanced customisation guide) is as follows:
subsampling:
  australia:
    country:
      group_by: "division year month"
      seq_per_group: 500
      exclude: "--exclude-where 'country!={country}'"
    region:
      group_by: "country year month"
      seq_per_group: 100
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"
I did a rough calculation, and based on the metadata this filtering scheme should have left me with ~3000 sequences. Or am I just focusing on the wrong files here when trying to see which samples survived filtering? I intend to use the filtered, masked alignment from the pipeline both with Nextstrain and elsewhere (e.g., BEAST).
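For reference, the kind of rough calculation meant here can be sketched in Python as below. This is only an estimate under stated assumptions: the metadata is tab-separated with country, division and date columns, dates are written YYYY-MM-DD, and year and month are taken from the date. The function max_sequences_kept and its keep predicate are hypothetical stand-ins for one subsampling group's exclude rule; sequences forced in via --include and the effect of priorities are ignored, so the result is an upper bound, not what augur filter would return.

```python
import csv
from collections import Counter


def max_sequences_kept(metadata_file, group_by, seq_per_group, keep):
    """Upper bound on the sequences one subsampling group can retain.

    metadata_file: open file object with tab-separated metadata
    group_by:      column names, e.g. ["division", "year", "month"]
    seq_per_group: cap applied to every group
    keep:          function(row) -> bool, standing in for the exclude rule
    """
    group_sizes = Counter()
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        if not keep(row):
            continue
        # Derive year and month from the date column (assumed YYYY-MM-DD);
        # incomplete dates get "?" for the missing part.
        year, month = (row["date"].split("-") + ["?", "?"])[:2]
        derived = dict(row, year=year, month=month)
        group_sizes[tuple(derived[column] for column in group_by)] += 1
    # Each group contributes at most seq_per_group sequences.
    return sum(min(size, seq_per_group) for size in group_sizes.values())


if __name__ == "__main__":
    # Estimate for the "country" group of the scheme above.
    with open("data/metadata_2020-10-06_12-59.tsv", newline="") as handle:
        total = max_sequences_kept(
            handle,
            group_by=["division", "year", "month"],
            seq_per_group=500,
            keep=lambda row: row["country"].lower() == "australia",
        )
    print(total)
```

Summing this estimate over the country, region and global groups gives the expected ceiling for the whole build, which can then be compared against the per-group sample-*.fasta files rather than against filtered.fasta.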
3. My guess?
From the errors, I think that something might be going wrong with my build designed to focus on two adjacent states in Australia: New South Wales and Victoria. I tried to use the advanced example of 'Lac-Leman' as a guide, and this is what I came up with:
  nsw_vic:
    # focal samples
    nsw:
      group_by: "year month"
      seq_per_group: 2000
      exclude: "--exclude-where 'division!=new south wales'"
    vic:
      group_by: "year month"
      seq_per_group: 2000
      exclude: "--exclude-where 'division!=victoria'"
    # Contextual samples from the country
    country:
      group_by: "division year month"
      seq_per_group: 200
      exclude: "--exclude-where 'country!=australia'"
    # Contextual samples from division's region
    region:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region!=oceania'"
      priorities:
        type: "proximity"
        focus: "country"
    # Contextual samples from the rest of the world, excluding the current
    # division to avoid resampling.
    global:
      group_by: "country year month"
      seq_per_group: 5
      exclude: "--exclude-where 'region=oceania'"
      priorities:
        type: "proximity"
        focus: "country"
N.B.: I've got my subsampling schemes defined in the builds.yaml file, as in the example. Should they be elsewhere, e.g. in parameters.yaml?
Any clues as to what I’m doing wrong, or what is going wrong, would be appreciated!
Cheers,
Charles