Unsure why data is filtering out

Hello,
I am unsure if I am missing something relatively obvious here, so I hope you can help, because I am not sure where I am going wrong. I will soon be receiving data from several sources, but I thought that I would try to produce a build using samples from GISAID. I am trying to make something for just Sweden (for which GISAID has around 7000 samples).

I have thus taken all of the data directly from GISAID (the files named FASTA and metadata in their downloads section) and set that as the data in a build. I have modified build.yaml (from the example tutorial) to show a country level build in accordance to the tutorial.I have also modified the config file accordingly (so everything ‘points’ to where it should). I am hoping to get something like neherlab have for Sweden (in the local build shown for it where there is data shown in each region). I haven’t changed anything else from the tutorial files really.

Unfortunately, all of the data for Sweden is filtered out (along with most other samples!) and I am not really sure why.

Any advice that you could give me would be very much appreciated.

Hi @l.hughes - thanks for reaching out! It might be helpful to copy your build.yaml file here (or give a link to where it can be seen, for example on github), so that it’s possible to see what you have currently.

If all Swedish samples are being filtered it it might mean that something isn’t quite being set up correctly.

You can actually see how the Neher Lab builds are set up to run the European build (including Sweden), as the build.yaml is available here - I’ve set it to where the country-level sampling takes place as it might be useful to compare to your own file!

Hello, Thanks so much for your help, I really appreciate it! I have set up a GitHub repository with just the builds and config file that I am currently working with (GitHub - LianeHughes/testNextstrain). As you’ll see, I have changed things a little bit to use the Neher Lab code that you linked (and using the custom snakemake rules used in that build too). I have included just a build for Sweden. With this set up, I now get everything running for about 3 hours before I get the following error in my terminal and log (before everything stops):

Error in rule align:
jobid: 34
output: results/aligned_gisaid.fasta
log: logs/align_gisaid.txt (check log file(s) for error message)
shell:

        mafft                 --auto                 --thread 2                 --keeplength                 --addfragments                 data/GISAID/sequences.fasta                 defaults/reference_seq.fasta > results/aligned_gisaid.fasta 2> logs/align_gisaid.txt
        
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/align_gisaid.txt:
nadd = 732288
npair = 732288
nseq = 732289
nlen = 59576
use ktuples, size=6!
nadd = 732288
ppenalty_ex = -10
nthread = 2
blosum 62 / kimura 200
sueff_global = 0.100000
norg = 1
njobc = 2
generating a scoring matrix for nucleotide (dist=200) … done

Making a distance matrix …
(filepath)opt/miniconda3/envs/nextstrain/bin/mafft: line 2747: 5455 Killed: 9 “$prefix/addsingle” -Q 100 $legacygapopt -W $tuplesize -O $outnum $addsinglearg $addarg $add2ndhalfarg -C $numthreads $memopt $weightopt $treeinopt $treeoutopt $distoutopt $seqtype $model -f “-”$gop -h $aof $param_fft $localparam $algopt $treealg $scoreoutarg < infile > /dev/null 2>> “$progressfile”

Hi @l.hughes, this error is caused when mafft tries to use more memory than is available on your machine and its process gets killed. We think this issue is caused by a memory leak in mafft. The best workaround for now is to switch from mafft to our nextalign tool. @rneher has written a guide for making this switch to nextalign.

Would you be willing to try out this alternate approach and let us know how it works for you?

1 Like

Hello, thank you for your help. I followed the guide for switching to next align and updated my Conda environment using the instructions here: Local Installation — Nextstrain documentation

but unfortunately I got the following error:

Error in rule align:
jobid: 35
output: results/aligned_gisaid.fasta, results/insertions_gisaid.tsv, results/translations/seqs_gisaid.gene.ORF1a.fasta, results/translations/seqs_gisaid.gene.ORF1b.fasta, results/translations/seqs_gisaid.gene.S.fasta, results/translations/seqs_gisaid.gene.ORF3a.fasta, results/translations/seqs_gisaid.gene.M.fasta, results/translations/seqs_gisaid.gene.N.fasta
log: logs/align_gisaid.txt (check log file(s) for error message)
shell:

        nextalign                 --jobs=2                 --reference defaults/reference_seq.fasta                 --genemap defaults/annotation.gff                 --genes ORF1a,ORF1b,S,ORF3a,M,N                 --sequences data/GISAID/sequences.fasta                 --output-dir results/translations                 --output-basename seqs_gisaid                 --output-fasta results/aligned_gisaid.fasta                 --output-insertions results/insertions_gisaid.tsv > logs/align_gisaid.txt 2>&1
        
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Logfile logs/align_gisaid.txt:
/bin/bash: nextalign: command not found

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/liahu895/Documents/scilifelab/COVID_portal/Nextstrain/COV_dasboard/ncov/.snakemake/log/2021-04-06T155230.763044.snakemake.log

I have tried going through the update stages again just in case something went wrong, but it says that everything is updated:

conda update --all

Collecting package metadata (current_repodata.json): done

Solving environment: done

All requested packages already installed.

Thanks again for any advice you can give me!

Ah, I’m sorry about that @l.hughes! That’s a bug in our documentation where our docs don’t yet reflect nextalign as a potential dependency in the Conda environment.

The best way to make sure you have all the packages installed to run the workflow is to use Snakemake’s --use-conda flag (e.g., snakemake --use-conda --profile my_profiles/getting_started). This will look at the workflow’s default Conda environment file that we keep updated. Whenever we change the dependencies for the workflow and you pull those changes into your local repository, Snakemake will see those changes and update your Conda environment automatically.

If you prefer to run Snakemake from inside your existing Nextstrain Conda environment, you can install nextalign with the following command from inside your active environment:

conda install -c conda-forge -c bioconda nextalign

Hello, no problem at all! Thank you so much for all your (super speedy!) help. For now, I will install in my current Conda environment and test it out there. In future though, I’ll make sure to pull more often. I’ll let you know if I have any more questions. Thanks!

1 Like

Hello,
Thank you so much for all your help. This worked for me. I ran into a recursion error, but I adjusted this to be 10,000 (instead of the default 1000) and this worked perfectly!

I have (hopefully!) one last (set of) question(s). I have used the Neherlab build as a basis here (see above) to focus in on one country (using the custom_country settings from them). In my build, I see one big ‘blob’ per country, but in the Neherlab build, there is more granularity (so there is breakdown by division in each country). I know how to make the build focus on a given division (as with e.g. Washington king county in the examples). However, I would like to keep the focus country-wide, but with a breakdown per division (as in Neherlab’s build) (or potentially at more granular levels in future). Is the best way to do this to have a list of each division (or lower level) you want to focus on, specify a subsampling scheme for it (kind of like how Neherlab have specified each country in theirs) and then filter the view in auspice by country (is there a way to default it the filter initially used in Auspice?).

Thanks again!