I’m looking for some advice on how to modify my Nextstrain runs/builds as I am starting to encounter extremely large memory requirements due to the amount of data now in GISAID.
I normally create builds for the Southeast USA, an FL-subsampled build (with global samples for context), and an FL-only build. I run Nextstrain using a profile with a subsampling scheme based on examples from Emma, and I now use the new max_sequences key to cap the size of each build. My inputs are the nextmeta and nextfasta files downloaded from GISAID.
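For reference, here is roughly what my scheme looks like in builds.yaml; the sample names, group-by fields, and sequence counts below are illustrative rather than my exact values:

```yaml
# Sketch of an FL-focused subsampling scheme (names and counts are examples only).
subsampling:
  florida:
    # Focal set: all Florida sequences, capped with the new max_sequences key
    focal:
      group_by: "year month"
      max_sequences: 1500
      query: --query "(division == 'Florida')"
    # Contextual set: non-Florida sequences, prioritized by proximity to the focal set
    context:
      group_by: "country year month"
      max_sequences: 1000
      query: --query "(division != 'Florida')"
      priorities:
        type: "proximity"
        focus: "focal"
```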
However, my latest run (on an HPC using SLURM) hit an out-of-memory error even though I requested 225 GB, and the run before that required over 125 GB.
As more and more data are deposited into GISAID, I just don’t think this approach will be feasible for our runs.
Do you have any advice on how we can continue to run FL-focused subsampled builds (so we can continue to pull in global samples for context in our FL trees) but not have to use so much memory?
Thank you for this information. Have these steps already been rolled into the ncov repo? I run the pipeline after doing a fresh pull each time. Or do I need to make these changes myself?
Hi @seschmedes, the reduction in memory usage from the priorities calculations will take effect automatically when you pull the latest ncov workflow. The nextalign feature is opt-in for now; you can enable it by adding use_nextalign: True to your builds.yaml file(s), as we do in the Nextstrain builds.
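For example, the flag sits at the top level of builds.yaml; the "florida" build shown here is a hypothetical placeholder, not something from your config:

```yaml
# use_nextalign is the opt-in flag; the build below is only a placeholder.
use_nextalign: True

builds:
  florida:
    subsampling_scheme: florida
    region: North America
    country: USA
    division: Florida
```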
You also need to install the nextalign binary into your workflow's environment. There are a couple of ways to do that (both shown in the sketch after this list):

1. If you run your workflow with snakemake --use-conda, nextalign will be installed automatically after you pull the latest ncov repository changes and run the workflow. This is the best way to keep your workflow environment's software up to date.
2. If you use a nextstrain conda environment, you can install nextalign with conda install -c conda-forge -c bioconda nextalign.
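In command form, assuming a hypothetical profile path and the example environment name nextstrain:

```bash
# Option 1: let Snakemake manage the environment; nextalign is installed for you.
# (my_profiles/florida is a placeholder for your own profile path)
snakemake --use-conda --profile my_profiles/florida

# Option 2: install nextalign by hand into an existing conda environment.
conda activate nextstrain
conda install -c conda-forge -c bioconda nextalign
```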