Large memory requirements

Hello,

I’m looking for some advice on how to modify my Nextstrain runs/builds as I am starting to encounter extremely large memory requirements due to the amount of data now in GISAID.

I normally create builds for the Southeast USA, FL-subsampled (with global samples), and FL-only sequences. I run Nextstrain using a profile with a subsampling scheme based on examples from Emma. I am now using the new max_sequences key to cap the size of my builds so they are not too large. I use the nextmeta and nextfasta files downloaded from GISAID as my input files.

However, on my latest run (on an HPC using SLURM) I got an out-of-memory error even though I requested 225GB. My previous run required over 125GB of memory.

As more and more data are deposited into GISAID, I just don’t think this approach will be feasible for our runs.

Do you have any advice on how we can continue to run FL-focused subsampled builds (so we can continue to pull in global samples for context in our FL trees) but not have to use so much memory?

Thanks for your help,
Sarah Schmedes

There are two parts of the pipeline that potentially require a large amount of memory.

  1. Alignment with mafft. This can now be reduced dramatically by using our in-house alignment tool, nextalign:
    ncov/main_workflow.smk at master · nextstrain/ncov · GitHub

  2. The calculation of priorities. This used to require a lot of memory, but we recently changed how it works to reduce its memory requirements:
    ncov/main_workflow.smk at master · nextstrain/ncov · GitHub

I hope this helps,
richard

Hi Richard,

Thank you for this information. Have these steps already been automatically updated in the ncov repo? I do run the pipeline after doing a fresh pull each time.

Or do I need to make these changes manually myself?

Thanks,
Sarah

Hi @seschmedes, the reduction in memory usage by the priorities calculation will take effect automatically when you pull the latest ncov workflow. The nextalign feature is opt-in right now. You can enable it by adding use_nextalign: True to your builds.yaml file(s), like we do in the Nextstrain builds.
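If you want to confirm the opt-in is actually present before submitting a long job, here is a minimal sketch. It assumes the key sits at the top level of the file, and the BUILDS_YAML variable and its default path are placeholders I made up for illustration, so point it at your own profile's builds.yaml.

```shell
# Hypothetical default path; override with e.g. BUILDS_YAML=my_profiles/<profile>/builds.yaml
BUILDS_YAML="${BUILDS_YAML:-builds.yaml}"

# Read-only check: is there a top-level "use_nextalign: True" (or "true") line?
if [ -f "$BUILDS_YAML" ] && grep -Eq '^use_nextalign:[[:space:]]*[Tt]rue' "$BUILDS_YAML"; then
    nextalign_opt_in="enabled"
else
    nextalign_opt_in="not enabled"
fi
echo "use_nextalign in ${BUILDS_YAML}: ${nextalign_opt_in}"
```

This only reads the file; it does not change your config.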

You also need to install the nextalign binary into your workflow’s environment. There are a couple of ways to install nextalign:

  1. If you run your workflow with snakemake --use-conda, nextalign will be automatically installed after you pull the latest ncov repository changes and run the workflow. This is the best way to keep your workflow environment’s software up-to-date.

  2. If you use a nextstrain conda environment, you can install nextalign with conda install -c conda-forge -c bioconda nextalign.

  3. Otherwise, you can download a binary executable from nextalign’s latest release on GitHub and place this file in your PATH (e.g., /usr/local/bin on Linux or OS X).
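For options 2 and 3, a quick check like the one below shows whether the binary is visible on the PATH of the shell you launch the workflow from. It is only a sketch: the nextalign_status variable is an illustrative name, and the check does not apply to option 1, since with --use-conda Snakemake manages its own environments rather than installing into your shell's PATH.

```shell
# Look for a nextalign executable on PATH, however it was installed.
if command -v nextalign >/dev/null 2>&1; then
    nextalign_status="found at $(command -v nextalign)"
else
    nextalign_status="not found"
fi
echo "nextalign: ${nextalign_status}"
```

If this reports "not found" inside your SLURM job but works on the login node, the job is probably not activating the same conda environment.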

Let us know if you run into any issues with these updates, though, and we can troubleshoot them here.

Great! Thank you so much! I will try to test this out later today or tomorrow and will report back if I run into any issues.

Thank you all for your help! It’s greatly appreciated!

Thanks,
Sarah

This is a very helpful discussion! For future reference, there is now a more detailed page about using nextalign.

Has anyone tested DIAMOND (bbuchfink/diamond on GitHub), an accelerated BLAST-compatible local sequence aligner? It is published at https://www.nature.com/articles/s41592-021-01101-x