I’m looking for some advice on how to modify my Nextstrain runs/builds as I am starting to encounter extremely large memory requirements due to the amount of data now in GISAID.
I normally create builds for the Southeast USA, an FL-subsampled build (with global samples for context), and an FL-only build. I run Nextstrain using a profile with a subsampling scheme based on examples from Emma, and I now use the new max_sequences key to cap the size of each build. My inputs are the nextmeta and nextfasta files downloaded from GISAID.
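For reference, here is roughly what my scheme looks like in builds.yaml; the sample names, group-by fields, and sequence counts below are illustrative rather than my exact values:

```yaml
# Sketch of an FL-focused subsampling scheme (names and counts are examples only).
subsampling:
  florida:
    # Focal set: all Florida sequences, capped with the new max_sequences key
    focal:
      group_by: "year month"
      max_sequences: 1500
      query: --query "(division == 'Florida')"
    # Contextual set: non-Florida sequences, prioritized by proximity to the focal set
    context:
      group_by: "country year month"
      max_sequences: 1000
      query: --query "(division != 'Florida')"
      priorities:
        type: "proximity"
        focus: "focal"
```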
However, my latest run (on an HPC using SLURM) hit an out-of-memory error even though I requested 225 GB, and the run before that required over 125 GB.
As more and more data are deposited into GISAID, I just don’t think this approach will be feasible for our runs.
Do you have any advice on how we can continue to run FL-focused subsampled builds (so we can continue to pull in global samples for context in our FL trees) but not have to use so much memory?
Thank you for this information. Have these steps already been rolled into the ncov repo? I run the pipeline after doing a fresh pull each time. Or do I need to make these changes myself?
Hi @seschmedes, the reduction in memory usage from the priorities calculations will take effect automatically when you pull the latest ncov workflow. The nextalign feature is opt-in for now; you can enable it by adding use_nextalign: True to your builds.yaml file(s), as we do in the Nextstrain builds.
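For example, the flag sits at the top level of builds.yaml; the "florida" build shown here is a hypothetical placeholder, not something from your config:

```yaml
# use_nextalign is the opt-in flag; the build below is only a placeholder.
use_nextalign: True

builds:
  florida:
    subsampling_scheme: florida
    region: North America
    country: USA
    division: Florida
```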
You also need to install the nextalign binary into your workflow's environment. There are a couple of ways to do that (both shown in the sketch after this list):

1. If you run your workflow with snakemake --use-conda, nextalign will be installed automatically after you pull the latest ncov repository changes and run the workflow. This is the best way to keep your workflow environment's software up to date.
2. If you use a nextstrain conda environment, you can install nextalign with conda install -c conda-forge -c bioconda nextalign.
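In command form, assuming a hypothetical profile path and the example environment name nextstrain:

```bash
# Option 1: let Snakemake manage the environment; nextalign is installed for you.
# (my_profiles/florida is a placeholder for your own profile path)
snakemake --use-conda --profile my_profiles/florida

# Option 2: install nextalign by hand into an existing conda environment.
conda activate nextstrain
conda install -c conda-forge -c bioconda nextalign
```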