I’m looking for some advice on how to modify my Nextstrain runs/builds as I am starting to encounter extremely large memory requirements due to the amount of data now in GISAID.
I normally create three builds: a Southeast USA build, an FL-focused build subsampled with global context, and an FL-only build. I run Nextstrain using a profile with a subsampling scheme based on examples from Emma, and I am now using the new max_sequences key to cap the size of my builds so they don't get too large. I use the nextmeta and nextfasta files downloaded from GISAID as my input files.
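For reference, here is a simplified sketch of the kind of subsampling scheme I have in my builds file (the scheme/sample names, query, and sequence counts below are illustrative placeholders, not my exact config):

```yaml
subsampling:
  florida-focused:
    # Focal set: Florida sequences, capped with the new max_sequences key
    florida:
      group_by: "year month"
      max_sequences: 3000
      query: --query "division == 'Florida'"
    # Contextual set: global samples prioritized by genetic proximity to the focal set
    global-context:
      group_by: "country year month"
      max_sequences: 1000
      priorities:
        type: "proximity"
        focus: "florida"
```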
However, my latest run (on an HPC using SLURM) hit an out-of-memory error even with 225 GB requested, and the run before that already needed over 125 GB.
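In case it's relevant, this is roughly how I request resources through my Snakemake SLURM profile (the paths and values here are illustrative, not my exact profile):

```yaml
# config.yaml inside my Snakemake profile directory (illustrative)
cluster: "sbatch --mem={resources.mem_mb} --cpus-per-task={threads} --time=48:00:00"
jobs: 10
default-resources:
  - mem_mb=225000   # the request that still hit the OOM error
```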
As more and more data are deposited into GISAID, I just don't think this approach will remain feasible for our runs.
Do you have any advice on how we can keep running FL-focused subsampled builds (so we can continue to pull in global samples for context in our FL trees) without needing so much memory?
Thanks for your help,