Make reproducible workflows with conda environments and pinned augur versions

Summary

  • Define conda environments per rule in your Snakemake workflows.
  • Pin augur versions in conda environment files.
  • Run your workflows with snakemake --use-conda.

Software management for reproducible workflows

Workflows help us define and run complex analyses through a standard interface. An implicit assumption we make when we create a workflow is that we will want to run it many times. Often, we want to enable others (including our future selves) to run the same workflow on a different computer and get the same results. To this end, we need a way to control the software that is available to our workflows. Two common solutions to this problem are to run workflows in:

  • containers (e.g., Docker and Singularity) that provide a maintainer-defined file system and operating system with all dependencies preinstalled
  • environments (e.g., conda and virtualenvs) that provide a user-defined collection of software that is designed to run on your existing operating system

The Nextstrain team uses both of these solutions in different contexts. We use containers via the Nextstrain CLI to run SARS-CoV-2 and seasonal influenza workflows on AWS Batch. We use conda environments to run workflows on high-performance compute clusters where Docker is not supported for security reasons. We also use conda environments when our workflows require custom software or specific versions of software that are not included in pre-built container images.

Conda environments are defined by a single YAML file that lists which software packages, and which versions of those packages, should be installed. For example, the Nextstrain team uses a standard conda environment for most projects. When a package is listed without a specific version, conda will install the latest version.

Managing workflow software with Snakemake

Workflow managers like Snakemake and Nextflow support running workflow jobs inside custom conda environments. The example below, from the Snakemake documentation, shows how each rule can define its own conda environment.

rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"

To run your workflow with conda environments enabled, provide the --use-conda flag as shown below.

snakemake --cores 1 --use-conda

The first time you run Snakemake with the --use-conda flag, Snakemake automatically:

  • creates a new conda environment for each distinct environment definition in the workflow
  • installs the packages defined for that environment
  • activates an environment prior to running each rule that uses it

The next time you run Snakemake with the --use-conda flag, Snakemake detects the existing environments (stored in .snakemake/conda/, by default) and activates each environment as needed for each rule. When you change a conda environment by modifying its YAML file and run Snakemake with --use-conda, Snakemake will detect the change and create a new environment based on the updated environment definition.
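
One way to picture this change detection is to derive a key from the contents of the environment file and to reuse an environment only when its key already exists. The sketch below illustrates that general idea only. It is not Snakemake's actual implementation, and the environment_key function is a hypothetical name introduced here.

```python
import hashlib


def environment_key(env_yaml_text):
    """Return a short key derived from the contents of an environment file.

    Hypothetical helper for illustration; not part of Snakemake.
    """
    digest = hashlib.sha256(env_yaml_text.encode("utf-8")).hexdigest()
    return digest[:12]


old_env = "dependencies:\n- pip:\n  - nextstrain-augur==9.0.0\n"
new_env = "dependencies:\n- pip:\n  - nextstrain-augur==10.0.0\n"

old_key = environment_key(old_env)
new_key = environment_key(new_env)

# An unchanged definition maps to the same key, so the environment can be reused.
print(old_key == environment_key(old_env))  # True
# Any edit to the file yields a different key, so a new environment is needed.
print(old_key != new_key)  # True
```

Because the key depends only on the file contents, reverting an edit would map back to the earlier key in this sketch.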

Defining a conda environment for your workflow

Nextstrain’s SARS-CoV-2 workflows define the same conda environment for each rule. The full environment file looks like this (with inline comments added here for clarity):

# The name of your conda environment. This matters for
# environments you activate manually, but it does not matter
# for environments maintained and activated by Snakemake.
name: nextstrain
# These channels tell conda where to look for the packages
# requested below. Many core bioinformatics tools are
# available through the "bioconda" channel. Many core
# libraries in Python's scientific stack are available through
# the "defaults" channel. For more details see:
# https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html
channels:
- bioconda
- defaults
# The dependencies list packages that should be installed
# from the conda channels above. These include tools augur
# needs to create multiple sequence alignments and trees.
# Note that package versions use the conda-specific syntax of
# `package=version`.
dependencies:
- cvxopt
- datrie
- fasttree
- iqtree
- mafft
- opencv
- pandas
- psutil
- python=3.6*
- nodejs=10
- raxml
- vcftools
- git
- pip
# The pip dependencies list Python packages that are not
# available through conda channels or that we prefer to
# install from PyPI with pip. Note that package versions
# here use the pip-specific syntax of `package==version`.
# For more details about installing Python packages see:
# https://packaging.python.org/tutorials/installing-packages/
- pip:
  - awscli==1.18.45
  - nextstrain-augur==9.0.0
  - nextstrain-cli==1.16.5
  - rethinkdb==2.3.0.post6

Notice that most of the software we request doesn’t have a specific version. This means we prefer to use the latest available version of each package. Other packages are pinned to a specific version because our workflow relies on features that are available only in those versions.
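
To audit which packages in an environment are pinned, a small script can separate the two groups. The following is a minimal sketch that works on a list of requirement strings copied from the file above rather than parsing the YAML itself. The split_pinned function is a hypothetical helper, and it assumes pins use only the `package=version` or `package==version` forms.

```python
import re


def split_pinned(requirements):
    """Split requirement strings into (pinned, unpinned) lists.

    A requirement counts as pinned when a package name is followed by
    `=` (conda syntax) or `==` (pip syntax) and a version.
    """
    pinned, unpinned = [], []
    for requirement in requirements:
        text = requirement.strip()
        # Package name, then one or two "=" characters, then the version.
        match = re.fullmatch(r"([A-Za-z0-9_.-]+?)={1,2}(.+)", text)
        if match:
            pinned.append((match.group(1), match.group(2)))
        else:
            unpinned.append(text)
    return pinned, unpinned


# A few entries copied from the environment file above.
requirements = ["mafft", "iqtree", "python=3.6*", "nodejs=10", "nextstrain-augur==9.0.0"]
pinned, unpinned = split_pinned(requirements)

print(pinned)    # [('python', '3.6*'), ('nodejs', '10'), ('nextstrain-augur', '9.0.0')]
print(unpinned)  # ['mafft', 'iqtree']
```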

Pinning specific versions of software in a conda environment

We specify, or pin, the specific version of the augur package, nextstrain-augur, when we want access to the features provided by that version and we don’t want to break our workflow by automatically installing the latest version of augur that may include backwards-incompatible or “breaking” changes.

Augur uses Semantic Versioning rules such that each version has the format X.Y.Z, where X is the major release number, Y is the minor release number, and Z is the patch release number. These rules mean that you are generally safe installing the latest augur package within a given major version (e.g., 9.0.0). They also mean you should upgrade to the next major version (e.g., 10.0.0) only after consulting the augur changelog and testing your workflow.
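
The major-version rule can be expressed as a short check. This is a minimal sketch that assumes plain X.Y.Z version strings with no pre-release suffixes. The function names are hypothetical, and the version 9.1.2 is used only as an illustration.

```python
def parse_version(version):
    """Parse an X.Y.Z version string into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def crosses_major_version(current, candidate):
    """Return True when the two versions have different major release numbers."""
    return parse_version(current)[0] != parse_version(candidate)[0]


# Staying within major version 9 is generally safe under Semantic Versioning.
print(crosses_major_version("9.0.0", "9.1.2"))   # False
# Moving from 9 to 10 may include breaking changes: check the changelog first.
print(crosses_major_version("9.0.0", "10.0.0"))  # True
```

Comparing integers rather than strings matters here, because a string comparison would sort "10.0.0" before "9.0.0".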

When you are ready to upgrade augur, change the version in the conda environment file to the version you’d like to use.

# Edit the conda environment file with your preferred editor.
# The `open` command shown here is specific to macOS.
open workflow/envs/nextstrain.yaml

The following example shows the pip snippet of the full conda environment from above with augur upgraded to version 10.0.0.

- pip:
  - awscli==1.18.45
  - nextstrain-augur==10.0.0
  - nextstrain-cli==1.16.5

After you make this change and run your workflow with snakemake --cores 1 --use-conda, Snakemake will automatically detect the change and create a new environment with the requested version of augur.
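
If you want to confirm which version ended up in an environment, one option is to ask Python's package metadata from inside that environment. The sketch below assumes Python 3.8 or later, where importlib.metadata is part of the standard library. That is newer than the python=3.6* pin in the example environment above, so treat this as an illustration for more recent environments. The helper names are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, pinned_version):
    """Return True when the installed version equals the pinned version."""
    return installed_version(package) == pinned_version


# The printed values depend on the environment this runs in.
print(installed_version("nextstrain-augur"))
print(matches_pin("nextstrain-augur", "10.0.0"))
```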

Overriding the default conda environment in ncov workflows

By default, the Nextstrain ncov workflow will use the conda environment file described above. The path to this file is defined in defaults/parameters.yaml by the following line:

conda_environment: "../envs/nextstrain.yaml"

You can override the default value by passing a path to your own conda environment file to Snakemake’s configuration. The easiest way to make this change is to modify your profile’s config.yaml file (see the SARS-CoV-2 Nextstrain tutorial for more about profiles). For example, the following code block shows the config for the “getting started” profile with an additional conda_environment setting.

# This analysis-specific config file overrides the settings in the default config file.
# If a parameter is not defined here, it will fall back to the default value.

configfile:
  - defaults/parameters.yaml # Pull in the default values
  - my_profiles/getting_started/builds.yaml # Pull in our list of desired builds

config:
  - sequences=data/example_sequences.fasta
  - metadata=data/example_metadata.tsv
  # Override the default conda environment.
  # This path is relative to the `workflow/snakemake_rules/` directory.
  - conda_environment=../../my_profiles/getting_started/conda.yaml

# Set the maximum number of cores you want Snakemake to use for this pipeline.
cores: 1

# Always print the commands that will be run to the screen for debugging.
printshellcmds: True

This approach allows you to try out different combinations of software or different versions of existing software (e.g., the latest version of augur) without disrupting your existing workflow.
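
Because the conda_environment path is relative to the workflow/snakemake_rules/ directory, it can help to check where a given value points before running the workflow. The sketch below resolves both the default and the overridden paths from this post. The resolve_from_rules_dir helper is hypothetical, and the output is relative to the top level of the ncov repository.

```python
import posixpath

# Directory that conda_environment paths are relative to, per the config comment above.
RULES_DIR = "workflow/snakemake_rules"


def resolve_from_rules_dir(relative_path):
    """Resolve a path relative to the rules directory into a repository-relative path."""
    return posixpath.normpath(posixpath.join(RULES_DIR, relative_path))


print(resolve_from_rules_dir("../envs/nextstrain.yaml"))
# workflow/envs/nextstrain.yaml
print(resolve_from_rules_dir("../../my_profiles/getting_started/conda.yaml"))
# my_profiles/getting_started/conda.yaml
```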

What’s next?

We’re constantly trying to make our workflows easier to maintain and to avoid breaking them when we release new versions of augur and other Nextstrain tools. In the future, we hope to provide better support for the augur Bioconda package, Docker images that support the kind of --use-conda pattern described above, and maybe even Snakemake wrappers.

We would love to hear more about how you manage your dependencies in workflows like this and what pitfalls you’ve learned to avoid. Please comment below if you would like to share your experiences.
