Managing Research Software Environments - Swarthmore College

The landscape of research software is rich and varied, driven in part by improved sharing and dissemination platforms (such as GitHub) and by stronger mandates, attached to nearly any federal grant (NSF, NIH, etc.), to provide access to software and workflows. On the one hand, this gives researchers a robust toolkit to bring to bear on their scientific workflows. On the other hand, each tool may have been developed in an environment very different from the one you have access to. As a result, programs or entire workflows might rely on a particular operating system (RHEL, Debian, etc.), a specific version of a compiler, mandatory libraries, or other dependencies that are challenging, or outright impossible, to install locally. The issue cuts both ways: how can you, as a researcher, ensure that software you develop is available to and usable by the greatest number of people?

Thankfully, various tools have been developed to address this issue, though much work remains to make some of them user-friendly. This post is not meant to cover every tool and its use cases, but it introduces two of them: Anaconda and Spack.

Conda, the package and environment manager that ships with the Anaconda distribution (the two names are often used interchangeably), is one of the most robust and widely used tools for creating and sharing Python-based research environments. The basic steps, as outlined in our help documentation, involve creating a distinct named environment that contains the specific version of Python and the other packages that drive your research workflow, insulating them from system-level versions that may change or disappear. These environments can be activated and modified as needed, and can be exported easily so that anyone working on the same project uses an identical environment, a cornerstone of reproducible science.
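The create-and-export steps described above might look roughly like the following. This is a minimal sketch, not the procedure from the help documentation: it assumes conda is already installed and on your PATH, and the environment name `myproject`, the Python version, and the `numpy` package are placeholders chosen purely for illustration.

```shell
# Placeholder names for illustration only; substitute your own.
ENV_NAME="myproject"
ENV_FILE="environment.yml"

if command -v conda >/dev/null 2>&1; then
    # Create an isolated, named environment with a pinned Python version
    # and one example package.
    conda create --yes --name "$ENV_NAME" python=3.11 numpy

    # Write the environment's full package list to a file that can be
    # committed alongside the project or sent to collaborators.
    conda env export --name "$ENV_NAME" > "$ENV_FILE"

    # A collaborator would then rebuild the environment with:
    #   conda env create --file environment.yml
else
    echo "conda not found on PATH; skipping the example" >&2
fi
```

In an interactive shell you would normally run `conda activate myproject` before starting work and `conda deactivate` when finished.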

Conda works well for projects that are entirely, or largely, Python-based, but what happens when you need a broader suite of tools, or have software that requires a newer or older version of low-level system tools such as gcc? Spack provides a way to manage such an environment (note: Spack is exceptionally complex, so this brief post can only show a very simple use case). While a more complete help document will be forthcoming, an excellent Spack tutorial is available. In general, it works much like Conda: you create an environment and install specific versions of software, which become available when the environment is activated. And, like Conda, these environments can be exported and shared. Here is a brief example that installs a local copy of the most recent gcc available in this Spack release on a Linux system:

git clone --depth=100 --branch=releases/v0.20 https://github.com/spack/spack.git ~/spack
cd ~/spack
. share/spack/setup-env.sh
spack env create testenv
spack env activate testenv
spack add gcc
spack install

At this point, you can check whether it worked by running:

which gcc

It should output a path similar to this:

~/spack/var/spack/environments/testenv/.spack-env/view/bin/gcc

Assuming it does, add the environment's libraries to your library search path, keeping any entries that are already set:

export LD_LIBRARY_PATH="$HOME/spack/var/spack/environments/testenv/.spack-env/view/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
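As a further sanity check, you could compile and run a tiny C program with the new compiler. The sketch below assumes the Spack environment is still active, so that `gcc` resolves to the Spack-built copy; the file name `hello.c` and the temporary working directory are illustrative choices, not part of the original steps.

```shell
# Work in a throwaway directory so nothing is left behind in your project.
workdir="$(mktemp -d)"

# A one-line C program that reports the version of the compiler that built it.
cat > "$workdir/hello.c" <<'EOF'
#include <stdio.h>
int main(void) { printf("built with gcc %s\n", __VERSION__); return 0; }
EOF

if command -v gcc >/dev/null 2>&1; then
    # Show which gcc is first on PATH and its version banner.
    command -v gcc
    gcc --version | head -n 1
    # Compile the test program and run it.
    gcc -o "$workdir/hello" "$workdir/hello.c" && "$workdir/hello"
else
    echo "gcc not found on PATH; is the Spack environment active?" >&2
fi
```

If the version printed matches the one Spack installed, the environment's compiler is the one being used.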

When finished working in your Spack environment, deactivate it with:

spack env deactivate
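As noted earlier, Spack environments can be shared much like Conda ones. One way to do this, sketched here under the assumption that Spack was cloned to `~/spack` as above, is to copy the environment's `spack.yaml` (what you asked for) and `spack.lock` (the exact versions Spack resolved) to a location you can share. The `testenv-share` directory and the `testenv-copy` name are hypothetical.

```shell
# Location of the environment created earlier in this post.
ENV_DIR="$HOME/spack/var/spack/environments/testenv"
# Hypothetical destination; could be a project repository instead.
SHARE_DIR="${SHARE_DIR:-$HOME/testenv-share}"

mkdir -p "$SHARE_DIR"

# Copy whichever description files exist; spack.lock only appears once
# the environment has been concretized or installed.
for f in spack.yaml spack.lock; do
    if [ -f "$ENV_DIR/$f" ]; then
        cp "$ENV_DIR/$f" "$SHARE_DIR/"
    fi
done

# On another machine with Spack set up, the environment could then be
# recreated from the lock file with:
#   spack env create testenv-copy spack.lock
#   spack env activate testenv-copy
#   spack install
```

Recreating from `spack.lock` aims for the same resolved versions, while recreating from `spack.yaml` lets Spack resolve them afresh on the new system.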

The need to manage complex software environments is only growing, but tools exist to help. They are useful on dedicated research systems such as Strelka, and many of them also work on your local system; if needed, you can export an environment and recreate it elsewhere. This is a best practice for reproducible science and a great habit to develop!
