Content from Reproducibility definition
Last updated on 2025-06-24
Estimated time: 12 minutes
Overview
Questions
- What is the definition of reproducibility?
- What are various kinds or degrees of reproducibility?
- Why are Jupyter and Knitr notebooks not sufficient by themselves for reproducibility?
Objectives
- To articulate the definition of reproducibility in the domain of computational science.
- To assess the kind of reproducibility offered by tools you already know of.
Definitions
Reproducibility according to the ACM means: the ability to obtain a measurement with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials.
This definition derives from metrology, the science of measuring, and we will specialize the terms “measurement”, “measurement procedure”, and “operating conditions” to the software context of computational science experiments (from here on, CSEs).
Measurement procedure (for CSEs): the application code and user-interactions required to run the application code (if any).
Operating conditions (for CSEs): a set of conditions the computer system must implement in order to run a certain program. E.g., a POSIX environment, GCC 12 installed in /usr/bin/gcc, compiled with these flags.
For every CSE, there are probably some operating conditions under which the measurement can be made. To make our definition not vacuous, “reproducibility” will require that all relevant operating conditions be documented (e.g., “README.md states you must have GCC 12 in /usr/bin/gcc”). Operating conditions can be eliminated by moving them into the measurement procedure (e.g., the program itself contains a copy of GCC 12). For the purposes of this lesson, the operating conditions are the “manual” steps that the user has to take in order to use the measurement procedure to make the measurement. One may over-specify operating conditions without changing the reproducibility status (e.g., one might say their software requires GCC 12, but really GCC 13 would work); it is quite difficult and often unnecessary to know the minimal set of operating conditions, so in practice we usually have a larger-than-necessary set of operating conditions.
Often in UNIX-like systems, the only relevant conditions are that certain objects be on certain “load paths”, specified by environment variables. E.g., have Python 3.11 in $PATH, have Numpy 1.26 in $PYTHONPATH, and have a BLAS library in $LIBRARY_PATH. In such cases, it doesn’t matter where those programs are on disk; the only relevant “operating condition” is that the environment variables are set to point to those programs at a compatible version.
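For instance, a project could ship a small script that checks whether the documented operating conditions hold before the measurement procedure is run. The following is a minimal sketch in Python, assuming the example conditions above (Python 3.11 and NumPy 1.26); the version numbers are only the ones used in the example, not requirements of this lesson.

```python
import sys

def check_operating_conditions():
    """Return a list of unmet operating conditions (empty list means all met)."""
    problems = []
    if sys.version_info[:2] != (3, 11):
        problems.append(
            f"expected Python 3.11, found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    try:
        import numpy
    except ImportError:
        problems.append("NumPy is not importable (is $PYTHONPATH set correctly?)")
    else:
        if not numpy.__version__.startswith("1.26"):
            problems.append(f"expected NumPy 1.26, found {numpy.__version__}")
    return problems

if __name__ == "__main__":
    for problem in check_operating_conditions():
        print("unmet operating condition:", problem)
```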
Measurement (for CSEs):
- Crash-freedom: program produces result without crashing.
- Bitwise equivalence: output files and streams are identical.
- Statistical equivalence: overlapping confidence intervals, etc.
- Inferential equivalence: whether the inference is supported by the output.
- Others: domain-specific measurements/equivalences (e.g., XML-equivalence ignores order of attributes)
In general, it is difficult to find a measurement that is both easy to assess and scientifically meaningful.
Measurement | Easy to assess | Scientifically meaningful |
---|---|---|
Crash-freedom | Yes; does it crash? | Too lenient; could be no-crash but completely opposite result |
Bitwise equivalence | Yes | Too strict; could be off by one decimal point |
Statistical equivalence | Maybe; need to know output format | Maybe; need to know which statistics can be off |
Inferential equivalence | No; need domain experts to argue about it | Yes |
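To make the difference concrete, here is a minimal Python sketch of how one might check two of these measurements; the file paths and the confidence-interval format are hypothetical, and statistical equivalence is reduced here to a simple interval-overlap test.

```python
import hashlib

def bitwise_equivalent(path_a, path_b):
    """Bitwise equivalence: the two output files hash to the same digest."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    return digest(path_a) == digest(path_b)

def statistically_equivalent(ci_a, ci_b):
    """One simple flavor of statistical equivalence: the reported
    confidence intervals, given as (low, high) pairs, overlap."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```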
We explicitly define reproducibility because not everyone uses the ACM definition. Reproducible Builds and Google’s Building Secure and Reliable Systems use bit-wise equivalence only. Operating on different definitions without realizing it leads to disagreements. Defining reproducibility with respect to a measurement and operating conditions is more useful; one can refer to different kinds and degrees of reproducibility in different conditions.
Composition of measurement procedures: The outcome of one measurement may be the input to another measurement procedure. This can happen in CSEs as well as in physical experiments. In physical experiments, one may use a device to calibrate (measure) some other device, and use that other device to measure some scientific phenomenon. Likewise, in a CSE, the output of compilation may be used as the input to another CSE. One can measure a number of relevant properties of the result of a software compilation.
Compilation measurement | Definition |
---|---|
Source equivalence | Compilation used exactly this set of source code as input |
Behavioral equivalence | The resulting binary has the same behavior as some other one |
Bit-wise equivalence | As before, the binary is exactly the same as some other one |
E.g., suppose one runs gcc main.c on two systems, and one system uses a different version of unistd.h, which is #included by main.c. The process (running gcc main.c) does not reproduce source-equivalent binaries, but it might reproduce behavior-equivalent or bit-wise equivalent binaries (depending on how different unistd.h is).
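These compilation measurements can be checked in different ways. For source equivalence in this particular example, one rough approach is to hash the fully preprocessed translation unit, which embeds whatever unistd.h was actually used, rather than comparing the resulting binaries. A minimal sketch, assuming gcc and a main.c are available (-P suppresses linemarkers so that differing include paths alone don’t change the hash):

```python
import hashlib
import subprocess

# Preprocess main.c; the output embeds the contents of every #included header.
preprocessed = subprocess.run(
    ["gcc", "-E", "-P", "main.c"], capture_output=True, check=True
).stdout

# Two systems that produce the same digest used (effectively) the same sources.
print(hashlib.sha256(preprocessed).hexdigest())
```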
Related terms
Replicability and repeatability are similar to reproducibility; the only differences are whether the team is the same as or different from the original, and whether the measurement procedure is the same as or different from the original.
Term | Team | M. Proc | Example |
---|---|---|---|
Repeatability | Same | Same | I can run ./script twice and get the same result |
Reproducibility | Diff | Same | ./script on my machine and ./script on your machine give the same result |
Replicability | Diff | Diff | ./script and a new script that you wrote for the same task on your machine give the same result |
Replicability is of course the goal of scientific experiments, because the measurement can be made in different ways but still give consistent results. However, replicability involves re-doing some or all of the work, so it is expensive to pursue in practice. Therefore, we examine repeatability and reproducibility instead.
Computational provenance is a record of the process by which a particular computational artifact was generated (retrospective) or could potentially be generated (prospective)1. Provenance of a final artifact may include the provenance of the artifacts used by the final process, and so on recursively: the provenance of the artifacts used to generate the artifacts used to generate (and so on…) the final artifact.
Provenance is related to reproducibility because sufficiently detailed provenance (prospective or retrospective) may allow other users to reproduce the process within some equivalence.
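As an illustration, a retrospective provenance record for a single artifact might look something like the following sketch; the field names, commands, and file names are made up for illustration, not a standard provenance format.

```python
# A toy retrospective provenance record, nested one level deep.
provenance = {
    "artifact": "results/summary.csv",
    "generated_by": {
        "command": ["python", "analyze.py", "data.csv"],
        "inputs": ["analyze.py", "data.csv"],
        "environment": {"python": "3.11.9", "numpy": "1.26.4"},
    },
    # Provenance can nest: how were the inputs themselves generated?
    "input_provenance": {
        "data.csv": {
            "command": ["python", "collect.py"],
            "inputs": [],
        },
    },
}
```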
Although there are many definitions of computational workflows, the most relevant definition is a program written in a language that explicitly represents a loosely-coupled, coarse-grained dataflow graph. “Loosely-coupled” and “coarse-grained” have some wiggle room, but usually each node represents an isolated process or container and the edges are usually files or parameters; GNU Make and Snakemake are examples. Workflows improve reproducibility by automating manual commands, thereby documenting the measurement procedure for others. Often workflows are the alternative for projects which previously ran a pile of scripts in a specific order known only to the developers. Workflows can be seen as a form of prospective provenance, and workflow engines are a natural place to emit prospective and retrospective provenance. Workflows are not necessary for reproducibility, so long as the relevant measurement procedure is otherwise documented or automated, and they are sufficient only when the workflow language is sufficiently detailed (i.e., detailed enough to capture all of the relevant operating conditions).
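The following toy Python sketch mimics what a workflow engine does with such a graph: each rule is an isolated command whose inputs and outputs are files, and rules are run in dependency order. The rules and file names are invented for illustration; a real project would use an engine such as Make or Snakemake rather than this script.

```python
import subprocess

# A coarse-grained dataflow graph: target file -> (input files, command).
rules = {
    "results/summary.csv": {
        "inputs": ["data.csv"],
        "command": ["python", "analyze.py", "data.csv", "results/summary.csv"],
    },
    "figures/plot.png": {
        "inputs": ["results/summary.csv"],
        "command": ["python", "plot.py", "results/summary.csv", "figures/plot.png"],
    },
}

def build(target):
    rule = rules.get(target)
    if rule is None:
        return  # not produced by any rule; assume it is an existing source file
    for dependency in rule["inputs"]:
        build(dependency)
    subprocess.run(rule["command"], check=True)

build("figures/plot.png")
```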
Literate programming is a medium where a program’s source code is embedded in an explanation of how it works2. Systems that facilitate literate programming (from here on, literate programming systems) often also permit the user to execute snippets of their code and embed the result automatically in a report. Programs written in literate programming systems with cell execution used in data science are sometimes called “notebooks”, due to their resemblance to a lab notebook. Jupyter, Knitr, and Quarto are examples of literate programming systems with cell execution. These are often discussed in the context of reproducibility (see 1, 2, 3, 4, 5). Like workflows, notebooks automate procedures that may otherwise be manual (thereby documenting them for others); they can even contain the output of the code, so one can verify whether their result is similar to the authors’. Also like workflows, notebooks are not sufficient by themselves for reproducibility, although they can be a valuable part of a reproducible artifact; there are often necessary operating conditions (must have Numpy version 1.26; must have data.csv) that are documented outside of the notebook or not documented at all. Since notebooks support cell execution triggered by a UI element, the cells can be executed in a manual order, which becomes an undocumented part of the measurement procedure.
Containers and virtual machines (VMs) are related to reproducibility. These will be discussed in depth in a future episode.
Content from Package management theory
Last updated on 2025-06-24
Estimated time: 12 minutes
Overview
Questions
- TODO
Objectives
- To explain the terms: from-source package manager, binary package manager, single-version package repository, multiple-version package repository, and the diamond dependency problem
- To synthesize the pros/cons of from-binary vs from-source and single-coherent-set vs SAT-solving
A package manager is a program that manages (installs, upgrades, removes) a given target software and the target’s software dependencies.
There are several distinguishing features of package managers:
1. whether they install packages by compiling them from source or by downloading binary files
2. whether they attempt to solve dependency constraints
3. whether they operate on a local or system-wide environment
From-source or binary
Binary package managers install software by downloading pre-compiled files and placing them in certain paths.
From-source package managers install software primarily by downloading source code, compiling it, and placing it in certain paths. However, from-source package managers may optionally use a binary cache, from which they can download a pre-compiled binary; the salient difference between from-source package managers with a binary cache and binary-only package managers is that the former can always “fall back” on compiling from source when the binary cache can’t be used for whatever reason.
This design choice influences the following qualities:
- Speed: downloading pre-compiled binaries is often faster (assuming the network is fast enough relative to the slowness of compilation). However, binary caches speed up from-source package managers in common cases1.
- Security: users of binary package managers implicitly trust the package repository storage/delivery and the compiler stack of the maintainer who compiled the package in the first place. However, organizations may audit builds, which can move trust from the maintainer to the auditor (which may be the end-user). From-source package managers implicitly trust the package repository storage/delivery and the compiler stack on the user’s system.
- Flexibility: compilation may include certain assumptions about the runtime system (e.g., dynamic library RUNPATH) or certain extensions (e.g., --disable-ipv6). Users of binary package managers are subject to whatever compile-time options the maintainer used. Maintainers can alleviate this by offering multiple variants of the binary, at greater storage cost. From-source package managers make it much easier to automatically or manually change compile-time options.
- Infrastructure cost: binary packages are expensive to store. A binary is specific to a CPU instruction-set architecture (ISA) (e.g., x86 64-bit and ARMv8-A), operating system kernel2 (e.g., Linux, MacOS, and Windows), and version of the software. Each package manager has a different binary format, be it DEB for APT, RPM for YUM/DNF, Wheels for Pip, etc., so reuse between package managers is rare. Meanwhile, the version-controlled source code is shared by users of all CPU ISAs, all OS kernels, all versions3, and all from-source package managers4.
In my opinion, the most important factor is infrastructure cost. Storing binaries for every ISA, OS, and software version is expensive, so nobody does it for free forever. For example, APT depends on volunteers or organizations to host repositories for free but not forever; ftp.debian.org has nothing older than Debian 10.x (2019 – expected 2024) at the time of writing. Software environments based on Debian 9.x (new in 2017) may not be reproducible today. The same problems exist for Ubuntu, Fedora, CentOS, and even multi-platform binary package managers like Conda.5
See caveats about Docker6; I will use it here to demonstrate “starting from a clean install and running one command.”
```dockerfile
FROM debian:9
RUN apt-get update
# At the time of writing, this Dockerfile will fail at this step.
```
On the other hand, storing the source code is much less expensive, and Software Heritage committed to storing the source code of important open-source infrastructure for free for a long time.7
Many software projects not only release the source code; they often allow users to modify it. This makes their software more debuggable, repairable, and extensible; if it breaks, users can fix it themselves. However, the benefits of open source are not realized if there is more friction in installing from source than installing from binary. Users would have to locate the source code, download it, learn its build system, find the package manager’s build recipe for it, learn the package manager’s packaging system, build the binary package, serve the binary from a package repository, add the repository to their package manager’s sources list, and install it. When installing from binaries is the common case, installing from source may become neglected and flaky.
However, the process of compiling software has its own operating conditions necessary for equivalent-behavior reproducibility8 (e.g., “requires GCC 12”). But who compiles the compiler? The operating conditions of the package manager itself can be thought of as a “trusted computing base”.
A trusted computing base (TCB) is a term from computer security for a small set of software that security analysts assume correctly implements a certain contract, which is usually kept dead-simple. They develop guarantees of the form “assuming the TCB is correct with respect to its contract, this high-level software correctly implements this high-level contract”. For package managers, the TCB is a small set of software that is “trusted to be reproducible” rather than “trusted to be secure”. As in computer security, package managers should strive to keep their TCB as easy to reproduce as possible, either by making it small or by providing a good installer that enforces the operating conditions.9
One may try to pre-compile binary packages for all relevant platforms, but one simply can’t predict what ISAs their users will have in the future. The set of popular ISAs is in constant flux:
- Itanium (2001 – 2021) processors are no longer purchasable
- SPARC (1987 – 2029 expected) is planned to go that way too
- PowerPC (2006) is exceedingly rare for desktop consumers at the time of writing
- x86 64-bit (2000) and ARMv8-A (2011) have soared (for ARM, due to the popularity of the Raspberry Pi and Apple M-series), with backwards compatibility for ARMv7-A (2007) and x86 32-bit (1985)
- But even the hegemony of x86-64 is not certain, given Intel’s proposed x86S (2023).
For these reasons, I suggest using a from-source package manager for long-term reproducibility.
TODO: table
Dependency constraint solving or not
Every package has a version, and versions are ordered in a sequence10.
A dependency constraint occurs when operating some versions of A requires operating B (“operate” usually means compile or run). Since packages have multiple versions, each version of A may depend on having one of a set of versions of B. Some versions of A may not depend on B at all, but so long as other versions of A do, we will say A has a dependency constraint on B. Usually, dependency constraints map a range of versions of A to a range of versions of B, where a range is an upper- and lower-bound, but more complex constraints are possible. E.g., Numba 0.58 requires Python at least 3.8 and at most 3.11; Numba 0.55 requires Python at least 3.7 and at most 3.10; and so on. If the dependency constraint is violated, operating the package invokes undefined behavior.
A diamond dependency is a set of packages and dependency constraints where A depends on B and C, B and C depend on D, and the user wants to install A11.
A consistent set is a mapping from packages to versions, satisfying dependency constraints.
Dependency constraint solving is the process of finding a consistent set. If the user asks for Python and Numba, package version resolution may select Numba==0.58 and Python==3.11.
Dependency constraint solving for a diamond dependency is particularly tricky. Suppose one selects a candidate for A (perhaps defaulting to the latest), and then selects compatible candidates for B and C. However, those candidate versions of B and C may require different versions of D. One may try different candidates for B and C that are compatible with the selected candidate for A, but perhaps none of these will work. In that case, one has to “backtrack” even further and select a new candidate for A. A itself may have been the downstream package of another diamond dependency, so one may have to backtrack on those candidates, and so on.
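The following toy Python sketch makes the search concrete. The packages, versions, and constraints are invented; a real resolver backtracks more cleverly, but the exhaustive search below shows why the newest versions of everything may not form a consistent set.

```python
from itertools import product

versions = {
    "A": [2, 1],  # candidate versions, newest first
    "B": [2, 1],
    "C": [2, 1],
    "D": [3, 2, 1],
}

# constraints[(package, version)] maps each dependency to its allowed versions.
constraints = {
    ("A", 2): {"B": {2}, "C": {2}},
    ("A", 1): {"B": {1, 2}, "C": {1, 2}},
    ("B", 2): {"D": {3}},
    ("B", 1): {"D": {1, 2}},
    ("C", 2): {"D": {2}},  # conflicts with B==2, which needs D==3
    ("C", 1): {"D": {1, 2}},
}

def consistent(assignment):
    for (package, version) in assignment.items():
        for dependency, allowed in constraints.get((package, version), {}).items():
            if assignment[dependency] not in allowed:
                return False
    return True

# Naive search over all candidate combinations; in the worst case this
# enumerates exponentially many assignments, which is why real resolvers
# backtrack (or call a SAT solver) instead.
for candidate in product(*(versions[p] for p in versions)):
    assignment = dict(zip(versions, candidate))
    if consistent(assignment):
        print("consistent set:", assignment)
        break
```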
In general, dependency constraint solving is NP-complete12 due to the necessity of backtracking in diamond dependencies. Oversimplifying the details, NP-completeness means the runtime of the fastest known13 algorithm scales exponentially14 with the input.
Dependency constraint solving is often implemented as a call to an external boolean satisfiability (SAT) solver. Since they are used in many problems, SAT solvers have been optimized for decades; their runtime is still exponential, but a lesser exponential than home-built solutions.
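As a tiny illustration of the encoding, the diamond conflict from the earlier sketch can be handed to an off-the-shelf SAT solver. The example below uses the pycosat bindings (assumed to be installed; any SAT interface would do); variable 1 stands for “D==3 is selected” and variable 2 for “D==2 is selected”.

```python
import pycosat  # Python bindings to the PicoSAT solver

clauses = [
    [1, 2],    # at least one version of D is selected
    [-1, -2],  # at most one version of D is selected
    [1],       # the newest B requires D==3
    [2],       # the newest C requires D==2
]

# Prints "UNSAT": with the newest B and C there is no consistent choice of D,
# so a real resolver would have to backtrack to older candidates.
print(pycosat.solve(clauses))
```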
Some package managers attempt to reduce the complexity of solving by the following strategies:
- Reducing the number of candidate versions by using fewer package repositories (Conda’s strict package priority and Conda meta-channel).
- Ordering the candidates such that the correct solve is found earlier (reordering Conda channels).
- NodeJS has a notion of “public dependencies” (called “peer dependencies” by NodeJS developers) and “private dependencies” (called “dependencies” by NodeJS developers). Private dependencies don’t need to be consistent with each other, so the more dependencies that can be treated as private rather than public, the simpler the dependency graph becomes. When B and C privately depend on D, require("D") from B can load a different version than require("D") from C (see Understanding the npm dependency model by King on their blog, 2016). However, if types or global variables from one package end up in the public interface of another, it has to be a public dependency.
- Nix can also make dependencies “private” in some cases. If B and C require D to be on the $PATH, Nix can replace B with a wrapper script that adds a specific version of D to the $PATH and executes the underlying B program, and likewise for C. No copy of D needs to be on the global $PATH; the right version of D is always put on the $PATH “just in time” for execution of B or C, and the version of D for B does not interfere with the version of D for C. (A toy sketch of such a wrapper appears after this list.)
- Package repositories can contain few or only one version of each package (APT/apt-get). This reduces the number of candidates to search through. However, the community has to ensure that the repository contains a consistent set for every set of packages.
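Here is a toy sketch of the wrapper trick described in the Nix bullet above: put one specific version of D on $PATH just before executing the real B, without installing D globally. Nix itself generates small shell wrappers; this version is written in Python only to keep the lesson’s examples in one language, and the store path and program names are made-up placeholders.

```python
import os
import sys

# Hypothetical per-version install location of D (a placeholder, not a real path).
D_BIN = "/nix/store/abc123-D-1.2.3/bin"

# Prepend D's directory so that B finds exactly this version of D.
os.environ["PATH"] = D_BIN + os.pathsep + os.environ.get("PATH", "")

# Replace this wrapper process with the real, unwrapped B binary.
os.execvp("B-unwrapped", ["B-unwrapped"] + sys.argv[1:])
```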
Therefore, some package managers avoid package version resolution altogether, using one of the following strategies.
- Let the user solve dependency constraints themselves (e.g., Pip before version 20.3). This is only practical when the version constraints are loose; i.e., the newest version of A will work with any new-ish version of B.
- Only allow a single version of every package from a manually-maintained consistent set (e.g., Haskell Stackage).
- When Maven encounters a diamond dependency, either B or C, whichever is closer to the root, gets its preferred version of D and the other constraint gets ignored. The user often specifies the exact version of D themselves to ensure correctness.
Category | Package managers |
---|---|
Pkg mgrs with dep solv | Pip >= 20.3, Conda, Mamba, Spack, 0install |
Pkg mgrs with dep solv, limited repos15 | DNF, YUM, APT |
Pkg mgrs with no dep solv | Nix, Guix, Pacman, Maven, Pip < 20.3 |
Local or system-wide
Package managers either install to a local or a system-wide software environment.
System-wide software environment package managers are necessary for several reasons:
- Hardcoded paths: Some software uses hardcoded, unchangeable paths to root-owned directories, for example /usr/lib. Most software today does not have hardcoded paths, and so long as the source code is available and licensed for modifications, it isn’t unchangeable. Nevertheless, this issue was more important when package managers were designed.16
- System-wide effect: Some software has system-level effects or uses; e.g., the bootloader, the kernel, drivers, and the login manager are installed using a system-wide package manager.
- Save space: Historically, disk space was expensive, so system administrators (sysadmins) did not want users compiling the same package just for themselves. Rather, the sysadmins installed one copy for everyone to a global location upon request.
On the other hand, local environment package managers have the following benefits:
- Reduce dependency conflicts: Suppose project A has conflicting dependencies with project B (unsatisfiable together); if each has its own local environment, this is no bother. Local environments naturally have fewer packages than their system-wide union, minimizing the time needed to SAT-solve.
- Principle of least privilege: Normal users can manage software environments without superuser privilege.
Containers and virtual machines are one way of using system-wide package managers in a local environment.
Conclusion
TODO: APT isn’t really snapshottable We suggest using from-source package managers. If your application requires specific versions, local is better
This may not always work out in practice; if the from-source package manager includes the compiler version and CPU family in the package spec (the “key” for the cache), as Spack does (for valid performance and reproducibility reasons), the cache will often miss.↩︎
Special hacks allow multi-OS binaries, but these hacks are uncommon and fragile.↩︎
If the source-code does not use version-control, it could still use delta compression or block-based compression more efficiently than binary equivalents.↩︎
Package-manager-specific patches may be stored separately, but this is still much less storage cost than storing package-manager-specific binaries separately, since the patched sources are already delta-compressed (only the patch is stored) and usually small.↩︎
TODO: Find examples for other pkg mgrs and make into a details snippet↩︎
TODO: intra-lesson link↩︎
TODO: find their exact commitment↩︎
TODO: link to last lesson↩︎
TODO: How does Spack compile the compiler? What about Nix? What about Guix?↩︎
For example, Spack uses this total order, Java uses this one, Python uses this one, Maven uses this one.↩︎
Note, this can occur with three packages if the user wants to install B and C; in that case we could pretend the user wanted to install a “virtual package” A that depends on B and C, so the three-package case is reducible to the four-package case.↩︎
For a formal proof, see Dependency Solving: a Separate Concern in Component Evolution Management by Abate et al. in the Journal of Systems and Software 2012; for an informal proof, see Version SAT by Cox on their blog 2016↩︎
If P ≠ NP, then “fastest known” can be replaced with “fastest possible”. Computer scientists have been looking for fast algorithms for NP-complete problems for decades; so far, no luck. 88% of researchers polled (informally) responded that P ≠ NP (see Guest Column: The Third P =? NP Poll by Gasarch in SIGACT News Complexity Theory Column 100 (2019)).↩︎
Mathematically astute readers will note this should read “super-polynomial” not “exponential”, but I did promise to “oversimplify the details” for a general audience.↩︎
DNF, YUM, and APT technically do dependency solving, but their package repositories often have just one version of a package dependency. Usually, only a handful of repositories are enabled. These package managers can do dependency solves, but often not efficiently because they rarely encounter large ones.↩︎
TODO: Is everything really relocatable?↩︎
Content from Environment specifications
Last updated on 2025-06-24
Estimated time: 12 minutes
Overview
Questions
- TODO
Objectives
- TODO
Semantic versioning (also known as semver) is a convention for constructing version strings1. There are many rules, but the gist is that versions have three numbers separated by dots (e.g., 1.0.0). Releases that are backwards-incompatible increment the first number (called the major version), while releases that add new functionality in a backwards-compatible way increment the second number (called the minor version), and bug-fix releases increment the third number (called the patch version).
Semantic versioning is related to reproducibility because it helps us specify flexible operating conditions. If packages follow Semver strictly, all package constraints between A and B would be of the form “A a.b.c requires B greater than x.y.z but less than w.0.0” (indeed, poetry add foo adds a constraint on foo using the newest version as x.y.z and sets w, quite sensibly, to x + 1).
- x represents the oldest public interface that A can tolerate; prior versions would have the wrong interface.
- y represents the oldest minor version within major version x that has all the functionality that A uses. Prior versions would have the right interface but lack critical functionality.
- z represents the newest bug-fix release in x.y that A requires. Prior versions would have the right interface and functionality, but lack a bug-fix that A depends on.
- w is the oldest version of the public interface that breaks A.
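A minimal Python sketch of checking a constraint of this shape follows, assuming plain three-part version strings; the exact boundary conventions (inclusive versus exclusive) vary between tools.

```python
def parse(version):
    """Turn '1.4.2' into (1, 4, 2) so versions compare in the intended order."""
    return tuple(int(part) for part in version.split("."))

def satisfies(version, lower, upper_major):
    """True if lower <= version < upper_major.0.0 (a typical Semver-style range)."""
    return parse(lower) <= parse(version) < (upper_major, 0, 0)

print(satisfies("1.9.0", "1.4.2", 2))  # True: same major version, newer than the lower bound
print(satisfies("2.0.0", "1.4.2", 2))  # False: crossing the major version may break A
```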
Flexible operating conditions are easier for the reproducer to satisfy. They are also more likely to compose with the operating conditions of other experiments without conflicting.
However, Semver is difficult to follow strictly in practice because the package author may not know what is backwards-incompatible. Hyrum’s law2 states that “with a sufficient number of users of an API, all observable behaviors of your system will be depended on by somebody.”
A facetious example is given in the following XKCD comic:

The application developer thinks of their change as a bug fix, since it prevents the CPU from overheating when the space bar is pressed. However, this particular user thinks of the same change as backwards-incompatible, since it breaks their workflow that depends specifically on the CPU overheating when the space bar is pressed. A real-world example of a difference of opinion on what constitutes backwards-compatibility comes from Python Cryptography, when the library authors added a build-dependency on Rust. The library authors considered their change backwards-compatible, since the public API did not change; however, some sysadmins said this should have been a major change, since upgrading broke their systems.
Content from Containerization and virtualization
Last updated on 2025-06-24
Estimated time: 12 minutes
Overview
Questions
- TODO
Objectives
- To explain how containers and virtual machines may be used for reproducibility
Also consider architectures
Content from Engineering Reproducible Software
Last updated on 2025-06-24
Estimated time: 12 minutes
Overview
Questions
- TODO
Objectives
- TODO
Content from Package managers in practice
Last updated on 2025-06-24
Estimated time: 12 minutes
Overview
Questions
- TODO
Objectives
- TODO