Reproducibility definition

Last updated on 2025-06-24

Estimated time: 12 minutes

Overview

Questions

  • How is reproducibility defined?
  • What are various kinds or degrees of reproducibility?
  • Why are Jupyter and Knitr notebooks not sufficient by themselves for reproducibility?

Objectives

  • To articulate the definition of reproducibility in the domain of computational science.
  • To assess the kind of reproducibility offered by tools you already know of.

Definitions

Reproducibility according to the ACM means: the ability to obtain a measurement with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials.

This definition derives from metrology, the science of measurement. We will specialize the terms “measurement”, “measurement procedure”, and “operating conditions” to the software context of computational science experiments (from here on, CSEs).

Measurement procedure (for CSEs): the application code and the user interactions (if any) required to run it.

Operating conditions (for CSEs): a set of conditions the computer system must satisfy in order to run a certain program. E.g., a POSIX environment, or GCC 12 installed at /usr/bin/gcc and built with particular flags.

For every CSE, there are presumably some operating conditions under which the measurement can be made. To keep our definition from being vacuous, “reproducibility” will require that all relevant operating conditions be documented (e.g., “README.md states you must have GCC 12 in /usr/bin/gcc”). Operating conditions can be eliminated by moving them into the measurement procedure (e.g., the program itself contains a copy of GCC 12). For the purposes of this lesson, the operating conditions are the “manual” steps that the user has to take in order to use the measurement procedure to make the measurement. One may over-specify operating conditions without changing the reproducibility status (e.g., one might say their software requires GCC 12 when GCC 13 would also work); it is quite difficult, and often not necessary, to know the minimal set of operating conditions, so in practice we usually work with a larger-than-necessary set.

Often on UNIX-like systems, the only relevant conditions are that certain objects be on certain “load paths” specified by environment variables: e.g., Python 3.11 on $PATH, NumPy 1.26 on $PYTHONPATH, and a BLAS library on $LIBRARY_PATH. In such cases, it doesn’t matter where those programs are on disk; the only relevant operating condition is that the environment variables point to compatible versions of those programs.
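As an illustration, the sketch below checks whether a system satisfies the example operating conditions above before the measurement procedure is run. It is not part of the lesson’s setup; the script, the specific version checks, and the choice of gcc as a stand-in compiler are assumptions for the sake of the example.

```python
import shutil
import sys

def check_operating_conditions():
    """Return a list of unmet operating conditions (illustrative checks only)."""
    problems = []

    # Python 3.11 must be the interpreter found on $PATH
    # (here approximated by the interpreter running this script).
    if sys.version_info[:2] != (3, 11):
        problems.append(f"expected Python 3.11, found {sys.version.split()[0]}")

    # NumPy 1.26 must be importable (i.e., on $PYTHONPATH or in site-packages).
    try:
        import numpy
        if not numpy.__version__.startswith("1.26"):
            problems.append(f"expected NumPy 1.26, found {numpy.__version__}")
    except ImportError:
        problems.append("NumPy is not importable")

    # Some C compiler must be discoverable on $PATH (stand-in for the GCC 12 example).
    if shutil.which("gcc") is None:
        problems.append("gcc not found on $PATH")

    return problems

if __name__ == "__main__":
    for problem in check_operating_conditions():
        print("unmet operating condition:", problem)
```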

Measurement (for CSEs):

  • Crash-freedom: program produces result without crashing.
  • Bitwise equivalence: output files and streams are identical.
  • Statistical equivalence: overlapping confidence intervals, etc.
  • Inferential equivalence: whether the inference is supported by the output.
  • Others: domain-specific measurements/equivalences (e.g., XML equivalence ignores the order of attributes)

In general, it is difficult to find a measurement that is both easy to assess and scientifically meaningful.

| Measurement | Easy to assess | Scientifically meaningful |
| --- | --- | --- |
| Crash-freedom | Yes; does it crash? | Too lenient; the program could run without crashing yet give a completely opposite result |
| Bitwise equivalence | Yes | Too strict; could be off by one decimal point |
| Statistical equivalence | Maybe; need to know the output format | Maybe; need to know which statistics can be off |
| Inferential equivalence | No; need domain experts to argue about it | Yes |
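To make the first three rows concrete, here is a minimal sketch of how one could assess them for a program that writes its result to a file. The command, file paths, and tolerance are illustrative assumptions, not part of the ACM definition.

```python
import hashlib
import statistics
import subprocess

def crash_free(cmd):
    """Crash-freedom: the program exits without crashing (exit code 0)."""
    return subprocess.run(cmd).returncode == 0

def bitwise_equivalent(path_a, path_b):
    """Bitwise equivalence: the two output files are byte-for-byte identical."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    return digest(path_a) == digest(path_b)

def statistically_equivalent(samples_a, samples_b, tolerance=0.05):
    """A crude stand-in for statistical equivalence:
    sample means agree within a chosen tolerance."""
    return abs(statistics.mean(samples_a) - statistics.mean(samples_b)) <= tolerance

# e.g. crash_free(["./script"]); bitwise_equivalent("run1/out.csv", "run2/out.csv")
```

Inferential equivalence has no such mechanical check; as the table says, it requires domain experts to argue about whether the outputs support the same conclusion.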

We explicitly define reproducibility because not everyone uses the ACM definition. Reproducible Builds and Google’s Building Secure and Reliable Systems use bitwise equivalence only. Operating on different definitions without realizing it leads to disagreements. Defining reproducibility with respect to a measurement and operating conditions is more useful; one can then refer to different kinds and degrees of reproducibility under different conditions.

Composition of measurement procedures: The outcome of one measurement may be the input to another measurement procedure. This can happen in CSEs as well as in physical experiments. In a physical experiment, one may use a device to calibrate (measure) another device, and then use that other device to measure some scientific phenomenon. Likewise, in a CSE, the output of a compilation may be used as the input to another CSE. One can measure a number of relevant properties of the result of a software compilation.

| Compilation measurement | Definition |
| --- | --- |
| Source equivalence | The compilation used exactly this set of source code as input |
| Behavioral equivalence | The resulting binary has the same behavior as some other one |
| Bitwise equivalence | As before, the binary is exactly the same as some other one |

E.g., suppose one runs gcc main.c on two systems, and one system uses a different version of unistd.h, which is #included by main.c. The process (running gcc main.c) does not reproduce source-equivalent binaries, but it might still reproduce behavior-equivalent or even bitwise-equivalent binaries (depending on how different the two versions of unistd.h are).
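A rough sketch of checking two of these compilation measurements for the gcc main.c example, assuming gcc is on $PATH; the helper names are hypothetical. Hashing the fully preprocessed source (gcc -E) captures whichever unistd.h this system actually provides, while hashing the compiled binary checks bitwise equivalence.

```python
import hashlib
import subprocess

def source_digest(c_file):
    """Source equivalence: hash the preprocessed source, which pulls in
    whatever unistd.h (and other headers) this system provides."""
    preprocessed = subprocess.run(
        ["gcc", "-E", c_file], capture_output=True, text=True, check=True
    ).stdout
    return hashlib.sha256(preprocessed.encode()).hexdigest()

def binary_digest(c_file, out="a.out"):
    """Bitwise equivalence: hash the compiled binary itself."""
    subprocess.run(["gcc", c_file, "-o", out], check=True)
    with open(out, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Compare source_digest("main.c") and binary_digest("main.c") across the two systems.
```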

Related terms

Replicability and repeatability are similar to reproducibility; the only differences are whether the team is the same as or different from the original, and whether the measurement procedure is the same as or different from the original.

| Term | Team | Measurement procedure | Example |
| --- | --- | --- | --- |
| Repeatability | Same | Same | I can run ./script twice and get the same result |
| Reproducibility | Different | Same | ./script on my machine and ./script on your machine give the same result |
| Replicability | Different | Different | ./script and a new script that you wrote for the same task, run on your machine, give the same result |

Replicability is of course the goal of scientific experiments: the measurement can be made in different ways and still give consistent results. However, replicability involves re-doing some or all of the work, so it is expensive to pursue in practice. Therefore, we examine repeatability and reproducibility instead.

Computational provenance is a record of the process by which a particular computational artifact was generated (retrospective) or could potentially be generated (prospective)1. The provenance of a final artifact may recursively include the provenance of the artifacts used by the final process: the artifacts used to generate the artifacts that were used to generate (and so on) the final artifact.

Provenance is related to reproducibility because sufficiently detailed provenance (prospective or retrospective) may allow other users to reproduce the process within some equivalence.
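As a sketch of what a retrospective provenance record might contain, the function below runs a command and records its inputs, outputs, and some of the operating conditions. The function and field names are illustrative assumptions, not a standard provenance format.

```python
import platform
import subprocess
import sys
from datetime import datetime, timezone

def run_with_provenance(cmd, inputs, outputs):
    """Run `cmd` and return a retrospective provenance record for that process."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(cmd)
    return {
        "command": cmd,                  # the process that was run
        "inputs": inputs,                # artifacts the process read
        "outputs": outputs,              # artifacts the process produced
        "returncode": result.returncode,
        "started": started,
        "machine": platform.platform(),  # part of the operating conditions
        "python": sys.version.split()[0],
    }

# e.g. run_with_provenance(["./script"], inputs=["data.csv"], outputs=["result.csv"])
```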

Although there are many definitions of computational workflows, the most relevant one here is: a program written in a language that explicitly represents a loosely-coupled, coarse-grained dataflow graph. “Loosely-coupled” and “coarse-grained” have some wiggle room, but usually each node represents an isolated process or container and each edge is a file or parameter; GNU Make and Snakemake are examples. Workflows improve reproducibility by automating manual commands, thereby documenting the measurement procedure for others. Workflows are often the alternative for projects that previously ran a pile of scripts in a specific order known only to the developers. Workflows can be seen as a form of prospective provenance, and workflow engines are a natural place to emit prospective and retrospective provenance. Workflows are not necessary for reproducibility, so long as the relevant measurement procedure is otherwise documented or automated, and they are sufficient only when the workflow language is sufficiently detailed (i.e., detailed enough to capture all of the relevant operating conditions).
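The toy sketch below illustrates the dataflow idea behind engines like GNU Make or Snakemake without being one: each node is a whole process, each edge is a file, and targets are built dependency-first. The file names and commands are hypothetical, chosen only to show the structure.

```python
import subprocess

# Each rule maps an output file to (input files, command that produces the output).
# The scripts and data files named here are purely illustrative.
rules = {
    "clean.csv":  (["data.csv"],  ["python", "clean.py", "data.csv", "clean.csv"]),
    "result.txt": (["clean.csv"], ["python", "analyze.py", "clean.csv", "result.txt"]),
}

def build(target, built=None):
    """Run the commands needed to produce `target`, dependencies first."""
    built = set() if built is None else built
    if target in built or target not in rules:
        return  # a raw input file, or already built in this run
    inputs, command = rules[target]
    for dependency in inputs:
        build(dependency, built)
    subprocess.run(command, check=True)
    built.add(target)

build("result.txt")
```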

Literate programming is a medium in which a program’s source code is embedded in an explanation of how it works2. Systems that facilitate literate programming (from here on, literate programming systems) often also permit the user to execute snippets of their code and embed the results automatically in a report. Programs written in literate programming systems with cell execution, as used in data science, are sometimes called “notebooks”, due to their resemblance to a lab notebook. Jupyter, Knitr, and Quarto are examples of literate programming systems with cell execution. These are often discussed in the context of reproducibility (see 1, 2, 3, 4, 5). Like workflows, notebooks automate procedures that may otherwise be manual (thereby documenting them for others); they can even contain the output of the code, so readers can verify whether their result is similar to the authors’. Also like workflows, notebooks are not sufficient by themselves for reproducibility, although they can be a valuable part of a reproducible artifact; there are often necessary operating conditions (must have NumPy version 1.26; must have data.csv) that are documented outside the notebook or not documented at all. And since notebooks support cell execution triggered by a UI element, the cells can be executed in an arbitrary manual order, which becomes an undocumented part of the measurement procedure.

Containers and virtual machines (VMs) are related to reproducibility. These will be discussed in depth in a future episode.