Reproducibility
If you’ve used GerryChain to do some analysis or research, you may want to ensure that your analysis is completely repeatable by anyone else on their own computer. This guide will walk you through the steps required to make that possible.
Make your chains speedily replayable
It is sometimes desirable to allow others to reproduce or “replay” your chain runs step
by step. In such circumstances, we recommend using pcompress, which stores your MCMC
chain runs efficiently in a highly compressed format that pcompress can quickly read
back in at a later date. To set up pcompress, you first need to install Cargo. Then,
you can install pcompress by running cargo install pcompress and
pip install pcompress in your terminal.
To use pcompress, you can wrap your MarkovChain instances with Record and
pass along the file name you want to save your chain as. For example, this will save
your chain run as saved-run.chain:
from gerrychain import MarkovChain
from pcompress import Record

chain = MarkovChain(
    # chain setup here
)

for partition in Record(chain, "saved-run.chain"):
    # normal chain stuff here
    pass
Then, if you want to replay your chain run, you can select the same filename and pass along the graph that was used to generate the chain, along with any updaters that are needed:
from pcompress import Replay

for partition in Replay(graph, "saved-run.chain", updaters=my_updaters):
    # normal chain stuff here
    pass
The two code samples above produce equivalent chain runs, up to reordering: each step in the replayed chain run matches the corresponding step in the recorded chain run. Furthermore, replaying is faster than running the original chain, and it is compatible across future and past releases of GerryChain.
Use the same versions of all of your dependencies
You will want to make sure that anyone who tries to repeat your analysis by running your code will have the exact same versions of all of the software and packages that you use, including the same version of Python.
The easiest way to do this is to use conda to manage all of your dependencies.
You can use conda to export an environment.yml file that anyone can use to replicate your
environment by running the command conda env create -f environment.yml. For instructions on
how to do this, see Sharing your environment and Creating an environment from an environment.yml file
in the conda documentation.
If you’ve published your code on GitHub, it is a good idea to include your environment.yml
file in the root folder of your code repository.
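As a lightweight complement to environment.yml (not a replacement for it), you can also log the versions that were actually in use when a run was made. The following is a minimal sketch using only the Python standard library; the helper name snapshot_versions and the package list are illustrative assumptions, so substitute the packages your own analysis imports.

```python
import json
import platform
from importlib import metadata


def snapshot_versions(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


# Illustrative package list; replace it with your own dependencies.
snapshot = snapshot_versions(["gerrychain", "networkx"])
print(json.dumps(snapshot, indent=2))
```

Saving this output next to your results lets a reader confirm that their environment matches yours before they compare numbers.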
Making Your Environment Reproducible
If you are working on a project in which you would like to ensure that particular runs are reproducible, it is necessary to invoke
MacOS/Linux:
    export PYTHONHASHSEED=0
Windows:
    PowerShell
        $env:PYTHONHASHSEED=0
    Command Prompt
        set PYTHONHASHSEED=0
before running your code. This ensures that the hash seed is deterministic, which is important for replicating spanning trees across your runs. If you would prefer not to do this every time, you can modify the activation script for your virtual environment. Again, the steps differ depending on your operating system:
MacOS/Linux: Open the file .venv/bin/activate located in your working directory using your favorite text editor and add the line export PYTHONHASHSEED=0 after the export PATH command. So you should see something like:
    _OLD_VIRTUAL_PATH="$PATH"
    PATH="$VIRTUAL_ENV/bin:$PATH"
    export PATH
    export PYTHONHASHSEED=0
Then, verify that the hash seed is set to 0 in your Python environment by running python from the command line and typing import os; print(os.environ['PYTHONHASHSEED']).
Windows: To be safe, you will need to modify 3 files within your virtual environment:
.venv\Scripts\activate: Add the line export PYTHONHASHSEED=0 after the export PATH command. So you should see something like:
    _OLD_VIRTUAL_PATH="$PATH"
    PATH="$VIRTUAL_ENV/Scripts:$PATH"
    export PATH
    export PYTHONHASHSEED=0
.venv\Scripts\activate.bat: Add the line set PYTHONHASHSEED=0 to the end of the file. So you should see something like:
    if defined _OLD_VIRTUAL_PATH set PATH=%_OLD_VIRTUAL_PATH%
    if not defined _OLD_VIRTUAL_PATH set _OLD_VIRTUAL_PATH=%PATH%
    set PATH=%VIRTUAL_ENV%\Scripts;%PATH%
    rem set VIRTUAL_ENV_PROMPT=(.venv)
    set PYTHONHASHSEED=0
.venv\Scripts\Activate.ps1: Add the line $env:PYTHONHASHSEED=0 to the end of the file, before the signature block. So you should see something like:
    # Add the venv to the PATH
    Copy-Item -Path Env:PATH -Destination Env:_OLD_VIRTUAL_PATH
    $Env:PATH = "$VenvExecDir$([System.IO.Path]::PathSeparator)$Env:PATH"
    $env:PYTHONHASHSEED=0
    # SIG # Begin signature block
After you have made these changes, verify that the hash seed is set to 0 in your
Python environment by running python from the command line and typing
import os; print(os.environ['PYTHONHASHSEED']) in the Python prompt.
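To see what fixing the hash seed does, it can help to compare string hashes across separate interpreter processes. The sketch below is only an illustration: the helper name string_hash_in_fresh_interpreter is hypothetical, and the example does not exercise GerryChain's spanning tree code. It starts two fresh Python processes with PYTHONHASHSEED=0 and checks that both compute the same hash for the same string.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed):
    """Start a new Python process with PYTHONHASHSEED set and return hash('gerrychain')."""
    # The variable must be in the environment before the interpreter starts.
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('gerrychain'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


first = string_hash_in_fresh_interpreter("0")
second = string_hash_in_fresh_interpreter("0")
print(first == second)  # True: both processes used the same fixed hash seed
```

Note that the variable is read when the interpreter starts, so assigning os.environ['PYTHONHASHSEED'] inside an already running script does not change that script's hashing.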
A Note on Jupyter
If you are using a Jupyter notebook, you will need to make sure that you have
installed the ipykernel package in your virtual environment as well as
either notebook (the classic Jupyter Notebook) or jupyterlab. To install these packages, run
pip install <package-name> from the command line. Then, to use the virtual
Python environment in your Jupyter notebook, you need to invoke
jupyter notebook
or
jupyter lab
from the command line of your working directory while your virtual environment is activated. This will open Jupyter in your default browser. You may then check that the hash seed is set to 0 by running the following code in a cell of your notebook:
import os
print(os.environ['PYTHONHASHSEED'])
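The check above raises a KeyError when the variable is not set at all, which happens if the notebook server was started from a shell where the environment was not activated. A slightly more forgiving variant, sketched here with a hypothetical helper name hash_seed_is_fixed, reports the problem instead of failing:

```python
import os


def hash_seed_is_fixed(environ=os.environ, expected="0"):
    """Return True if PYTHONHASHSEED is present in environ and equals expected."""
    return environ.get("PYTHONHASHSEED") == expected


if hash_seed_is_fixed():
    print("PYTHONHASHSEED is set to 0.")
else:
    print("PYTHONHASHSEED is not set to 0; restart Jupyter from the activated environment.")
```

Placing a cell like this at the top of a notebook makes a missing setting visible before any chain is run.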
Of course, once this is all done, it would be a good idea to save the random seed that you used somewhere so that others may replicate your work in the future.
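One way to keep track of the seed is to set it in a single place and write it out with your results. The sketch below assumes that the randomness in your analysis comes from Python's built-in random module; check how your version of GerryChain seeds its random number generator before relying on this. The seed value and the helper name describe_run_settings are illustrative assumptions.

```python
import json
import os
import random

SEED = 2024  # hypothetical value; choose your own and publish it with your results


def describe_run_settings(seed):
    """Return a JSON string recording the random seed and the hash seed in use."""
    settings = {
        "random_seed": seed,
        "PYTHONHASHSEED": os.environ.get("PYTHONHASHSEED"),
    }
    return json.dumps(settings, indent=2, sort_keys=True)


# Seeding twice with the same value yields the same sequence of draws.
random.seed(SEED)
first_draws = [random.random() for _ in range(3)]
random.seed(SEED)
second_draws = [random.random() for _ in range(3)]
print(first_draws == second_draws)  # True

print(describe_run_settings(SEED))
```

Writing the printed settings to a file alongside your output gives others both values they need to repeat the run.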