Are you conducting academic research with a computational tool that produces a different outcome each time it runs? Then you're probably familiar with the idea of setting a 'seed' to instantiate the (pseudo-)random number generator responsible for this variance. Conventional wisdom (scant as it is) in the academic literature and on community forums (e.g. Stack Overflow) generally and unreservedly advocates setting a single seed so that the same result is generated regardless of who runs a script, or where the code which powers the analysis is run from (although the latter assumes identical system dependencies).
In this work, we push back against this suggestion, and in fact argue that it is the opposite of what we should be doing. While we fully appreciate that computational reproducibility is a critically important feature of the scientific record, it should in no way come at the expense of blindly assuming that the single, scalar estimate which comes out of our algorithm is invariant to the choice of the instantiating seed. Through a large number of simulations, empirically driven teaching examples, and high-profile replications, we describe and showcase what we believe to be a far more reasonable strategy in an era of not just high performance computing, but also vastly powerful personal computers: analysing and visualising seed variability for the sake of the scientific record.
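To give a flavour of what we mean, here is a minimal sketch (not one of the repository's scripts, and with purely illustrative toy parameters) of running the same stochastic analysis under many seeds and summarising the spread of the results, rather than pinning the record to a single seeded value:

```python
# Minimal sketch: re-run a toy stochastic analysis under many seeds and
# report the distribution of the resulting estimates.
import numpy as np

def noisy_estimate(seed, n=500):
    """Toy stochastic 'analysis': estimate a population mean from a noisy sample."""
    rng = np.random.default_rng(seed)
    sample = rng.normal(loc=1.0, scale=2.0, size=n)
    return sample.mean()

seeds = range(1000)  # any pre-specified list of seeds would do
estimates = np.array([noisy_estimate(s) for s in seeds])

print(f"Mean across seeds: {estimates.mean():.3f}")
print(f"Std. dev. across seeds: {estimates.std():.3f}")
print(f"2.5th-97.5th percentiles: {np.percentile(estimates, [2.5, 97.5])}")
```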
In `./src/` you can find a number of scripts that conduct our computational (re-)analysis. This includes a re-examination of Buffon's Needle problem, forecasting the price of Bitcoin with a Random Walk, a re-examination of existing Machine Learning work to emphasise the effect of k>1-fold cross-validation, and more general inferential problems (such as the Fragile Families Challenge, the use of `mvprobit` in Stata, and otherwise). In `./assets/` you can find a large list of pre-specified seeds (e.g. `./assets/seed_list.txt`) generated by `./src/seed_generator.py` through the use of `int.from_bytes(secrets.token_bytes(4), 'big')`, constrained to be under 2147483647 (the largest seed value accepted by `np.random`).
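As a rough sketch of that approach (the number of seeds and the output path below are placeholders; see `./src/seed_generator.py` for the actual script):

```python
# Illustrative sketch of the seed-generation approach described above.
import secrets

MAX_SEED = 2147483647  # 2**31 - 1, the cap described above
N_SEEDS = 1000         # assumed for illustration

seeds = []
while len(seeds) < N_SEEDS:
    candidate = int.from_bytes(secrets.token_bytes(4), 'big')
    if candidate < MAX_SEED:  # discard draws above the cap
        seeds.append(candidate)

# placeholder output path; the repository ships its list in ./assets/seed_list.txt
with open('seed_list.txt', 'w') as f:
    f.write('\n'.join(str(s) for s in seeds))
```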
The requirements for this project can be installed via the `requirements.txt` file (i.e. `pip install -r requirements.txt`). All of the visualisations, and a description of what is being done to create them (e.g. the `./figures/` outputs), can be found in a summarising notebook at `src/visualization_notebook.ipynb`.
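To illustrate how the pre-specified seeds might feed one of the analyses (here, a Buffon's Needle style estimate of pi), a hedged sketch is below; the function, figure filename, and plotting choices are illustrative rather than the repository's actual implementation, which lives in `./src/` and the notebook:

```python
# Sketch: run a Buffon's Needle estimate of pi under each pre-specified seed
# and plot the resulting seed-to-seed variability.
import numpy as np
import matplotlib.pyplot as plt

def buffon_pi(seed, n_needles=10_000, length=1.0, spacing=1.0):
    """Estimate pi by simulating needle drops under a given seed."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, spacing / 2, n_needles)    # distance from nearest line
    theta = rng.uniform(0, np.pi / 2, n_needles)  # needle angle
    crossings = np.sum(x <= (length / 2) * np.sin(theta))
    return (2 * length * n_needles) / (spacing * crossings)

with open('./assets/seed_list.txt') as f:
    seeds = [int(line) for line in f if line.strip()]

estimates = [buffon_pi(s) for s in seeds]

plt.hist(estimates, bins=50)
plt.axvline(np.pi, color='red', linestyle='--', label='true value of pi')
plt.xlabel('Estimate of pi')
plt.ylabel('Number of seeds')
plt.legend()
# assumes a ./figures/ directory exists; the filename is illustrative
plt.savefig('./figures/buffon_seed_variability.png', dpi=300)
```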
Most of the code is self-contained: it either generates simulated data or pulls data down from internet archives. In two specific cases, however, it is necessary to download the data:
- Data to replicate the results in the Fragile Families Challenge comes from the "Replication materials for Measuring the predictability of life outcomes using a scientific mass collaboration" site on the Harvard Dataverse, available here.
- Data from the Millennium Cohort Study which is necessary to replicate the paper comes from here, but requires a little pre-processing as per Dr. Orben's GitHub repo related to that work.
This work is free: you can redistribute it and/or modify it under the terms of the MIT license. The two datasets listed above come with their own licensing conditions and should be treated accordingly.
We are grateful for the extensive comments made by various people over the course of thinking about this work, not least the members of the Leverhulme Centre for Demographic Science.
If you're having issues with anything failing to run, or have comments about where seeds are applicable in your own workstreams, please don't hesitate to raise an issue or otherwise get in contact! Examples we are considering adding in future include:
- Exponential Random Graph Models
- Bayesian Factor Analysis (and other Bayesian frameworks more generally)
- A simple k-nearest neighbours (KNN) example