This repo is a companion to the PyData DC 2018 presentation "Do Your Homework! Writing tests for Data Science and Stochastic Code" by David Waterman.
Clone the repo, create a new virtual environment, and then:
- Install the package requirements:
    pip install -r requirements.txt
- Install the package in editable mode:
    pip install -e .
- Run the tests:
    pytest
- Open the generated test report tests/test-logs/testreport.html in your browser of choice.
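If you have not written a test for stochastic code before, the following is a minimal sketch of the kind of test that pytest would collect and run. It is not taken from this repo: the function `sample_mean` and both test names are hypothetical, and the tolerance is an assumption chosen for illustration. It shows two common ideas: seeding the random number generator so a result is reproducible, and checking a statistical property within a tolerance instead of checking an exact value.

```python
import random
import statistics


def sample_mean(n, rng):
    """Hypothetical function under test: mean of n draws from Uniform(0, 1)."""
    return statistics.fmean(rng.random() for _ in range(n))


def test_sample_mean_is_reproducible():
    # Two generators built from the same seed produce the same draws,
    # so the two results must match exactly.
    first = sample_mean(1000, random.Random(42))
    second = sample_mean(1000, random.Random(42))
    assert first == second


def test_sample_mean_is_near_expected_value():
    # The mean of Uniform(0, 1) is 0.5, and the standard error of the
    # sample mean is sqrt(1/12) / sqrt(n), about 0.0029 for n = 10_000.
    # The tolerance of 0.02 (an assumed value) is roughly 7 standard errors.
    rng = random.Random(0)
    assert abs(sample_mean(10_000, rng) - 0.5) < 0.02
```

Saved in a file whose name starts with `test_`, these functions would be discovered by a plain `pytest` run. Passing the generator in as an argument, rather than relying on global random state, is one design choice that keeps each test independent of the others.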
https://docs.pytest.org/en/latest/goodpractices.html
https://github.com/ericmjl/data-testing-tutorial
http://engineering.pivotal.io/post/test-driven-development-for-data-science/
https://www.oreilly.com/library/view/thoughtful-machine-learning/9781449374075/ch01.html
Chapter 1 of Thoughtful Machine Learning by Matthew Kirk
https://www.oreilly.com/library/view/code-complete-second/0735619670/
At the presentation, a question was asked about whether time spent testing is a good investment. I didn't have any references offhand to support my affirmative response, but the first thing that came to mind afterward was the book "Code Complete" by Steve McConnell. It is an excellent book overall that I would recommend to anyone looking to improve their software development process, and Chapter 22 in particular addresses testing. Going back through the book, I was unable to find hard numbers on the cost savings provided by testing, but I did find some interesting and relevant data. For example, if a software defect is introduced during code construction, it is 10-25 times more costly to fix after release than to find and fix during the construction phase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.5798
This reference also provides some data on the value of testing. Its conclusion: "Members of the target development team can expect to spend up to 100% more time implementing unit tests in conjunction with the production code being written. Improvements of up to 267% fewer defects can be achieved through the test-during-coding processes."