Tips and Tools for Reproducible Projects with Python

Plan for today:

project organization
using virtual environments
creating modules
building packages
setting up continuous integration
literate programming

Directory Structure

How should we structure our data science project repositories?

Example:

.
+-- data
|   +-- raw
|   +-- processed
|
+-- src
|   +-- PythonModules
|   +-- tests
|   
+-- notebooks
|   +-- exploratory
|   +-- expositionary
|
+-- references
|   +-- papers
|   +-- tutorials
|
+-- results 
+-- README.md
+-- LICENSE.txt

No one solution: adjust as the project evolves.
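
The example layout above can be created in one go with a short script. This is only a sketch: the folder names are copied from the example tree, while the function name make_skeleton and the project name my_project are made up for illustration.

```python
import tempfile
from pathlib import Path

# Folder names copied from the example layout; adjust to your project.
LAYOUT = [
    "data/raw",
    "data/processed",
    "src/PythonModules",
    "src/tests",
    "notebooks/exploratory",
    "notebooks/expositionary",
    "references/papers",
    "references/tutorials",
    "results",
]


def make_skeleton(root):
    """Create the example folders under root; existing folders are left alone."""
    root = Path(root)
    for relative in LAYOUT:
        (root / relative).mkdir(parents=True, exist_ok=True)
    # Empty placeholders for the two top-level files.
    for name in ("README.md", "LICENSE.txt"):
        (root / name).touch(exist_ok=True)
    return root


# Try it in a throwaway folder (hypothetical project name).
demo_root = make_skeleton(Path(tempfile.mkdtemp()) / "my_project")
print(sorted(p.name for p in demo_root.iterdir()))
```

Because of exist_ok=True the script is safe to rerun as the project evolves and new folders are added to the list.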

Comprehensive Project Templates:

Data Science Cookiecutter - Data Science Project Template
Shablona - Python Package Template

Exercise:

Reorganize the Geopandas Tutorial
- fork https://github.com/valentina-s/mobility-index
- git clone fork_name
- move all notebooks to a notebook folder
- create a mobility_index folder which will store all the code.
- create a tests folder under mobility_index which stores the tests

Distributions & Package Managers

Conda vs pip

What is Conda?

Anaconda is a Python distribution that differs slightly from the default one and comes with its own package manager (conda).
Conda packages are precompiled binary archives (.tar.bz2 or .conda files), built separately for each operating system, so they are fast to install. (Installing NumPy from source means compiling C code, which takes a long time.) Wheel files (.whl) are the precompiled format used by pip, not by conda. Miniconda is even faster to install because it is bare bones: better for deployment, since you have only what you need.

What is pip?

Package manager for Python. It installs packages from PyPI. Some packages on PyPI are not available through conda.

pip install vs conda install

pip freeze

conda list

There are also additional conda packages on conda-forge. You can install them by

conda install -c conda-forge package_name

and you can build your own.

Virtual Environments

What is a Python virtual environment?

A folder that holds a Python executable and its own set of installed libraries (or links to them). Virtual environments take up disk space!

Pure Python: virtualenv

If you are using the Anaconda distribution, create environments with:

	conda create --name newEnv python=2 extra_packages

View environments:

	conda env list

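On Unix (with conda 4.4 and later, use conda activate newEnv and conda deactivate instead, on all platforms):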

source activate newEnv
do stuff
conda install more_packages
source deactivate

On Windows:

activate newEnv
do stuff
conda install more_packages
deactivate

Saving environments:

	conda env export -f exported_env.yml

Load an environment from .yml file:

	conda env create -f exported_env.yml

You can do the same thing with pip:

	pip freeze > requirements.txt
	pip install -r requirements.txt

The resulting list of packages is long (it includes every dependency) and quite specific (each version is pinned).

Sometimes you just want to list the packages you need directly (without specifying versions). You can write the following requirements.txt file by hand:

	geopandas
	shapely
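
A quick way to see which of the listed packages are missing from the active environment is to ask Python's own package metadata. This is a minimal sketch, assuming the file holds bare package names as in the example above (no version specifiers); the function name missing_requirements is made up for illustration.

```python
from importlib import metadata


def missing_requirements(lines):
    """Return the names from requirements-style lines that are not installed."""
    missing = []
    for line in lines:
        name = line.split("#")[0].strip()  # drop comments and whitespace
        if not name:
            continue
        try:
            metadata.version(name)  # raises if the package is not installed
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# The two names from the hand-written requirements.txt above.
print(missing_requirements(["geopandas", "shapely"]))
```

An empty list means everything is installed in the environment that runs the script, which is a handy check after activating a new environment.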

Make sure to install Jupyter within the virtual environment, so that notebooks run with that environment's packages.

More Virtualization

Docker
Vagrant
AWS public AMIs
friend's laptop

Cross-platform Directory Paths

Make paths independent of the platform and relative to the project's directory structure.

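	import os
	
	# current path
	current_path = os.getcwd()
	
	# join paths for Windows and Unix
	code_path = os.path.join(current_path, "src")
	
	# make sure paths/files exist before reading
	os.path.exists(code_path)   # True if the path exists (file or folder)
	os.path.isfile(code_path)   # True only if the path is an existing regular file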
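
The same idea can be written with the standard library's pathlib module, which some people find easier to read than nested os.path.join calls. This is a sketch of an alternative, not a replacement for the os-based snippet; the file name example.csv is hypothetical.

```python
from pathlib import Path

# Start from the current working directory, as in the os-based snippet.
current_path = Path.cwd()

# The / operator joins path components with the right separator for the OS.
code_path = current_path / "src"
data_file = current_path / "data" / "raw" / "example.csv"  # hypothetical file name

# Check before reading; both calls return False when the path is missing.
print(code_path.exists())
print(data_file.is_file())
```

Path objects can be passed directly to open() and to most libraries that accept file names.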

Modules & Packages

move functions from notebooks to a module
paths for modules

reloading modules

python 2:
```
 	reload(module_name)
```

python 3:

 	# imp is deprecated and was removed in Python 3.12; importlib provides reload
 	from importlib import reload
 	reload(module_name)
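
To see what reloading does, the sketch below writes a throwaway module to a temporary folder, imports it, edits the file on disk, and reloads it. The module name scratch_module and its greet function are made up for the demonstration; in a real project the module would be your own code under src.

```python
import sys
import tempfile
from importlib import reload
from pathlib import Path

# Write a throwaway module (hypothetical name) into a temporary folder.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "scratch_module.py"
module_file.write_text("def greet():\n    return 'hello'\n")

# Make the folder importable, then import the module as usual.
sys.path.insert(0, str(module_dir))
import scratch_module

first = scratch_module.greet()

# Edit the source on disk, as you would in an editor while a notebook is running.
module_file.write_text("def greet():\n    return 'hello again, after the edit'\n")

# Without reload, the running session would keep the old definition.
reload(scratch_module)
second = scratch_module.greet()

print(first, "->", second)
```

Note that names pulled in with from scratch_module import greet are not refreshed by reload; call functions through the module name if you plan to reload it.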

install module as a package
- create a setup.py file
- run the setup.py file
```
 	python setup.py install
```
  and you will be able to import the package from anywhere!
submodules
- put __init__.py in every folder
git submodules - add external github repos to your github project
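
The note about putting an __init__.py in every folder can be tried out without touching a real project. The sketch below builds a tiny package in a temporary folder and imports a function from a nested module. All names (demo_package, utils, arrays, double) are hypothetical, and adding the parent folder to sys.path only stands in for a proper install.

```python
import sys
import tempfile
from pathlib import Path

# Build a tiny package on disk; all names here are hypothetical.
root = Path(tempfile.mkdtemp())
package = root / "demo_package"
subpackage = package / "utils"
subpackage.mkdir(parents=True)

# An __init__.py in each folder marks it as a regular package.
(package / "__init__.py").write_text("")
(subpackage / "__init__.py").write_text("")
(subpackage / "arrays.py").write_text("def double(x):\n    return 2 * x\n")

# Putting the parent folder on sys.path stands in for installing the package.
sys.path.insert(0, str(root))
from demo_package.utils import arrays

print(arrays.double(21))
```

Once the package is installed, the sys.path line is no longer needed and the same import works from any folder.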

Testing

Locally

nose

 	pip install nose

For each function in library.py write a test function:

 +-- src
 |   +-- library.py
 |   +-- tests
 |       +-- test_function1.py
 |       +-- test_function2.py

Use the numpy.testing module.

Example:

ArraySum.py:

 	def ArraySumFunction(array1, array2):
 	    # function which sums two arrays
 	    return array1 + array2

testArraySum.py:

 import numpy as np
 from numpy import testing as npt
 import ArraySum

 def test_ArraySumFunction():
 	# testing ArraySum function
 	array1 = 2*np.ones(100)
 	array2 = np.ones(100)
 	res = ArraySum.ArraySumFunction(array1,array2)
 	npt.assert_equal(res, 3*np.ones(100))
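
The test above checks exact equality, which works for whole numbers. Two further cases are worth a test: floating-point results, where a tolerance is safer, and invalid input. The sketch below repeats ArraySumFunction so that it runs on its own; the two test names are made up, and both helpers come from numpy.testing.

```python
import numpy as np
from numpy import testing as npt


def ArraySumFunction(array1, array2):
    # same function as in ArraySum.py, repeated so the sketch runs on its own
    return array1 + array2


def test_ArraySumFunction_floats():
    # 0.1 + 0.2 is not exactly 0.3 in floating point, so compare with a tolerance
    res = ArraySumFunction(0.1 * np.ones(5), 0.2 * np.ones(5))
    npt.assert_allclose(res, 0.3 * np.ones(5))


def test_ArraySumFunction_shape_mismatch():
    # arrays whose shapes cannot be broadcast together should raise an error
    npt.assert_raises(ValueError, ArraySumFunction, np.ones(3), np.ones(4))
```

Saved in a file whose name starts with test, these functions are picked up by the test runner along with the original one.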

Run the tests:

 	nosetests

In practice, we will most probably forget to run nosetests after every change we make in the code. Luckily, we can run the tests automatically using continuous integration.

Remotely:
- Travis-CI (free for public repos)
  - specified by a .travis.yml file
- AppVeyor (for Windows)
- CircleCI
- Wercker (based on Docker containers)
- Jenkins - need to configure it

Exercise: let's set up Travis-CI for the Mobility Index project.
- log in to Travis-CI with your GitHub account
- search for the repository you want to activate with Travis and switch it on
- write a .travis.yml specifying the build requirements and tests

Types of tests:
- unit testing
- integration testing
- regression testing
- functional testing

Test Coverage - Coveralls

Exercise (extra): explore how you can set up an automatic coverage check.

Editors

PyCharm - integration with GitHub
Atom - coloring in GitHub (extra packages)
JupyterLab - web based, so it can run on a server
Spyder - Matlab-like IDE

Linters
- for PEP8 style: pycodestyle (formerly pep8)
- for errors: pyflakes
- for both: flake8

Documentation

Nbconvert - to pdf, to html
Reveal.js: Jupyter notebook -> slides
css styles for notebook
Sphinx, readthedocs, ... (automatically generate documentation, integrate with CI)
gh-pages - project website based on Jekyll
Binder - free sharing of Jupyter notebooks hosted on GitHub
JupyterHub + Kubernetes - sharing reliably with many people
CoCalc (formerly SageMathCloud)

Extra Resources

The Hitchhiker's Guide to Packaging

aescay/mobility-index