Boilerplate for ML projects in Python, designed for easy use in both development and production.
The goal of this project structure is to establish recommended practices for developing and deploying ML projects that have an active experimentation component.
All code should be designed from the very start to be shared, among team members and beyond; in combination with version control, this provides an excellent mechanism for documenting changes made to the project over time.
Building codebases that allow for easy deployment of code to a production environment is very difficult. Projects often find themselves pulled between difficult extremes:
- Focus on speedy local development, making production deployments difficult and unpredictable, and hence less frequent. This is especially problematic for Data Science, where customer input is crucial to the development process: frequent and reliable deployments are necessary to ensure the highest-quality feedback is coming back to the development team. When projects are eventually migrated into production, it is done by another team, which means the creators lose ownership of their product.
- Some teams prefer environments that manage the entire deployment process for the user. However, this can be overly restrictive for fast-paced teams. Managed environments are often feature-limited, and extending them requires either extensive work or a feature request to the team managing the environment. Complex projects that need robust deployment pipelines or have unique constraints often cannot advance far within these tools. Additionally, these tools are attempting to compete with a gigantic ecosystem of local development tools (IDEs, testing environments, custom Dockerized workflows) that have significantly larger teams backing each component.
- Finally, others tend to work in copies of production that lack the feature offerings of a local development environment: for example, testing ETL pipelines by directly editing jobs in AWS Glue rather than testing Spark code locally, or testing a serverless Lambda by deploying it and then hitting endpoints. These testing patterns tend to slow down the development process.
This boilerplate is intended to provide some practices that can make dev-to-prod deployment easier. Additionally, the boilerplate provides a layer of standardization among projects, thus reducing complexity when supporting deployed applications.
The tools and tips provided here are base tools that can be extended to build more complex workflows. For example, this workflow could be used to deploy projects onto a Databricks cluster via a Docker image, to serve one or more Lambdas, or to deploy to Kubernetes. The goal isn't to prescribe production workflows, but to eliminate the constraints and restrictions that arise in dev environments, making our dev and prod environments as similar as possible.
Clone this repository with:

```sh
git clone git@github.com:chadac/python-ml-template.git
cd python-ml-template
```
Create a new Python virtualenv with pyenv, and set the default Python environment for this project directory:

```sh
pyenv install 3.8.3
pyenv virtualenv 3.8.3 python-ml-template
pyenv local python-ml-template
```
Finally, install this project's dependencies:

```sh
pip install poetry
poetry install
```
You can test that it has succeeded with:

```sh
python scripts/say-hello-world.py
```
pyenv is a utility for managing multiple Python versions (and virtualenvs!) on one system. Depending on the system, managing multiple versions of Python can be difficult. This tool simply moves the Python versions into your user folder (so your changes don't conflict with others on the same system) and allows seamless swapping between versions.

It also provides a great utility for managing virtualenvs: `pyenv virtualenv <python-version> <virtualenv-name>` creates a new virtualenv with the given name, and `pyenv local <virtualenv-name>` sets up your terminal to automatically activate this virtualenv when entering whatever directory the command was run in.

To delete an existing virtual environment, simply run `pyenv virtualenv-delete <virtualenv-name>`.
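Putting those commands together, the full virtualenv lifecycle for this project looks like the following (the version and environment name match the setup steps above):

```sh
# install an interpreter and create a named virtualenv from it
pyenv install 3.8.3
pyenv virtualenv 3.8.3 python-ml-template

# auto-activate that virtualenv whenever you enter this directory
pyenv local python-ml-template

# later, remove the virtualenv once it is no longer needed
pyenv virtualenv-delete python-ml-template
```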
Poetry acts as an extra layer on top of `pip`, giving you the ability to manage your project dependencies while also tracking those dependencies under version control, within the `pyproject.toml` file. We have found this package significantly easier to work with than manually managing a `requirements.txt` file or using Pipenv, which tends to be slow when resolving dependencies. Poetry also integrates nicely with custom containers: simply add a `poetry install` command to the Dockerfile. Finally, Poetry provides the ability to install optional dependencies by running `poetry install -E <name-of-extras>` (note: extras are defined under `[tool.poetry.extras]` in the `pyproject.toml` file). This becomes handy when working with multiple libraries, especially in a development environment; the extras flag can also be passed for specific tasks such as unit tests. Optional dependencies are defined in the `pyproject.toml` with an `optional = true` parameter set for each library.
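As a minimal sketch of how extras fit together in `pyproject.toml` (the `torch` dependency and the `training` extra name are illustrative, not part of this template):

```toml
[tool.poetry.dependencies]
python = "^3.8"
# optional = true keeps the library out of a plain `poetry install`
torch = { version = "^1.5", optional = true }

[tool.poetry.extras]
# installed on demand with `poetry install -E training`
training = ["torch"]
```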
autopep8 provides a convenient way to automatically format Python code to conform to the PEP 8 style standard. This additional formatting step helps ensure uniformity across the Data Science team with regard to coding practices and standards.
flake8 is an easy-to-use Python library for identifying syntactic and stylistic problems in your source code. We recommend running `flake8` to catch these errors up front, as an additional mechanism for mitigating potential errors before they reach a production environment.
```sh
autopep8 --in-place --aggressive --recursive <path>
flake8 <path>
```
Note: please refer to the autopep8 documentation for a list of optional arguments.
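As a quick, hypothetical illustration (not code from this repository), running `autopep8 --in-place --aggressive` on a file containing PEP 8 violations rewrites it in place:

```python
# before: multiple statements per line, inconsistent whitespace
x=1;y=[1 ,2]

# after autopep8: one statement per line, standard spacing
x = 1
y = [1, 2]
```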
pytest is a testing framework for Python. Writing and maintaining tests is not an easy task, but pytest helps make testing code more productive and less painful. To get started running your tests: first, the testing scripts reside in the `tests` directory within the project root; second, `pytest` expects file names to begin with `test_` or end with `_test.py`. Here, we've adopted the naming convention `test_<name-of-task>.py`.
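For example, a minimal test for the hello-world utility in this template might look like the sketch below; it assumes `say_hello_world` returns a greeting string, so adjust the assertion to the function's actual behavior:

```python
# tests/test_say_hello_world.py -- illustrative sketch; assumes the
# function returns a greeting string rather than printing one.
from python_ml_template.utils import say_hello_world


def test_say_hello_world():
    assert "hello" in say_hello_world().lower()
```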
pytest-cov is a convenient plugin for producing coverage results from your unit tests, measuring the percentage of the source code that the tests cover.
To run all test scripts within the test directory, execute the following command:

```sh
pytest --cov=scripts tests/
```
Testing your source code prior to deployment is crucial to prevent errors from propagating into production. Therefore, as part of the ML workflow, we need a continuous integration platform that can automatically build and test code changes and give immediate feedback on the status of the build. Here we define a *build* as a group of jobs that run in series, and a *job* as an automated step that clones the code repository into a virtual environment and then performs a series of tasks.
There are a number of platforms that enable the continuous integration stage. In this template, we've integrated Travis CI as part of the workflow. Travis CI integrates easily with common repository hosts like GitHub and is relatively straightforward to get started with. Also, for non-concurrent jobs, builds for open-source projects are free of charge.
The integration pipeline is configured to run only when code changes are being merged into the master branch. Thus, once a Pull Request is raised, the build will kick off and the status of the run will be shown in the GitHub PR window. Travis CI provides a convenient dashboard to track runs and access logs.
The Travis CI build runs the `.travis.yml` located in the project repository, which contains a series of commands to execute. In this file, we've defined the operating system and environment used in the build stage. Next, we define the dependencies to install using `poetry`, and finally we run unit tests with `pytest`, along with `autopep8` and `flake8`, to validate the quality of the code.
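A condensed sketch of what such a `.travis.yml` can look like is shown below; the committed file in the repository is authoritative, and the exact script entries here are illustrative:

```yaml
language: python
python:
  - "3.8"
branches:
  only:
    - master
install:
  - pip install poetry
  - poetry install
script:
  - flake8 scripts tests
  - autopep8 --diff --recursive scripts tests
  - pytest --cov=scripts tests/
```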
The most common command for running any Python script is usually:

```sh
python script.py
```
The disadvantage of this approach is that if you build a more complex project with multiple module dependencies, importing code can be difficult. The most common approaches to dealing with this issue are either modifying `sys.path` to include any desired separate folders with modules, or placing the scripts in the root of the repository and storing project-specific code within subfolders. These approaches have several disadvantages:
- Both modifying `sys.path` and placing the scripts at the root of your project hard-code dependencies, making the project harder to reproduce in production. Allowing the Python package to be called from any location makes it much easier to clone into any environment.
- In some cases, `sys.path` must be modified to an absolute path, meaning that sharing the code between people is difficult.
- Finally, storing entrypoints at the root of the repository can create an unreadable mess, or an unnecessary number of COPY commands when cloning only what is needed to production. Keeping scripts within their own folder cleans up the root and helps others identify where the primary entrypoints to the application lie. (The `sys.path` pattern is sketched below.)
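For concreteness, the `sys.path` pattern the list above discourages typically looks like this (a hypothetical entrypoint, shown only to illustrate the problem):

```python
# scripts/train.py -- hard-codes the repository layout into the script:
# the import below only works when run from this exact folder structure.
import os
import sys

sys.path.append(os.path.join(os.path.dirname(__file__), ".."))

from python_ml_template import utils  # breaks if the layout changes
```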
By default, scripts are not installed on your `PYTHONPATH` as modules, which is why the standard approach is to modify `sys.path` within the code. However, you can actually set up pip to install your modules as packages in editable mode so they can be accessed from any location.
The standard approach is to run `pip install -e .` in a folder where you have a `setup.py` file created. The `setup.py` usually specifies metadata about a project: where the project files may be loaded from, what the project is called, and so on.
With Poetry, however, you do not have to worry about this. `pyproject.toml` is meant to act as a complete replacement for the standard `setup.py` file. Poetry also automatically installs your project package in editable mode, meaning that you will be able to access your Python code from anywhere after running `poetry install`. The only requirements are:
- You must have a folder with the name specified in your `pyproject.toml` file under the `tool.poetry.name` property. Make sure that this is a clean Python name (no dashes, etc.). You can directly modify `pyproject.toml` whenever needed; see the sketch after this list.
- Once the folder is created, run `poetry install` for Poetry to install the module in editable mode. It should print a message specifying that it has installed your module.
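A minimal sketch of the relevant `pyproject.toml` fields (the version and description values are illustrative):

```toml
[tool.poetry]
# must match the package folder at the repository root;
# use a clean Python name: underscores, no dashes
name = "python_ml_template"
version = "0.1.0"
description = "Boilerplate for ML projects in Python"
```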
This repository includes an example of such usage: `python scripts/say-hello-world.py` runs the function `utils.say_hello_world` from the `python_ml_template` package.
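The script itself can stay tiny, since the real logic lives in the installed package. A sketch of what it plausibly contains (the repository's copy is authoritative):

```python
# scripts/say-hello-world.py -- works from any directory because
# `poetry install` installed python_ml_template in editable mode.
from python_ml_template import utils

utils.say_hello_world()
```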
In some cases, you may need to install a custom Python package into your project. Poetry makes it relatively straightforward to package and publish a library to PyPI or a private repository. Before you can publish the library, you first need to package it by running `poetry build`, which builds the source and wheel archives. Next, configure the credentials for the repository with `poetry config`; please see the documentation here for further details on the configuration step. Finally, `poetry publish` will publish the package to the remote repository.
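Put together, the publishing workflow looks like the sketch below; the repository name `my-private-repo`, its URL, and the credentials are placeholders:

```sh
# build the sdist and wheel archives into dist/
poetry build

# register a private repository and its credentials (placeholders)
poetry config repositories.my-private-repo https://pypi.example.com/simple/
poetry config http-basic.my-private-repo <username> <password>

# publish to the private repository, or omit -r to publish to PyPI
poetry publish -r my-private-repo
```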