Data Science Project Template

This template was built after reading the Medium article by khuyentran1401. Forking that repository would have been simpler, but I preferred to build it myself to understand each component. It is designed to be quick and easy to use.

For 'industrial' or more 'business' projects, I still prefer tools like Kedro.

Features and Roadmap

✅ Automatically build repository structure for DS personal projects

✅ Create and build an environment using conda

🔲 Run tests automatically

🔲 Manage configuration variables for data pipelines and projects

✅ Enforce type hints and code quality

🔲 Automatically document code

🔲 Automate code

✅ DVC for Data Management and Experiment Management

To Do

Automate setup of dvc repo and .gitignore
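
One possible shape for the .gitignore part of this item is sketched below. It is not implemented in the template: the helper name `ensure_gitignore_entries` and the patterns in the usage note are assumptions made for illustration. DVC also writes its own .gitignore entries for the paths it tracks, so a real setup may need less than this.

```python
from __future__ import annotations

from pathlib import Path


def ensure_gitignore_entries(repo_root: Path, entries: list[str]) -> list[str]:
    """Append missing entries to .gitignore and return the ones that were added.

    Hypothetical helper: safe to call repeatedly, existing lines are kept.
    """
    gitignore = repo_root / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    present = {line.strip() for line in existing}

    missing = []
    for entry in entries:
        if entry not in present:
            missing.append(entry)
            present.add(entry)  # also guards against duplicates in `entries`

    if missing:
        # Rewrite the file with the old lines first, then the new ones.
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return missing
```

A call such as `ensure_gitignore_entries(Path("."), ["__pycache__/", ".ipynb_checkpoints/"])` would add only the patterns that are not already listed.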

Tools used

Conda: package, dependency and environment management
pre-commit: framework for managing and maintaining multi-language pre-commit hooks

Template Structure

.
├── config                       # Project configuration files
│   └── environment.yml          # Environment file for conda
├── data                         # Local project data (not committed to version control)
│   ├── 01_raw                   # Raw immutable data
│   ├── 02_primary               # Domain model data
│   ├── 03_feature               # Model features
│   ├── 04_model_input           # Often called 'master tables'
│   ├── 05_model_output          # Data generated by model runs
│   └── 06_reporting             # Ad hoc descriptive cuts
├── docs                         # Project documentation
├── models                       # Trained and serialized models
├── notebooks                    # Project-related Jupyter notebooks (experimental code before it moves to src)
├── README.md                    # Project README
└── src                          # Project source code
    └── main.py
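
Cookiecutter generates this tree from the repository, so nothing needs to be created by hand. For readers who want to see the layout expressed in code, or recreate it in an existing project, here is a minimal sketch. The function name `create_layout` is an assumption, and the sketch creates only the directories, not files such as environment.yml, README.md or main.py.

```python
from __future__ import annotations

from pathlib import Path

# Directory names copied from the tree above.
TOP_LEVEL_DIRS = ["config", "docs", "models", "notebooks", "src"]
DATA_LAYERS = [
    "01_raw",
    "02_primary",
    "03_feature",
    "04_model_input",
    "05_model_output",
    "06_reporting",
]


def create_layout(root: Path) -> list[Path]:
    """Create the template's directories under `root` and return their paths.

    Hypothetical helper: existing directories are left untouched.
    """
    directories = [root / name for name in TOP_LEVEL_DIRS]
    directories += [root / "data" / layer for layer in DATA_LAYERS]
    for directory in directories:
        directory.mkdir(parents=True, exist_ok=True)
    return directories
```

Calling `create_layout(Path("my-project"))` twice is harmless, because `exist_ok=True` makes the second call a no-op.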

How to use this template

Install Cookiecutter:

pip install cookiecutter

Create a project based on the template:

cookiecutter https://github.com/radema/datascience-personal-templates

Activate the new environment:

conda activate {{cookiecutter.environment_name}}

Run the setup from a terminal:

cd {{cookiecutter.repository-name}} && make setup