Python data analysis project structure template

Introduction

This is a suggested setup for a data analysis project. It uses open source tools to help you streamline your data analysis workflow, maintain a sensible folder structure, and follow best practices. Specifically, it uses:

  • Poetry for environment and dependency management
  • Nbdev and LineaPy for seamless transitions from messy, exploratory Jupyter notebooks to reusable code and packages with beautiful documentation
  • Kedro for datasource management via data catalogs and reproducible/visualizable pipelines
  • Prefect to orchestrate and schedule pipelines, with retries and complex error handling

... and other modern utilities, such as ruff for linting and slipcover for code coverage.

Getting started

Prerequisites

You need Poetry installed globally and a Python version >=3.8,<3.11.
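
The prerequisites above can be checked before running the setup steps. The following is a minimal sketch, not part of the template itself: the helper names `python_version_ok` and `poetry_available` are hypothetical, and the check assumes that "installed globally" means a `poetry` executable is reachable on your PATH.

```python
import shutil
import sys


def python_version_ok(version=None):
    """Return True if the (major, minor) version satisfies >=3.8,<3.11."""
    # Default to the running interpreter when no version tuple is given.
    major, minor = (version or sys.version_info)[:2]
    return (3, 8) <= (major, minor) < (3, 11)


def poetry_available():
    """Return True if a `poetry` executable is found on PATH (assumption)."""
    return shutil.which("poetry") is not None


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Poetry on PATH:", poetry_available())
```

Run it with the interpreter you intend to use for the project. If either check reports False, fix that prerequisite before continuing.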

Setup steps

  1. Click the "Use this template" button to create your own repository

  2. Clone your repo locally using git clone <repo url>

  3. Edit the values in change-my-values.yaml; project_name should be your repository's name

  4. Run the following commands:

     python create-repository.py
     make reset-project-with-install

Example repository structure when finished

.
├── conf
│   ├── base
│   ├── local
│   └── README.md
├── create-repository.py
├── data
│   ├── 01_raw
│   ├── 02_intermediate
│   ├── 03_primary
│   ├── 04_feature
│   ├── 05_model_input
│   ├── 06_models
│   ├── 07_model_output
│   └── 08_reporting
├── docs
│   └── source
├── kedro-answers.yml
├── LICENSE
├── logs
├── Makefile
├── MANIFEST.in
├── notebooks
│   ├── analyses
│   ├── exploratory
│   ├── generate_figures
│   └── package
├── poetry.lock
├── _proc
│   ├── 00_core.ipynb
│   ├── _docs
│   ├── index.ipynb
│   ├── nbdev.yml
│   ├── _quarto.yml
│   └── styles.css
├── _pyproject.toml
├── pyproject.toml
├── README_kedro.md
├── README.md
├── settings.ini
├── setup.py
└── src
    ├── requirements.txt
    ├── setup.py
    ├── test_package_project
    └── tests
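
Once setup has finished, you may want to confirm that the numbered data folders shown in the tree above were created. The following is a minimal sketch: the folder names are copied from the example tree, while the helper `missing_data_layers` and the constant `DATA_LAYERS` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Folder names copied from the example repository tree above.
DATA_LAYERS = [
    "01_raw",
    "02_intermediate",
    "03_primary",
    "04_feature",
    "05_model_input",
    "06_models",
    "07_model_output",
    "08_reporting",
]


def missing_data_layers(project_root):
    """Return the data folders that do not exist under <project_root>/data."""
    data_dir = Path(project_root) / "data"
    return [name for name in DATA_LAYERS if not (data_dir / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_data_layers(".")
    print("Missing data folders:", missing or "none")
```

An empty result means all eight folders are present. Any names returned point to folders that were not created or have since been removed.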