This is a suggested project setup for a data analysis project. It uses open-source tools to help you streamline your data analysis workflow, maintain a sane and sensible folder structure, and follow best practices. Specifically, it uses:
- Poetry for environment and dependency management
- Nbdev and LineaPy for seamless transitions from messy, exploratory Jupyter notebooks to reusable code and packages with beautiful documentation
- Kedro for datasource management via data catalogs and reproducible/visualizable pipelines
- Prefect to orchestrate and schedule pipelines, with retries and complex error handling
... and other modern utility tools, such as ruff for linting and slipcover for code coverage.
You need Poetry installed globally and Python >=3.8,<3.11.
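As a quick way to check these prerequisites before starting, here is a minimal sketch using only the standard library. The function names are hypothetical and not part of the template; the version bounds are copied from the requirement above, and the Poetry check only looks for a `poetry` executable on your PATH.

```python
import shutil
import sys

# Supported range taken from the requirement above: >=3.8,<3.11
MIN_VERSION = (3, 8)
MAX_VERSION_EXCLUSIVE = (3, 11)


def python_version_supported(version_info=sys.version_info):
    """Return True if (major, minor) falls inside the supported range."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


def poetry_on_path():
    """Return True if a `poetry` executable can be found on PATH."""
    return shutil.which("poetry") is not None


if __name__ == "__main__":
    print("Python version OK:", python_version_supported())
    print("Poetry found on PATH:", poetry_on_path())
```

If either check reports False, install a supported Python version or Poetry before continuing.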
- Click on the "Use this template" button to create your own repository
- Clone your repo locally using git clone <repo url>
- Edit values in change-my-values.yaml; project_name should be your repository's name
- Run the following commands
python create-repository.py
make reset-project-with-install
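Since project_name should match the repository's name, it may help to verify that before running the commands above. The sketch below is an illustration only: the helper names are hypothetical, and it assumes project_name is a plain top-level key in change-my-values.yaml with no trailing comment on that line. It uses a regular expression rather than a YAML parser to stay dependency-free.

```python
import re
from pathlib import Path


def read_project_name(yaml_text):
    """Return the value of a top-level `project_name:` key, or None if absent."""
    match = re.search(r"^project_name:\s*(.+?)\s*$", yaml_text, flags=re.MULTILINE)
    if match is None:
        return None
    # Strip optional surrounding quotes around the value
    return match.group(1).strip("'\"")


def project_name_matches_repo(yaml_text, repo_dir):
    """Compare project_name with the name of the repository folder."""
    return read_project_name(yaml_text) == Path(repo_dir).resolve().name
```

From the repository root, this could be called as `project_name_matches_repo(Path("change-my-values.yaml").read_text(), ".")`.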
Example repository structure when finished:
.
├── conf
│ ├── base
│ ├── local
│ └── README.md
├── create-repository.py
├── data
│ ├── 01_raw
│ ├── 02_intermediate
│ ├── 03_primary
│ ├── 04_feature
│ ├── 05_model_input
│ ├── 06_models
│ ├── 07_model_output
│ └── 08_reporting
├── docs
│ └── source
├── kedro-answers.yml
├── LICENSE
├── logs
├── Makefile
├── MANIFEST.in
├── notebooks
│ ├── analyses
│ ├── exploratory
│ ├── generate_figures
│ └── package
├── poetry.lock
├── _proc
│ ├── 00_core.ipynb
│ ├── _docs
│ ├── index.ipynb
│ ├── nbdev.yml
│ ├── _quarto.yml
│ └── styles.css
├── _pyproject.toml
├── pyproject.toml
├── README_kedro.md
├── README.md
├── settings.ini
├── setup.py
└── src
  ├── requirements.txt
  ├── setup.py
  ├── test_package_project
  └── tests
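To confirm that the setup produced a layout like the one above, a small check along these lines could be used. This is a sketch, not part of the template: the function name is hypothetical, and the directory lists are a subset copied from the example tree, so adjust them if your generated project differs.

```python
from pathlib import Path

# Directories copied from the example tree above (a subset chosen for illustration)
EXPECTED_DIRS = ["conf", "data", "docs", "notebooks", "src"]
DATA_LAYERS = [
    "01_raw", "02_intermediate", "03_primary", "04_feature",
    "05_model_input", "06_models", "07_model_output", "08_reporting",
]


def missing_directories(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    expected = [Path(name) for name in EXPECTED_DIRS]
    expected += [Path("data") / layer for layer in DATA_LAYERS]
    # as_posix() keeps the reported paths identical across operating systems
    return [path.as_posix() for path in expected if not (root / path).is_dir()]


if __name__ == "__main__":
    for path in missing_directories("."):
        print("missing:", path)
```

Running the script from the repository root prints one line per missing directory and nothing if all of them are present.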