Deep Learning Project Template

This repository is a lightweight yet functional template for deep learning projects. It assumes PyTorch as the deep learning framework, but the layout can easily be adapted to projects built with other frameworks.

Dependencies
Getting Started
Template Layout
Extra Packages
Resources
Future Tasks
Authors
License

Dependencies

required:
    python>=3.7
    numpy>=1.18
    pandas>=1.0
    torch>=1.4
    scikit-learn>=0.22

optional:
    poetry>=0.12
    flake8>=3.7
    pylint>=2.4
    mypy>=0.76
    pytest>=5.3
    GPUtil>=1.4
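
To confirm that an environment meets the required minimums, a quick check along the following lines may help. This is only a sketch: the helper names `version_tuple` and `check_requirements` are hypothetical and not part of the template, the version comparison is deliberately simplified (it ignores pre-release tags), and `importlib.metadata` is assumed to be available, which holds for Python 3.8 and later.

```python
import itertools
from importlib import metadata  # standard library in Python 3.8+

# Minimum versions copied from the "required" list above (Python itself is omitted).
REQUIRED = {
    "numpy": "1.18",
    "pandas": "1.0",
    "torch": "1.4",
    "scikit-learn": "0.22",
}


def version_tuple(version):
    """Turn '1.18.5' or '2.1.0+cu118' into a comparable tuple such as (1, 18, 5)."""
    parts = []
    # Drop any local suffix after '+', then keep the leading digits of each piece.
    for piece in version.split("+")[0].split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(required=REQUIRED):
    """Return {package: status}, where status is 'ok', 'too old (x)' or 'missing'."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Running the script prints one line per required package, so a missing or outdated dependency is easy to spot before starting an experiment.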

Getting Started

You can fork this repo and use it as a template when creating a new repo on GitHub.

Or you can use the template directly from the forked template repo.

Alternatively, you can simply download this repo in zipped format and get started.

Next, install all the dependencies by running the following command in the project root:

make install # or "poetry install"

Finally, you can wrap up the setup by manually installing and updating any packages you'd like. Please refer to the Extra Packages section for some awesome packages.

Template Layout

The project layout with the usage for each folder is shown below:

dl-project-template
.
│
├── LICENSE.md
├── README.md
├── makefile            # makefile for various commands (install, train, pytest, mypy, lint, etc.)
├── mypy.ini            # MyPy type checking configurations
├── pylint.rc           # Pylint code quality checking configurations
├── pyproject.toml      # Poetry project and environment configurations
│
├── data
│   ├── ...             # data reference files (index, readme, etc.)
│   ├── raw             # untreated data directly downloaded from source
│   ├── interim         # intermediate data processing results
│   └── processed       # processed data (features and targets) ready for learning
│
├── notebooks           # Jupyter Notebooks (mostly for data processing and visualization)
├── src
│   ├── ...             # top-level scripts for training, testing and downloading
│   ├── configs         # configuration files for deep learning experiments
│   ├── data_processes  # data processing functions and classes (cleaning, validation, imputation, etc.)
│   ├── modules         # activations, layers, modules, and networks (subclass of torch.nn.Module)
│   ├── optimization    # deep learning optimizers and schedulers
│   └── utilities       # other useful functions and classes
├── tests               # unit tests module for ./src
│
├── docs                # documentation files (*.txt, *.doc, *.jpeg, etc.)
├── logs                # logs for deep learning experiments
└── models              # saved models with optimizer states
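
Git does not track empty directories, so folders such as logs or models may be absent after a fresh clone. The sketch below recreates the folder skeleton shown above using only the standard library. The folder names come from the layout; the helper name `create_layout` is hypothetical and not part of the template.

```python
from pathlib import Path

# Folder names taken from the layout above.
LAYOUT = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "src/configs",
    "src/data_processes",
    "src/modules",
    "src/optimization",
    "src/utilities",
    "tests",
    "docs",
    "logs",
    "models",
]


def create_layout(root):
    """Create every folder of the layout under root and return the created paths."""
    root = Path(root)
    created = []
    for relative in LAYOUT:
        folder = root / relative
        # exist_ok=True makes the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    for folder in create_layout("."):
        print(folder)
```

Because existing folders are left untouched, the script can be run from the project root at any time without affecting files already in place.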

Extra Packages

Data Validation and Cleaning

Great Expectations: data validation, documenting, and profiling
Cerberus: lightweight data validation functionality
PyJanitor: Pandas extension for data cleaning
PyDQC: automatic data quality checking
Feature-engine: transformer library for feature preparation and engineering

Performance and Caching

Numba: JIT compiler that translates Python and NumPy to fast machine code
Dask: parallel computing library
Ray: framework for distributed applications
Modin: parallelized Pandas with Dask or Ray
Vaex: lazy memory-mapping dataframe for big data
Joblib: disk-caching and parallelization
RAPIDS: GPU acceleration for data science

Data Version Control and Workflow

DVC: data version control system
Pachyderm: data pipelining (versioning, lineage/tracking, and parallelization)
d6tflow: effective data workflow
Metaflow: end-to-end independent workflow
Dolt: relational database with version control
Airflow: platform to programmatically author, schedule and monitor workflows
Luigi: dependency resolution, workflow management, visualization, etc.

Visualization and Presentation

Seaborn: data visualization based on Matplotlib
HiPlot: interactive high-dimensional visualization for correlation and pattern discovery
Plotly.py: interactive browser-based graphing library
Altair: declarative visualization based on Vega and Vega-Lite
TabPy: Tableau visualizations with Python
Chartify: easy and flexible charts
Pandas-Profiling: HTML profiling reports for Pandas DataFrames
missingno: toolset of flexible and easy-to-use missing data visualizations and utilities
Yellowbrick: Scikit-Learn visualization for model selection and hyperparameter tuning
FlashTorch: visualization toolkit for neural networks in PyTorch

Project Lifecycles and Hyperparameter Optimization

NNI: automate ML/DL lifecycle (feature engineering, neural architecture search, model compression and hyperparameter tuning)
Comet.ml: self-hosted and cloud-based meta machine learning platform for tracking, comparing, explaining and optimizing experiments and models
MLflow: platform for the ML lifecycle, including experimentation, reproducibility and deployment
Optuna: automatic hyperparameter optimization framework
Hyperopt: serial and parallel optimization
Tune: scalable experiment execution and hyperparameter tuning

PyTorch Extensions

Ignite: high-level library based on PyTorch
PyTorch Lightning: lightweight wrapper for less boilerplate
RaySGD: lightweight wrappers for distributed deep learning
fastai: out-of-the-box tools and models for vision, text, and other data
Skorch: Scikit-Learn interface for PyTorch models
Pyro: deep universal probabilistic programming with PyTorch
Kornia: differentiable computer vision library
DGL: package for deep learning on graphs
PyTorch Geometric: geometric deep learning extension library for PyTorch
Torchmeta: datasets and models for few-shot-learning/meta-learning
PyTorch3D: library for deep learning with 3D data
learn2learn: meta-learning model implementations
higher: higher-order (unrolled first-order) optimization
Captum: model interpretability and understanding
PyTorch summary: Keras style summary for PyTorch models

Miscellaneous

Awesome-Pytorch-list: a comprehensive list of PyTorch-related content on GitHub, such as models, implementations, helper libraries, tutorials, etc.
DoWhy: causal inference combining causal graphical models and potential outcomes
NetworkX: creation, manipulation, and study of complex networks/graphs
Gym: toolkit for developing and comparing reinforcement learning algorithms
Polygames: a platform of zero learning with a library of games
Mlxtend: extensions and helper modules for data analysis and machine learning
NLTK: a leading platform for building Python programs to work with human language data
PyCaret: low-code machine learning library
dabl: baseline library for data analysis
OGB: benchmark datasets, data loaders and evaluators for graph machine learning

Resources

Datasets:

Google Datasets: high-demand public datasets
Google Dataset Search: a search engine for freely-available online data
OpenML: online platform for sharing data, ML algorithms and experiments
DoltHub: data collaboration with Dolt
OpenBlender: live-streamed open data sources

Readings:

Machine Learning Systems Design by Chip Huyen
Rules of Machine Learning: Best Practices for ML Engineering by Martin Zinkevich
Awesome Data Science: an awesome data science repository to learn and apply for real world problems

Other ML/DL Templates:

Cookiecutter Data Science: a logical, reasonably standardized, but flexible project structure
PyTorch Template Project: PyTorch deep learning project template

Future Tasks

ML/DL projects process flowchart
- definition of several major steps
- clarify motivation and deliverables
small example for demonstration (omniglot?)

Authors

Xiaotian Duan (Email: xduan7 at uchicago.edu)

License

This project is licensed under the MIT License - see the LICENSE.md file for more details.
