/causal-discovery

Machine learning experiments on causal discovery

Deep Learning Project Template

This is a lightweight yet functional template for deep learning projects. It assumes PyTorch as the deep learning framework, but the layout can be adapted to projects built on other frameworks.

Dependencies

required:
    python>=3.7
    numpy>=1.18
    pandas>=1.0
    torch>=1.4
    scikit-learn>=0.22

optional:
    poetry>=0.12
    flake8>=3.7
    pylint>=2.4
    mypy>=0.76
    pytest>=5.3
    GPUtil>=1.4
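
To confirm that an environment satisfies the required versions listed above, a small check along the following lines may help. This is a minimal sketch, not part of the template: the names `REQUIRED`, `parse_version`, and `check_requirements` are hypothetical, the version comparison is deliberately simplified (numeric components only), and it assumes Python 3.8 or newer because it relies on `importlib.metadata`.

```python
import re
from importlib import metadata

# Minimum versions copied from the "required" list above.
# The dictionary itself is an illustrative assumption, not a file in the template.
REQUIRED = {
    "numpy": "1.18",
    "pandas": "1.0",
    "torch": "1.4",
    "scikit-learn": "0.22",
}


def parse_version(text):
    """Turn '1.18.5' or '2.1.0+cu118' into a tuple of ints such as (1, 18, 5).

    Simplified on purpose: parsing stops at the first component that does
    not start with a digit, and pre-release suffixes are ignored.
    """
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def check_requirements(required=REQUIRED):
    """Return {package: status} where status reports the installed version."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = f"{installed} (ok)"
        else:
            report[name] = f"{installed} (needs >={minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Running the script prints one line per required package. For strict PEP 440 comparisons, a dedicated version-parsing library would be a safer choice than the simplified tuple comparison used here.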

Getting Started

You can fork this repo and use it as a template when creating a new repo on GitHub.

Or use the template directly from the forked template repo.

Alternatively, you can simply download this repo in zipped format and get started.

Next, install all the dependencies by running the following command in the project root:

make install # or "poetry install"

Finally, you can wrap up the setup by manually installing and updating any packages you'd like. Please refer to the Extra Packages section for some awesome packages.

Template Layout

The project layout with the usage for each folder is shown below:

dl-project-template
.
│
├── LICENSE.md
├── README.md
├── makefile            # makefile for various commands (install, train, pytest, mypy, lint, etc.)
├── mypy.ini            # MyPy type checking configurations
├── pylint.rc           # Pylint code quality checking configurations
├── pyproject.toml      # Poetry project and environment configurations
│
├── data
│   ├── ...             # data reference files (index, readme, etc.)
│   ├── raw             # untreated data directly downloaded from source
│   ├── interim         # intermediate data processing results
│   └── processed       # processed data (features and targets) ready for learning
│
├── notebooks           # Jupyter Notebooks (mostly for data processing and visualization)
├── src
│   ├── ...             # top-level scripts for training, testing and downloading
│   ├── configs         # configuration files for deep learning experiments
│   ├── data_processes  # data processing functions and classes (cleaning, validation, imputation etc.)
│   ├── modules         # activations, layers, modules, and networks (subclass of torch.nn.Module)
│   ├── optimization    # deep learning optimizers and schedulers
│   └── utilities       # other useful functions and classes
├── tests               # unit tests module for ./src
│
├── docs                # documentation files (*.txt, *.doc, *.jpeg, etc.)
├── logs                # logs for deep learning experiments
└── models              # saved models with optimizer states
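
If you start from an empty directory rather than a fork, the folder structure above can be recreated with a short script. The following is a minimal sketch under stated assumptions: `LAYOUT` and `scaffold` are hypothetical names that are not part of the template, and only the folders from the layout are created (files such as LICENSE.md, makefile, or pyproject.toml are left to you).

```python
from pathlib import Path

# Folder names taken from the layout above.
# Top-level files (LICENSE.md, makefile, etc.) are intentionally not created.
LAYOUT = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "src/configs",
    "src/data_processes",
    "src/modules",
    "src/optimization",
    "src/utilities",
    "tests",
    "docs",
    "logs",
    "models",
]


def scaffold(root):
    """Create every folder in LAYOUT under `root` and return the created paths.

    Existing folders are left untouched, so the function is safe to re-run.
    """
    root = Path(root)
    created = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    for folder in scaffold("dl-project-template"):
        print(folder)
```

Because `mkdir` is called with `exist_ok=True`, running the script inside an existing project only adds the folders that are missing.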

Extra Packages

Data Validation and Cleaning

  • Great Expectations: data validation, documenting, and profiling
  • Cerberus: lightweight data validation functionality
  • PyJanitor: Pandas extension for data cleaning
  • PyDQC: automatic data quality checking
  • Feature-engine: transformer library for feature preparation and engineering

Performance and Caching

  • Numba: JIT compiler that translates Python and NumPy to fast machine code
  • Dask: parallel computing library
  • Ray: framework for distributed applications
  • Modin: parallelized Pandas with Dask or Ray
  • Vaex: lazy memory-mapping dataframe for big data
  • Joblib: disk-caching and parallelization
  • RAPIDS: GPU acceleration for data science

Data Version Control and Workflow

  • DVC: data version control system
  • Pachyderm: data pipelining (versioning, lineage/tracking, and parallelization)
  • d6tflow: effective data workflow
  • Metaflow: end-to-end independent workflow
  • Dolt: relational database with version control
  • Airflow: platform to programmatically author, schedule and monitor workflows
  • Luigi: dependency resolution, workflow management, visualization, etc.

Visualization and Presentation

  • Seaborn: data visualization based on Matplotlib
  • HiPlot: interactive high-dimensional visualization for correlation and pattern discovery
  • Plotly.py: interactive browser-based graphing library
  • Altair: declarative visualization based on Vega and Vega-Lite
  • TabPy: Tableau visualizations with Python
  • Chartify: easy and flexible charts
  • Pandas-Profiling: HTML profiling reports for Pandas DataFrames
  • missingno: toolset of flexible and easy-to-use missing data visualizations and utilities
  • Yellowbrick: Scikit-Learn visualization for model selection and hyperparameter tuning
  • FlashTorch: visualization toolkit for neural networks in PyTorch

Project Lifecycles and Hyperparameter Optimization

  • NNI: automate ML/DL lifecycle (feature engineering, neural architecture search, model compression and hyperparameter tuning)
  • Comet.ml: self-hosted and cloud-based meta machine learning platform for tracking, comparing, explaining and optimizing experiments and models
  • MLflow: platform for the ML lifecycle, including experimentation, reproducibility and deployment
  • Optuna: automatic hyperparameter optimization framework
  • Hyperopt: serial and parallel optimization
  • Tune: scalable experiment execution and hyperparameter tuning

PyTorch Extensions

  • Ignite: high-level library based on PyTorch
  • PyTorch Lightning: lightweight wrapper for less boilerplate
  • RaySGD: lightweight wrappers for distributed deep learning
  • fastai: out-of-the-box tools and models for vision, text, and other data
  • Skorch: Scikit-Learn interface for PyTorch models
  • Pyro: deep universal probabilistic programming with PyTorch
  • Kornia: differentiable computer vision library
  • DGL: package for deep learning on graphs
  • PyTorch Geometric: geometric deep learning extension library for PyTorch
  • Torchmeta: datasets and models for few-shot-learning/meta-learning
  • PyTorch3D: library for deep learning with 3D data
  • learn2learn: meta-learning model implementations
  • higher: higher-order (unrolled first-order) optimization
  • Captum: model interpretability and understanding
  • PyTorch summary: Keras style summary for PyTorch models

Miscellaneous

  • Awesome-Pytorch-list: a comprehensive list of PyTorch-related content on GitHub, such as models, implementations, helper libraries, and tutorials
  • DoWhy: causal inference combining causal graphical models and potential outcomes
  • NetworkX: creation, manipulation, and study of complex networks/graphs
  • Gym: toolkit for developing and comparing reinforcement learning algorithms
  • Polygames: a platform of zero learning with a library of games
  • Mlxtend: extensions and helper modules for data analysis and machine learning
  • NLTK: a leading platform for building Python programs to work with human language data
  • PyCaret: low-code machine learning library
  • dabl: baseline library for data analysis
  • OGB: benchmark datasets, data loaders and evaluators for graph machine learning

Resources

Datasets:

Readings:

Other ML/DL Templates:

Future Tasks

  • ML/DL projects process flowchart
    • definition of several major steps
    • clarify motivation and deliverables
  • small example for demonstration (Omniglot?)

Authors

  • Xiaotian Duan (Email: xduan7 at uchicago.edu)

License

This project is licensed under the MIT License - see the LICENSE.md file for more details.