Primary language: Jupyter Notebook. License: MIT.

Data Science Toolbox

Examples and illustrations of basic statistical concepts, probability distributions, Monte Carlo simulation, preprocessing and visualization techniques, and statistical testing.

This repo is divided into multiple sections. Each section focuses on a group of concepts, operations, or data science toolkits. A brief summary is presented below (in alphabetical order):

  • algorithm: rudimentary algorithms and data structures, using LeetCode problems for practice. Solutions are provided and commented. Problems are grouped by subject (BFS, DFS, trees, etc.).
  • case_study: small datasets that exemplify the regression/classification workflow, from data cleaning to feature engineering, modeling, training, and evaluation.
  • cheatsheet: useful derivations of commonly used formulas, for future review.
  • data_structure: standalone overview of data structures in Python (union-find, trie, etc.).
  • distribution: common probability distributions, with their PDFs, CDFs, and simulation methods.
  • handy_syntax: Python tricks that are often useful but hard to memorize.
  • keras: the high-level TensorFlow API, with basic use cases.
  • models: more in-depth study of specific ML models.
  • preprocessing: data cleaning, smoothing, pipelining, etc.
  • simulation: interesting simulation experiments as brain exercises.
  • statistic_test: a catalogue of commonly used statistical tests and their implementations in Python.
  • tensorflow: plain TensorFlow for deep learning tasks.
  • training: tricks to speed up training of TensorFlow models.
  • unittest: standard protocols for Python unit tests.
  • unix: keeping track of Python environments for easy restoration.
  • visualization: example use cases of matplotlib, plotly, ggplot, and animation.
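
To give a flavor of the Monte Carlo simulation material listed above, here is a minimal sketch that estimates pi by random sampling. It is not taken from the repo's notebooks: the function name `estimate_pi` and its `n_samples` and `seed` parameters are hypothetical, chosen only for illustration, and it uses nothing beyond the Python standard library.

```python
import math
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    Hypothetical helper for illustration; not part of the repo.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        # Count points that fall inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi/4, so the hit fraction approximates pi/4.
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    estimate = estimate_pi(100_000)
    print(f"estimate = {estimate:.4f}, error = {abs(estimate - math.pi):.4f}")
```

The error of this kind of estimate shrinks roughly in proportion to one over the square root of `n_samples`, so each additional digit of accuracy costs about 100 times more samples.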