DS application demo

As part of the "Daas (Data as a service)" repo, this repo uses Jupyter notebooks (mainly) as the medium for showing step-by-step analysis and ML/DL approaches on various data science subjects. The idea is to demo how a data scientist deals with a new dataset: pre-process the data, do exploratory data analysis (EDA), then run suitable models and offer suggestions with business feasibility and acceptable statistical errors (i.e. the DS workflow: business understanding -> data preprocessing -> EDA -> data understanding -> analysis/modeling). The main focuses of this project are: 1) Statistics/ML analysis 2) ML theory/algorithm explanations 3) Spark op/ML demos
- Daas (Data as a service) repo : Data infra build -> ETL build -> DS application demo
- Airflow Heroku demo : airflow-heroku-dev
- Mlflow Heroku demo : mlflow-heroku-dev
├── DE_course : Code for Udacity data engineer course
├── DL_ : Deep learning related projects
├── DS_algorithms : Build data science models from scratch
├── GPU : GPU related code
├── ML_ : Machine learning related projects
├── README.md
├── R_ : R programming language related projects
├── SPARK_ : Pyspark basics/op/ML/ETL notebook demo projects
├── Statistics_ : Statistics related projects
├── archived : Archived code/projects
├── doc : Docs for quick start, theory papers, pics, and so on
├── ml_demo.py
├── notebook : Jupyter notebook related projects (nb server/magic..)
├── project : Archived projects
├── pytorch_ : Pytorch related projects
├── tensorflow_ : Tensorflow related projects
└── utility : Utility scripts for ML/DL model tuning, DS plots...
- Gradient Descent - Main model optimization algorithm demo (see the sketch after this list)
- Linear Regression - Simplest regression model
- Logistic Regression - Simplest classification model
- Support Vector Machine - Max-margin classification model
- Decision Tree - Simple non-linear regression/classification model
- L1 L2 Regularization - Basic model tuning method
- TF Linear Regression - TF Linear Regression demo
- TF Random Forest - TF Random Forest Classification demo
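For a flavor of the from-scratch style used in these notebooks, below is a minimal batch gradient descent sketch for least-squares linear regression; the data, names, and hyperparameters are illustrative and not taken from the notebooks above.

```python
import numpy as np

def gradient_descent(X, y, lr=0.02, n_iter=5000):
    """Minimal batch gradient descent for least-squares linear regression."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(n_iter):
        error = X @ w + b - y            # residuals of the current fit
        w -= lr * (X.T @ error) / n      # gradient of MSE w.r.t. weights
        b -= lr * error.mean()           # gradient of MSE w.r.t. bias
    return w, b

# Toy usage: recover y = 2x + 1 from noisy samples (illustrative data only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.1, size=100)
w, b = gradient_descent(X, y)
print(w, b)  # should land close to [2.0] and 1.0
```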
- Confidence Intervals - Go through the confidence interval calculation from distributions
- AB TEST Part 1 - Hypothesis Test | P-value | T-test (see the sketch after this list)
- AB TEST Part 2 - Bootstrapping
- TIME SERIES Part 1 - Stationary
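As a pointer to what the A/B-test notebooks walk through, here is a small sketch of a Welch two-sample t-test plus a normal-approximation 95% confidence interval for the lift; the data below is simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # baseline metric (simulated)
treatment = rng.normal(loc=10.3, scale=2.0, size=500)  # variant metric (simulated)

# Welch two-sample t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Normal-approximation 95% CI for the mean difference (lift)
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, 95% CI for lift: [{ci_low:.3f}, {ci_high:.3f}]")
```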
spark op intro
- Pyspark Basic 1 - Basic spark ops (transform & action): RDD, map, flatMap, reduce, filter, distinct, intersection (see the sketch after this list)
- Pyspark Basic 2 - Basic spark ops: load CSV, DataFrame, SparkSQL, transformations in [RDD, DataFrame, SparkSQL]
- Pyspark Basic 3 - Basic spark ops: Spark DataFrame ops
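A quick sketch of the basic ops listed above, using a small in-memory dataset instead of the notebooks' CSV files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-basic-demo").getOrCreate()
sc = spark.sparkContext

# RDD transformations/actions: flatMap / map / reduceByKey / filter / distinct
rdd = sc.parallelize(["a b", "b c", "c d"])
words = rdd.flatMap(lambda line: line.split())        # flatten lines into words
counts = (words.map(lambda w: (w, 1))                 # map each word to (word, 1)
               .reduceByKey(lambda x, y: x + y))      # sum counts per word
print(counts.collect())
print(words.filter(lambda w: w != "a").distinct().count())

# Same data as a DataFrame, queried through SparkSQL
df = spark.createDataFrame(counts, ["word", "cnt"])
df.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, cnt FROM word_counts ORDER BY cnt DESC").show()

spark.stop()
```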
spark ML intro
- Pyspark LinearRegression nb step by step demo - Spark Linear Regression model tutorial
- Pyspark LogisticRegression nb step by step demo - Spark Logistic Regression model tutorial
- Pyspark Pipeline/Index/Encode nb step by step demo - Train a spark ML model via a pipeline with preprocessing (index/encode); see the sketch after this list
- Pyspark Tree models step by step demo - Train spark ML tree models (Decision tree/Random Forest/Gradient boosting)
- Pyspark Kmeans models step by step demo - Train spark ML cluster kmeans models
- Pyspark LinearRegression demo - Train a linear model with Spark ML framework
- Pyspark LinearRegression Grid Search demo - Train a Grid Search tuned linear model with Spark ML framework
- Pyspark Collaborative Filtering demo - Spark recommendation algorithm (ALS model) demo with the movielens dataset
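The pipeline/index/encode pattern referenced above, sketched with invented toy data; this assumes Spark 3.x, where OneHotEncoder accepts inputCols/outputCols, and the column names are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-pipeline-demo").getOrCreate()

# Invented toy data: a categorical column, a numeric column, and a binary label
df = spark.createDataFrame(
    [("US", 34.0, 1.0), ("FR", 21.0, 0.0), ("US", 45.0, 1.0), ("DE", 29.0, 0.0)],
    ["country", "age", "label"],
)

indexer = StringIndexer(inputCol="country", outputCol="country_idx")             # string -> index
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_vec"])   # index -> one-hot
assembler = VectorAssembler(inputCols=["country_vec", "age"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
model = pipeline.fit(df)
model.transform(df).select("features", "label", "prediction").show()

spark.stop()
```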
spark APP
- Pyspark Data Preprocess 1 : PTT - Digest PTT (批踢踢實業坊, a Taiwanese bulletin board) data via Pyspark batch operations.
- Pyspark Data Preprocess 2 : UBER - Digest UBER data via Pyspark batch operations.
- Pyspark Streaming demo 1 - A simple word-count app over a Pyspark stream (see the sketch below)
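The streaming word-count pattern, sketched with the older DStream API over a socket source; host/port are placeholders (e.g. run `nc -lk 9999` in another terminal to feed it lines), and Structured Streaming would be the modern alternative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-word-count")
ssc = StreamingContext(sc, batchDuration=5)            # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)        # placeholder socket source
counts = (lines.flatMap(lambda line: line.split())     # split each line into words
               .map(lambda w: (w, 1))                  # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # count per micro-batch
counts.pprint()

ssc.start()
ssc.awaitTermination()
```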
dev