IBM ML/AI Certification Projects Repo - EDA, Supervised ML

Primary language: Jupyter Notebook

IBM-ML-AI-CERT-PROJECT - EDA and Supervised Learning:

Repository for my project-based IBM AI/ML certification course.

The Colab notebooks reflect a structured approach to ML/AI problem solving and serve as my general workflow template:

  1. After the problem is defined and understood, plan data collection and retrieve the data. The Colab files with the phrases "Data Retrieval" and "Data Pull" in their names, followed in that order, are the template for data-pull best practices.
  2. The Colab file with "Data_Cleaning" in its name is a template/workflow for cleaning data for ML models.
  3. The Colab file with "EDA" in its name is a template to guide exploratory data analysis (EDA). Which exploratory techniques to prioritize depends on the specific work and use case.
  4. The Colab file with "feature engineering" in its name is a guide to feature engineering.

Python File Structure Per Workflow

A Files: Data Collection, Cleaning

A1 -IBM_ML_AI_DataRetrieval_SQL_WK2.1

A2 -IBM_ML_AI_WK2_2_Data_Cleaning_Lab

A3 -ibm_ml_ai_datapullsqlwk2_2

A4 -Wk2Data_Cleaning_Lab2

B Files: Exploratory Data Analysis (EDA)

B1 -ML_EDA_FLOW_Wk3_1c_EDA

B2 -Week3_1c_EDA

C Files: Feature Engineering and other Data Preprocessing

C1 -Wk31d_Feature_Engineering

C2 -Wk3_Feature_Engineering2_PCA

D Files: Feature Engineering / Hypothesis Testing (Preproc part 2)

D1 -Wk41e_Hypothesis_Testing

D2 -Wk41f_HypothesisTesting_2

E Files: Regression - Train/Test Split (simple linear regression) & Polynomial Regression

Data files for section E: the car price.CSV file for simple linear regression; the Ames_housing data for polynomial regression; encoded_car_data_PolyFeat.csv

E1 - L2_wk1_linear_regression

E2 - 02bL2_Wk2_LAB_Regression_Train_Test_Split

E3 - 02cL2_Wk2_Polynomial_Regression
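The train/test split idea behind the section E notebooks can be sketched without any external library. This is a minimal illustration, not the notebooks' own code: the helper names (`train_test_split`, `fit_line`, `r_squared`) and the synthetic data are assumptions made for this example, and the notebooks themselves may rely on a library implementation instead.

```python
import random


def train_test_split(xs, ys, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly, then hold out a fraction for testing."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(xs) * test_fraction)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [xs[i] for i in train_idx]
    x_test = [xs[i] for i in test_idx]
    y_train = [ys[i] for i in train_idx]
    y_test = [ys[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination on the given (held-out) data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic data (hypothetical): y = 2x + 1 with a small alternating offset.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + ((-1) ** i) * 0.5 for i, x in enumerate(xs)]

x_train, x_test, y_train, y_test = train_test_split(xs, ys)
slope, intercept = fit_line(x_train, y_train)   # fit on training data only
test_r2 = r_squared(x_test, y_test, slope, intercept)  # score on unseen data
print(f"slope={slope:.3f} intercept={intercept:.3f} test R^2={test_r2:.3f}")
```

The point of the split is that the score is computed on rows the model never saw during fitting, so it estimates generalization rather than memorization.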

Standardization, chaining steps using pipelines, transformations etc.

F Files: Cross Validation; Grid_CV and Regularization (using Ridge, Lasso, E-Net)

Data files for section E - encoded car data, see notebooks.

F3 and F4 -> This workflow focuses on the pipeline approach to chaining ML steps and on using grid search to select a model's hyperparameters against validation data, and finally shows the impact of PCA (principal component analysis). F4 introduces PCA for reducing dimensionality.

F2- 02dL2_Wk4_DEMO_Regularization
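The pipeline and grid-search workflow described for the F files can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the notebooks' code: the synthetic dataset, the step names (`scale`, `pca`, `ridge`), and the parameter grid are all hypothetical, and Ridge is used here as just one of the regularized models the section mentions.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the encoded car data).
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Chain scaling, dimensionality reduction, and a regularized model so that
# every cross-validation fold refits all three steps on its own training part.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("pca", PCA()),
        ("ridge", Ridge()),
    ]
)

# Hypothetical grid: the number of PCA components and the Ridge penalty.
param_grid = {
    "pca__n_components": [3, 5, 10],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_params = search.best_params_
test_score = search.score(X_test, y_test)  # R^2 of the refit best pipeline
print(best_params, round(test_score, 3))
```

Putting the scaler and PCA inside the pipeline, rather than transforming the full dataset up front, keeps information from the validation folds out of the fitted preprocessing steps.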

G - ML Classification plus Advanced Model Types/Reg Techniques: Bagging, Boosting and other Ensemble methods

Logistic regression error metrics (precision, recall, confusion matrix, etc.), the KNN approach to classification, and support vector machines.
Bagging, Gradient Boosting, XGBoost, Stacking.
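The error metrics named above follow directly from the four confusion-matrix counts. The sketch below computes them by hand for a binary problem; the function names and the toy label lists are assumptions made for illustration, and the notebooks may use library functions for the same quantities.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical), 1 = positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)          # 3 1 1 3
print(precision, recall, f1)   # 0.75 0.75 0.75
```

Precision answers "of the predicted positives, how many were right", while recall answers "of the actual positives, how many were found"; which one matters more depends on the cost of false positives versus false negatives.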

G1 - 03aL3_Wk1_Logistic_Regression_Error_Metrics.ipynb
G2 - 03bL3.Wk2_KNN1.ipynb
G3 - 03cL3_Wk3_SVM.ipynb
G4 - 03dL3_Wk3_SVM_RBF.ipynb
G5 - 03dL3_WK3_Decision_Trees.ipynb
G6 - 03e_L3_Wk5_Bagging.ipynb

G7 - 03gL3_Wk5_GradientBoosting_and_Stacking.ipynb

G8 - 3gL3_Wk3_Ada_Boost.ipynb

G9 - 3fL3_WK5_Stacking_Classification.ipynb

G10 -3g_L3_Wk5_XGBoost.ipynb

H - Dealing with imbalanced datasets: typical patterns and challenges of imbalanced data

  • Class Re-weighting method to adjust the impacts of different classes in model training processes
  • Oversampling and Undersampling to generate synthetic datasets and rebalance classes
  • Evaluate consolidated classifiers using robust metrics such as F-score and AUC

H1 - 3hL3Wk6__imbalanced_data.ipynb
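Two of the rebalancing ideas listed for section H can be sketched in plain Python. The weights below use the common "balanced" heuristic, n_samples / (n_classes * class_count), and the resampling shown is simple random oversampling by duplication rather than a synthetic-sample method. The function names and the 90/10 toy dataset are assumptions for this example, not taken from the notebook.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy imbalanced dataset (hypothetical): 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
rows = [[float(i)] for i in range(100)]

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
print(weights)              # minority class gets the larger weight
print(Counter(new_labels))  # both classes now have 90 rows
```

Re-weighting leaves the data untouched and changes how much each error costs during training, while oversampling changes the data itself; resampling should be applied to the training split only so that evaluation still reflects the real class balance.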