Repository for my IBM AI/ML project-based certification course.
The Colab notebooks reflect a structured approach to ML/AI problem solving and serve as my general workflow template:
- After the problem is defined and understood, plan data collection and retrieve the data. The Colab files with "Data Retrieval" and "Data Pull" in their names, in that sequence, are the structured template to follow for data-pull best practices
- The Colab files with "Data_Cleaning" in the name are a good template/workflow for cleaning data for ML models
- The Colab files with "EDA" in the name are a good template to guide exploratory data analysis (EDA). Which exploratory techniques to prioritize will depend on the specific work and use case
- The Colab files with "Feature_Engineering" in the name are a guide to feature engineering
Python File Structure Per Workflow
A Files: Data Collection, Cleaning
A1 - IBM_ML_AI_DataRetrieval_SQL_WK2.1
A2 - IBM_ML_AI_WK2_2_Data_Cleaning_Lab
A3 - ibm_ml_ai_datapullsqlwk2_2
A4 - Wk2Data_Cleaning_Lab2
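A minimal sketch of the pull-then-clean flow in these notebooks, assuming a local SQLite database and illustrative table/column names (the actual sources are in the A files):

```python
# Hypothetical data pull + first-pass cleaning (database, table, and columns are illustrative).
import sqlite3
import pandas as pd

# Pull: read a table from a local SQLite database into a DataFrame.
conn = sqlite3.connect("course_data.db")             # hypothetical database file
df = pd.read_sql_query("SELECT * FROM cars", conn)   # hypothetical table name
conn.close()

# Clean: drop duplicates, normalize column names, handle missing/invalid values.
df = df.drop_duplicates()
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["price"] = pd.to_numeric(df["price"], errors="coerce")   # coerce bad entries to NaN
df = df.dropna(subset=["price"])                            # drop rows missing the target
df = df.fillna({"num_doors": df["num_doors"].mode()[0]})    # impute a categorical column

print(df.info())
```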
B Files: Exploratory Data Analysis (EDA)
B1 - ML_EDA_FLOW_Wk3_1c_EDA
B2 - Week3_1c_EDA
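A minimal EDA sketch in the spirit of the B files, assuming a cleaned DataFrame `df` with a numeric target `price` and a numeric feature `horsepower` (illustrative names):

```python
# Quick-look EDA: summary stats, missingness, distributions, and correlations.
import matplotlib.pyplot as plt
import seaborn as sns

print(df.describe())        # summary statistics for numeric columns
print(df.isna().mean())     # fraction of missing values per column

# Distribution of the target, and its relationship to one numeric feature.
sns.histplot(df["price"], kde=True)
plt.show()
sns.scatterplot(data=df, x="horsepower", y="price")
plt.show()

# Correlation heatmap over numeric columns to spot strongly related features.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```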
C Files: Feature Engineering and other Data Preprocessing
C1 - Wk31d_Feature_Engineering
C2 - Wk3_Feature_Engineering2_PCA
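A minimal feature-engineering sketch along the lines of C1/C2, assuming the same illustrative `df` with target `price`: one-hot encode categoricals, scale, then reduce dimensionality with PCA:

```python
# Encode, scale, and project features with PCA (column names are illustrative).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)  # one-hot encode categoricals
y = df["price"]

X_scaled = StandardScaler().fit_transform(X)   # PCA expects zero-mean, unit-variance inputs

pca = PCA(n_components=0.95)                   # keep enough components for ~95% of variance
X_pca = pca.fit_transform(X_scaled)
print(X.shape, "->", X_pca.shape)
print(pca.explained_variance_ratio_[:5])
```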
D Files: Feature Engineering / Hypothesis Testing (Preprocessing, part 2)
D1 - Wk41e_Hypothesis_Testing
D2 - Wk41f_HypothesisTesting_2
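A minimal hypothesis-testing sketch in the spirit of the D files, assuming the illustrative `df` with a categorical `fuel_type` column and numeric `price` (a two-sample Welch's t-test):

```python
# Two-sample t-test: is mean price different between two groups? (illustrative columns)
from scipy import stats

diesel = df.loc[df["fuel_type"] == "diesel", "price"]
gas    = df.loc[df["fuel_type"] == "gas", "price"]

t_stat, p_value = stats.ttest_ind(diesel, gas, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 (equal means) at the 5% level.")
else:
    print("Fail to reject H0 at the 5% level.")
```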
E Files: Regression - Train/Test Split (simple linear regression) & Polynomial Regression
Data files for section E: the car price CSV file for simple linear regression; the Ames housing data for polynomial regression; encoded_car_data_PolyFeat.csv. A short sketch of the split-and-fit flow follows the file list below.
E1 - L2_wk1_linear_regression
E2 - 02bL2_Wk2_LAB_Regression_Train_Test_Split
E3 - 02cL2_Wk2_Polynomial_Regression
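A minimal sketch of the E-file workflow, assuming illustrative feature/target columns from the car price data: hold out a test set, fit simple linear regression, then a degree-2 polynomial fit on the same feature:

```python
# Train/test split, simple linear regression, and polynomial regression (illustrative columns).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

X = df[["horsepower"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Simple linear regression on the raw feature.
lr = LinearRegression().fit(X_train, y_train)
print("linear R2:", r2_score(y_test, lr.predict(X_test)))

# Polynomial regression: expand the feature, then fit the same linear model.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_p = poly.fit_transform(X_train)
X_test_p = poly.transform(X_test)
lr_poly = LinearRegression().fit(X_train_p, y_train)
print("poly R2:", r2_score(y_test, lr_poly.predict(X_test_p)))
```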
F Files: Cross Validation, GridSearchCV, and Regularization (Ridge, Lasso, Elastic Net)
Covers standardization, chaining steps with pipelines, transformations, etc. Data files for this section: the encoded car data (see notebooks). F3 and F4 spend time on the pipeline approach to chaining ML steps and on GridSearchCV for hyperparameter selection against validation data, and finally show the impact of PCA (principal component analysis). F4 introduces PCA for dimensionality reduction. A short sketch of the pipeline + grid-search flow follows the file list below.
F1 - 02cL2_Wk3_DEMO_Cross_Validation
F2 - 02dL2_Wk4_DEMO_Regularization
F3 - 02eL2_WK5_LAB_Regularization_jupyterlite.ipynb
F4 - 02eeL2_WK5_Regularization_Techniques.ipynb
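A minimal sketch of that flow, assuming `X_train`/`y_train` from the encoded car data and illustrative grid values:

```python
# Chain scaling, PCA, and a regularized model in a Pipeline, then tune with GridSearchCV.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("model", Ridge()),
])

param_grid = {
    "pca__n_components": [5, 10, 20],          # illustrative; bounded by the feature count
    "model__alpha": [0.01, 0.1, 1.0, 10.0],    # regularization strength
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
```

Swapping Ridge for Lasso or ElasticNet changes only the `model` step and its grid.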
G Files: ML Classification + Advanced Model Types/Techniques: Bagging, Boosting, and other Ensemble methods
Logistic regression error metrics (precision, recall, confusion matrix, etc.), the KNN approach to classification, and support vector machines.
Bagging, Gradient Boosting, XGBoost, and Stacking. A short sketch of the metrics workflow follows the file list below.
G1 - 03aL3_Wk1_Logistic_Regression_Error_Metrics.ipynb
G2 - 03bL3.Wk2_KNN1.ipynb
G3 - 03cL3_Wk3_SVM.ipynb
G4 - 03dL3_Wk3_SVM_RBF.ipynb
G5 - 03dL3_WK3_Decision_Trees.ipynb
G6 - 03e_L3_Wk5_Bagging.ipynb
G7 - 03gL3_Wk5_GradientBoosting_and_Stacking.ipynb
G8 - 3gL3_Wk3_Ada_Boost.ipynb
G9 - 3fL3_WK5_Stacking_Classification.ipynb
G10 - 3g_L3_Wk5_XGBoost.ipynb
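A minimal sketch of the classification + error-metrics workflow, using a built-in scikit-learn dataset for illustration and comparing a baseline classifier with a boosted ensemble:

```python
# Fit a classifier, report error metrics, and compare against gradient boosting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
pred = logreg.predict(X_test)
print(confusion_matrix(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
print("f1:", f1_score(y_test, pred))

gb = GradientBoostingClassifier().fit(X_train, y_train)
print("boosting f1:", f1_score(y_test, gb.predict(X_test)))
```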
H Files: Dealing with imbalanced datasets and typical patterns of imbalanced-data challenges
- Class re-weighting to adjust the impact of each class during model training
- Oversampling and undersampling to generate synthetic samples and rebalance classes
- Evaluating the resulting classifiers with robust metrics such as F-score and AUC (a short sketch follows the file entry below)
H1 - 3hL3Wk6__imbalanced_data.ipynb
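A minimal sketch of the imbalanced-data ideas above, using a synthetic dataset for illustration and comparing class re-weighting against SMOTE oversampling (requires the imbalanced-learn package):

```python
# Class re-weighting vs. SMOTE oversampling, scored with F1 and ROC AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE          # from the imbalanced-learn package

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Class re-weighting: penalize mistakes on the minority class more heavily.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Oversampling: synthesize minority-class examples, then fit an unweighted model.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
oversampled = LogisticRegression(max_iter=1000).fit(X_res, y_res)

for name, model in [("re-weighted", weighted), ("SMOTE", oversampled)]:
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name, "F1:", round(f1_score(y_test, pred), 3),
          "AUC:", round(roc_auc_score(y_test, proba), 3))
```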