Data challenge at Insight link
4. Mercedes-Benz Greener Manufacturing - a regression problem and its solution
Situation:
- Assembled automobiles need to be tested to ensure safety and reliability.
- Testing is a time-consuming process.
- Different cars have different configurations/features.
Task: how to cut the testing time using an algorithmic approach?
Action: using regression models to identify the key features that affect testing time.
Results: quantified how strongly individual features correlate with testing time.
Takeaways:
- Key features were identified. Effort should be prioritized on optimizing those key features.
- The top three features (ID, X314, and X315) together account for more than 40% of the testing time.
- Feature X314 alone accounted for 35.8% of the testing time, about 36 seconds on average.
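The feature-ranking step above can be sketched with a random forest regressor and its built-in importance scores. The data here is synthetic (the real competition data has anonymized binary configuration flags and a testing-time target), and the driving features are chosen arbitrarily for illustration:

```python
# Sketch: ranking features by importance with a random forest regressor.
# Synthetic stand-in for the anonymized competition data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 10)).astype(float)  # binary config flags
# Hypothetical: testing time driven mostly by features 3 and 5.
y = 100 + 36 * X[:, 3] + 12 * X[:, 5] + rng.normal(0, 1, 500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
for i in ranking[:3]:
    print(f"feature {i}: importance {model.feature_importances_[i]:.3f}")
```

With real data the same ranking identifies which configuration options dominate testing time, so optimization effort can be focused there.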
3. IEEE-CIS Fraud Detection - a classification problem and its solution
Situation:
- Credit card fraud is a common form of financial fraud, especially during the pandemic.
- Shopping for everything online is the new norm.
Task: how to maximize transaction security with minimal hassle to clients?
Action: developing a predictive model based on binary-classification machine learning algorithms.
Results: maximized the detection rate of fraudulent activities while minimizing the number of false alarms (false-positive events).
Takeaways:
- For fraud detection, both precision and recall need to be considered for evaluating model performance.
- High precision - fewer false alarms (false positives) - better user experience - favorable for large banks handling a huge number of transactions.
- High recall - fewer missed frauds (false negatives) - less financial loss - favorable for small banks with a limited number of transactions.
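The precision/recall trade-off above can be made concrete on a toy confusion matrix (labels and predictions here are hypothetical, with 1 = fraud):

```python
# Sketch: precision vs recall on a toy fraud-detection example.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # 2 TP, 1 FN, 1 FP

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(precision, recall)
```

Raising the decision threshold typically trades recall for precision, which is why the choice depends on whether false alarms or missed frauds are costlier.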
Version 4 (Latest) and corresponding hyperparameter analysis
- Improvement: data normalization, model optimization
- Note: due to the large size of the data set, computation-heavy steps were skipped, including cross-validation, learning curves, and fine-tuning of model hyperparameters.
Previous versions:
Version 3
- Improvement: feature selection
- To do: model optimization, data normalization, learning curve, cross-validation
Version 2
- Improvement: data cleaning
- To do: feature selection
Version 1
- To do: data cleaning, feature selection
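One common way to do the data-normalization step mentioned in the latest version is z-score scaling; a minimal sketch with scikit-learn's StandardScaler (toy numbers, not the competition data):

```python
# Sketch: z-score normalization, one option for the normalization step.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column centered at ~0
print(X_scaled.std(axis=0))   # each column scaled to unit variance
```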
2. Ames House Price Prediction (model fitting practice) - analysis - regression
- Random forest regression, RMSE score 0.18125.
- RMSE (Root Mean Squared Error): lower score is better, testing score provided by Kaggle.
- To do: exploratory data analysis and feature selection
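For reference, the RMSE metric above is straightforward to compute by hand; this competition's leaderboard scores RMSE on the logarithm of the sale price, so a toy example with made-up prices looks like:

```python
# Sketch: RMSE on log-transformed sale prices (toy values).
import numpy as np

y_true = np.log(np.array([200000.0, 150000.0, 320000.0]))
y_pred = np.log(np.array([210000.0, 140000.0, 300000.0]))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 5))
```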
1. RMS Titanic Survival Prediction (testing water) - analysis - classification
- Random forest classification, accuracy score 0.77033.
- Accuracy score: higher score is better, testing score provided by Kaggle.
- To do: Monte Carlo simulation to impute missing data, especially passenger ages.
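The Monte Carlo idea in the to-do above can be sketched as sampling missing ages from the empirical distribution of the observed ages (the ages below are made up for illustration):

```python
# Sketch: Monte Carlo imputation of missing ages by resampling
# from the observed-age distribution.
import numpy as np

rng = np.random.default_rng(42)
ages = np.array([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, np.nan, 27.0])

observed = ages[~np.isnan(ages)]     # ages we actually know
missing = np.isnan(ages)             # mask of gaps to fill
ages[missing] = rng.choice(observed, size=missing.sum())
print(ages)
```

Resampling preserves the observed age distribution, unlike filling with a single mean or median value.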
Data sets from kaggle.com
- (4) Mercedes-Benz Greener Manufacturing https://www.kaggle.com/c/mercedes-benz-greener-manufacturing
- (3) IEEE-CIS Fraud Detection https://www.kaggle.com/c/ieee-fraud-detection/
- (2) Ames House Price Prediction https://www.kaggle.com/c/house-prices-advanced-regression-techniques
- (1) RMS Titanic Survival Prediction https://www.kaggle.com/c/titanic
Stock Photos from unsplash.com
- Car assembly line by Lenny Kuhne: https://unsplash.com/photos/jHZ70nRk7Ns
- Business man with a credit card by rupixen: https://unsplash.com/photos/Q59HmzK38eQ