- Please open all .ipynb notebooks in Google Colab, because GitHub has trouble rendering the interactive visualization packages.
- This project performs binary classification: the main idea is to build an algorithm that can distinguish Financial from Tech billionaires using Forbes API data.
- The project can help journalists fill in information about upcoming billionaires, and the accompanying EDA helps spot correlations between different pieces of open-source information about billionaires/millionaires that could be useful in social studies.
- Packages: pandas, catboost, shap, sklearn, google, numpy, matplotlib, seaborn, plotly, missingno, os, datetime, imblearn
- Data: Public Forbes API https://rapidapi.com/snldnc-kpCtDKbxo_F/api/forbes-worlds-billionaires-list/pricing
- This project was built in Google Colab; therefore, to view the Plotly plots you should open the notebooks in Colab.
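For convenience, here is a minimal Colab install cell for the dependencies from the package list above that Colab does not ship with by default. The exact set is my assumption; note that `sklearn` installs as `scikit-learn`, `imblearn` as `imbalanced-learn`, and `os`/`datetime` are part of the Python standard library.

```python
# Colab cell: install the packages that are (assumed) not preinstalled.
# pandas, numpy, matplotlib, seaborn, plotly and sklearn come with Colab.
!pip install catboost shap missingno imbalanced-learn
```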
- This GitHub repository contains two notebooks. The first, `ForbesData.ipynb`, handles the data extraction from the Forbes API. The second, `ForbesBinaryClassification.ipynb`, contains the EDA and the model-building process.
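For context, a minimal sketch of what the extraction step in `ForbesData.ipynb` could look like with `requests`. The endpoint path and the key names are assumptions based on typical RapidAPI conventions; check the API page linked above for the actual values.

```python
import requests
import pandas as pd

# Hypothetical endpoint path; the real one is listed on the RapidAPI page.
URL = "https://forbes-worlds-billionaires-list.p.rapidapi.com/billionaires"
HEADERS = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",  # personal key from rapidapi.com
    "X-RapidAPI-Host": "forbes-worlds-billionaires-list.p.rapidapi.com",
}

response = requests.get(URL, headers=HEADERS, timeout=30)
response.raise_for_status()
payload = response.json()

# Depending on the payload shape, you may need to index into a key first
# before flattening, e.g. payload["personList"]["personsLists"].
df = pd.json_normalize(payload)
df.to_csv("forbes_billionaires.csv", index=False)
```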
- Installation & Import of required libraries
- Structure Investigation
- Exploratory Data Analysis (EDA)
- Correlation Analysis
- Preprocessing dataframe for building models
- Baseline models
- Fine-tuning
- Interpretability
- Ridge Classifier
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- AdaBoost Classifier
- CatBoost Classifier
- Voting Classifier (Random Forest + Gradient Boosting, hard voting)
- Voting Classifier (Random Forest + Gradient Boosting, soft voting)
- Bar plot of the models' test accuracies
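A condensed sketch of how such a baseline comparison can be set up. The Forbes features are replaced here by synthetic stand-in data so the snippet runs on its own, and the hyperparameters are library defaults rather than the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt

# Stand-in data; in the notebook this is the preprocessed Forbes dataframe.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

rf = RandomForestClassifier(random_state=42)
gb = GradientBoostingClassifier(random_state=42)

models = {
    "Ridge": RidgeClassifier(),
    "LogReg": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": rf,
    "GradBoost": gb,
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "CatBoost": CatBoostClassifier(verbose=0),
    "Voting (hard)": VotingClassifier([("rf", rf), ("gb", gb)], voting="hard"),
    "Voting (soft)": VotingClassifier([("rf", rf), ("gb", gb)], voting="soft"),
}

# Fit every model and record its accuracy on the held-out test set.
scores = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}

# Bar plot of the test accuracies, as in the notebook.
plt.bar(scores.keys(), scores.values())
plt.ylabel("Test accuracy")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```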
- RandomForestClassifier was chosen for fine-tuning. `GridSearchCV` was used to find the optimal parameters. The tuned parameters:
  - `n_estimators` - the number of trees in the forest.
  - `max_depth` - the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples.
  - `min_samples_split` - the minimum number of samples required to split an internal node.
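A minimal sketch of that grid search. The value ranges are illustrative, not the ones used in the notebook, and `X_train`/`y_train` are assumed to come from the preprocessing step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative value ranges; the notebook's actual grid may differ.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```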
SHAP (SHapley Additive exPlanations) is probably the state of the art in machine learning explainability. The algorithm was first published in 2017 by Lundberg and Lee, and it is a brilliant way to reverse-engineer the output of any predictive algorithm. In a nutshell, SHAP values are used whenever you have a complex model (a gradient boosting model, a neural network, or anything that takes some features as input and produces predictions as output) and you want to understand what decisions the model is making.
Original Paper about SHAP values
- Feature importance calculated by SHAP value
- How features influence decision-making in different cases (row[1])
- How features influence decision-making in different cases (row[20])
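These plots can be reproduced roughly as follows: a sketch assuming the tuned Random Forest from the grid search above. Note that the shape of `TreeExplainer`'s output for binary classifiers varies between SHAP versions, so the indexing below may need adjusting.

```python
import shap

# Explain the tuned forest on the test set.
explainer = shap.TreeExplainer(grid.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Older SHAP versions return one array per class for classifiers;
# pick the positive class for plotting.
sv = shap_values[1] if isinstance(shap_values, list) else shap_values
base = explainer.expected_value
base = base[1] if hasattr(base, "__len__") else base

# Global feature importance (mean |SHAP value| per feature).
shap.summary_plot(sv, X_test)

# Local explanations for single rows, e.g. rows 1 and 20 as above.
shap.initjs()
shap.force_plot(base, sv[1], X_test[1])
shap.force_plot(base, sv[20], X_test[20])
```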
[1] https://towardsdatascience.com/shap-explained-the-way-i-wish-someone-explained-it-to-me-ab81cc69ef30
[3] https://rapidapi.com/snldnc-kpCtDKbxo_F/api/forbes-worlds-billionaires-list/pricing
[4] https://images.forbes.com/the-forbes-400/the-forbes-400-thumbnail.jpg
[5] https://www.kaggle.com/code/dansbecker/advanced-uses-of-shap-values