The paper was presented by Dr. Lahouari Ghouti at Prince Sultan University in the Software Analytics course (SE480).
The implementation was benchmarked on the Ant 1.7 public software metrics dataset. According to the paper, this dataset achieved the highest performance with the proposed APE.
- Software Defect Prediction using Ensemble Learning on Selected Features
- Table of contents
- Approach
- Ensembles Comparison
- Contribute
The approach for this implementation was to follow the research paper as closely as possible, covering almost all of the presented techniques such as Feature Selection, Stratified K-Fold Cross Validation, and more.
The paper presented an AI system that helps in Software Defect Prediction by applying a Heterogeneous Ensemble.
A Heterogeneous Ensemble is an ML technique in which several models are trained on the same dataset. These models can differ in learning algorithm or hyperparameters, or be identical. After training, a Voting Classifier combines all of the trained models in order to apply voting.
The voting was weighted by the models' performance (a Weighted Average Ensemble), meaning that models with higher weights have more influence on the final vote.
The paper proposed using 7 models for building the voting system. As the paper states, 7 was the optimal number of models for this solution, as using more models negatively impacted performance. Also, the number of models should be odd to avoid tied votes.
The models used were the following:

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression

SVC(probability=True)
MultinomialNB()
BernoulliNB()
RandomForestClassifier()
GradientBoostingClassifier()
SGDClassifier(loss='log')
LogisticRegression()
```
Hyperparameter tuning was done using GridSearchCV to find the best performing configuration for each model.
As the paper title states, Feature Selection was done to increase performance and get rid of unnecessary features. The paper compared different selection approaches such as Fisher, Chi, and Greedy. Based on the results, the Greedy method reported the best results and showed that only a small number of features contributes to the output.
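For contrast with the greedy approach used below, a chi-squared filter of the kind the paper compared against can be sketched with scikit-learn. This is only an illustrative sketch: the `k=10` choice and the `X`/`y` names are assumptions, and `chi2` requires non-negative feature values.

```python
from sklearn.feature_selection import SelectKBest, chi2

# keep the 10 highest-scoring features according to the chi-squared statistic
selector = SelectKBest(score_func=chi2, k=10)  # k=10 is an illustrative choice
X_selected = selector.fit_transform(X, y)
```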
For applying the greedy Feature Selection, this implementation used the MLXTEND library, which comes with ready-made functions for several Feature Selection approaches. The following example of Sequential Forward Selection is taken from the MLXTEND documentation.
```python
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# base estimator used by the selector in the documentation example
knn = KNeighborsClassifier(n_neighbors=4)

# greedily add features one at a time until 3 features are selected
sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=0)
sfs1 = sfs1.fit(X, y)
```
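After fitting, the selected column indices can be inspected via `sfs1.k_feature_idx_` and the score of that feature subset via `sfs1.k_score_`.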
In order to make things easier and more reusable, the implementation built a training pipeline for each classifier, putting the processing and tuning steps into that pipeline. The pipeline consisted of:

- Feature Selection
- Feature Scaling
- Training

After that, the pipeline was passed to a GridSearchCV for hyperparameter tuning:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Data will be scaled with Standard Scaler first
scaler = StandardScaler()

# Feature selection will be applied with 10 - 15 features
sfs = SFS(estimator=LogisticRegression(max_iter=3500),
          k_features=10,
          forward=True,
          floating=False,
          scoring='accuracy',
          cv=2)

# Logistic Regression classifier is trained
clf = LogisticRegression()

steps = [
    ('sfs', sfs),
    ('scaler', scaler),
    ('clf', clf)
]
pipeline = Pipeline(steps)

# Hyperparameters for GridSearchCV
param_range_fl = [1.0, 0.5]
grid_params_lr = [{
    'clf__penalty': ['l1', 'l2'],
    'clf__C': param_range_fl,
    'clf__solver': ['liblinear'],
    'sfs__k_features': [10, 15]
}]
```
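The snippet above builds the pipeline and the parameter grid but stops short of running the search. As a minimal sketch, the search itself could be run as follows; the `cv=5` and accuracy scoring settings are illustrative assumptions, and `X_train`/`y_train` are hypothetical names for the training split.

```python
from sklearn.model_selection import GridSearchCV

# search the grid defined above over the whole pipeline;
# cv=5 and accuracy scoring are illustrative choices, not the paper's
grid = GridSearchCV(estimator=pipeline,
                    param_grid=grid_params_lr,
                    scoring='accuracy',
                    cv=5,
                    n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```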
One of the main problems with this dataset was a huge, visible class imbalance: the difference between the two classes was more than 80%, which would cause models to overfit to the majority class regardless of the algorithm used.
There are mainly two solutions for this issue. Both were tested, and one outperformed the other:
- **Synthetic Minority Oversampling Technique (SMOTE)**: This technique resamples the data, either by generating synthetic records for the minority class (oversampling) or by discarding records from the majority class (undersampling). It resulted in a significant improvement in model performance, but there was a better solution.
- **Stratified K-Fold Cross Validation**: Implementing the concept of stratified sampling in cross-validation ensures the training and test sets have the same class proportions as the original dataset. Doing this with the target variable ensures that the cross-validation result is a close approximation of the generalization error. This technique resulted in better performance and removed the need to either create synthetic data or discard records (a sketch of both options follows this list).
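A minimal sketch of both options, assuming a feature matrix `X` and labels `y`; SMOTE comes from the imbalanced-learn package, and the fold count and classifier here are illustrative choices.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Option 1: oversample the minority class with synthetic records
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Option 2 (used here): stratified folds keep the defective/non-defective
# class ratio identical in every train/test split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=3500), X, y, cv=skf)
print(scores.mean())
```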
Before creating the Average Probability Estimator (APE) voting system, each model was trained individually as described above and saved. The following table summarizes each model's performance metrics:
Model | Overall Test Accuracy |
---|---|
Logistic Regression | 82.66% |
Bernoulli Naive Bayes | 77.70% |
Gradient Boost | 82.35% |
Multinomial Naive Bayes | 77.98% |
Random Forest | 80.66% |
Stochastic Gradient Descent | 82.41% |
Support Vector Machines - SVC | 81.20% |
The weighted voting method assigns different weights to the classifiers based on specific criteria and takes a vote of the classifiers according to those weights. In this work, each classifier's weight was chosen based on its accuracy on the testing set.
A weighted ensemble is an extension of a model averaging ensemble where the contribution of each member to the final prediction is weighted by the performance of the model.
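As a toy illustration with invented numbers: if two models predict defect probabilities 0.70 and 0.40 with weights 0.83 and 0.77, the weighted average is (0.83 × 0.70 + 0.77 × 0.40) / (0.83 + 0.77) ≈ 0.56, so the ensemble would vote "defective".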
First, all models were loaded and evaluated on the test set in order to get their weights and build the Weighted Ensemble from them.
```python
import pickle
from sklearn.metrics import accuracy_score

# load models with pickle
bnb = pickle.load(open('models/bnb.sav', 'rb'))
gb = pickle.load(open('models/gb.sav', 'rb'))
lr = pickle.load(open('models/lr.sav', 'rb'))
mnb = pickle.load(open('models/mnb.sav', 'rb'))
rf = pickle.load(open('models/rf.sav', 'rb'))
svc = pickle.load(open('models/svc.sav', 'rb'))
sgd = pickle.load(open('models/sgd.sav', 'rb'))

# collect them into a list of (name, model) tuples
def get_models():
    models = list()
    models.append(('bnb', bnb))
    models.append(('gb', gb))
    models.append(('lr', lr))
    models.append(('mnb', mnb))
    models.append(('rf', rf))
    models.append(('sgd', sgd))
    models.append(('svc', svc))
    return models

# evaluate each base model on the test set
def evaluate_models(models, X_test, y_test):
    scores = list()
    for name, model in models:
        # predict the test set
        yhat = model.predict(X_test)
        # find the accuracy
        acc = accuracy_score(y_test, yhat)
        # store the performance
        scores.append(acc)
    # report model performance
    return scores

# get models
models = get_models()
# get model weights
scores = evaluate_models(models, X_test, y_test)
```
The `scores` variable contains each model's performance on the test set, which will play a key part in the voting later on.
After that, a `VotingClassifier` was created whose estimators were the previously trained models and whose weights were the scores computed above.
```python
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(estimators=models, voting='soft', weights=scores)
```
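The ensemble still needs to be fit before it can predict; `VotingClassifier` refits its estimators on `fit`, and soft voting requires every member to support `predict_proba` (hence `probability=True` for the SVC). A minimal usage sketch, assuming `X_train`/`y_train` is the same training split used earlier:

```python
# Fit the weighted ensemble, then score it on the held-out test set
# (X_train/y_train are assumed names for the training split)
ensemble.fit(X_train, y_train)
yhat = ensemble.predict(X_test)
print('APE test accuracy:', accuracy_score(y_test, yhat))
```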
The aim of this paper was to develop an AI system that helps in Software Defect Prediction by applying advanced techniques in order to achieve the best performance. The paper also compared the proposed APE against other ensemble models, namely W-SVM and Random Forest.
Ensemble Model | Test Accuracy |
---|---|
APE (proposed) | 82.81% |
Random Forest | 81.20% |
Weighted Support Vector Machines | 72.89% |
The paper was presented by Dr. Lahouari Ghouti, Associate Professor of Computer Science at Prince Sultan University. Dr. Lahouari presented the paper to his students in the Software Analytics course (SE480), and the paper was later implemented by one of his students, Mohammed Abed.