The paper was presented by Dr. Lahouari Ghouti at Prince Sultan University in the Software Analytics course (SE480).
The implementation was benchmarked on the Ant 1.7 public software metrics dataset. According to the paper, this dataset achieved the highest performance with the proposed APE.
- Software Defect Prediction using Ensemble Learning on Selected Features
- Table of contents
- Approach
- Ensembles Comparison
- Contribute
The approach for this implementation was to follow the research paper as closely as possible, covering almost all of the presented techniques such as Feature Selection, Stratified K-Fold Cross Validation, and more.
The paper presented an AI system that helps in Software Defect Prediction by applying a Heterogeneous Ensemble.
A Heterogeneous Ensemble is an ML technique in which several models are trained on the same dataset. These models can differ in learning algorithm or hyperparameters, or be identical. After training, a Voting Classifier combines all of the trained models in order to apply voting.
The voting was weighted by the models' performance (a Weighted Average Ensemble), meaning that models with higher weights have more influence on the final vote.
The paper proposed using 7 models for building the voting system. As the paper states, 7 was the optimal number of models for this solution, as using more models negatively impacted performance. Also, the number of models should be odd to avoid tied votes.
The models used were the following:

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression

SVC(probability=True)
MultinomialNB()
BernoulliNB()
RandomForestClassifier()
GradientBoostingClassifier()
SGDClassifier(loss='log')
LogisticRegression()
```
Hyperparameter tuning was done using GridSearchCV to find the best performing configuration for each model.
As the paper title states, Feature Selection was done to increase performance and get rid of unnecessary features. The paper compared different selection approaches such as Fisher, Chi, and Greedy. Based on the results, the Greedy method reported the best results and showed that only a small number of features contributes to the output.
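For contrast with the greedy approach used below, a chi-squared filter of the kind the paper compared against can be sketched with scikit-learn. This is only an illustrative sketch: the `k=10` choice and the `X`/`y` names are assumptions, and `chi2` requires non-negative feature values.

```python
from sklearn.feature_selection import SelectKBest, chi2

# keep the 10 highest-scoring features according to the chi-squared statistic
selector = SelectKBest(score_func=chi2, k=10)  # k=10 is an illustrative choice
X_selected = selector.fit_transform(X, y)
```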
For applying the greedy Feature Selection, this implementation used the MLXTEND library, which comes with ready-made functions for several Feature Selection approaches. The following example of Sequential Forward Selection is taken from the MLXTEND documentation.
```python
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# base estimator used by the selector in the documentation example
knn = KNeighborsClassifier(n_neighbors=4)

# greedily add features one at a time until 3 features are selected
sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=0)
sfs1 = sfs1.fit(X, y)
```
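After fitting, the selected column indices can be inspected via `sfs1.k_feature_idx_` and the score of that feature subset via `sfs1.k_score_`.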
In order to make things easier and more reusable, the implementation built a training pipeline for each classifier, putting the processing and tuning steps into that pipeline. The pipeline consisted of:

- Feature Selection
- Feature Scaling
- Training

After that, the pipeline was passed to a GridSearchCV for hyperparameter tuning:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Data will be scaled with Standard Scaler first
scaler = StandardScaler()

# Feature selection will be applied with 10 - 15 features
sfs = SFS(estimator=LogisticRegression(max_iter=3500),
          k_features=10,
          forward=True,
          floating=False,
          scoring='accuracy',
          cv=2)

# Logistic Regression classifier is trained
clf = LogisticRegression()

steps = [
    ('sfs', sfs),
    ('scaler', scaler),
    ('clf', clf)
]
pipeline = Pipeline(steps)

# Hyperparameters for GridSearchCV
param_range_fl = [1.0, 0.5]
grid_params_lr = [{
    'clf__penalty': ['l1', 'l2'],
    'clf__C': param_range_fl,
    'clf__solver': ['liblinear'],
    'sfs__k_features': [10, 15]
}]
```
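The snippet above builds the pipeline and the parameter grid but stops short of running the search. As a minimal sketch, the search itself could be run as follows; the `cv=5` and accuracy scoring settings are illustrative assumptions, and `X_train`/`y_train` are hypothetical names for the training split.

```python
from sklearn.model_selection import GridSearchCV

# search the grid defined above over the whole pipeline;
# cv=5 and accuracy scoring are illustrative choices, not the paper's
grid = GridSearchCV(estimator=pipeline,
                    param_grid=grid_params_lr,
                    scoring='accuracy',
                    cv=5,
                    n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```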
One of the main problems with this dataset was a huge, visible class imbalance: the difference between the two classes was more than 80%, which would cause models to overfit to the majority class regardless of the algorithm used.
There are mainly two solutions for this issue. Both were tested, and one outperformed the other:
- **Synthetic Minority Oversampling Technique (SMOTE)**: This technique resamples the data, either by generating synthetic records for the minority class (oversampling) or by discarding records from the majority class (undersampling). It resulted in a significant improvement in model performance, but there was a better solution.
- **Stratified K-Fold Cross Validation**: Implementing the concept of stratified sampling in cross-validation ensures the training and test sets have the same class proportions as the original dataset. Doing this with the target variable ensures that the cross-validation result is a close approximation of the generalization error. This technique resulted in better performance and removed the need to either create synthetic data or discard records (a sketch of both options follows this list).
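A minimal sketch of both options, assuming a feature matrix `X` and labels `y`; SMOTE comes from the imbalanced-learn package, and the fold count and classifier here are illustrative choices.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Option 1: oversample the minority class with synthetic records
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Option 2 (used here): stratified folds keep the defective/non-defective
# class ratio identical in every train/test split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=3500), X, y, cv=skf)
print(scores.mean())
```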
Before creating the Average Probability Estimator (APE) voting system, each model was trained individually as described above and saved. The following table summarizes each model's performance metrics:
Model | Overall Test Accuracy |
---|---|
Logistic Regression | 82.66% |
Bernoulli Naive Bayes | 77.70% |
Gradient Boost | 82.35% |
Multinomial Naive Bayes | 77.98% |
Random Forest | 80.66% |
Stochastic Gradient Descent | 82.41% |
Support Vector Machines - SVC | 81.20% |
The weighted voting method assigns different weights to the classifiers based on specific criteria and takes a vote of the classifiers according to those weights. In this work, each classifier's weight was chosen based on its accuracy on the testing set.
A weighted ensemble is an extension of a model averaging ensemble where the contribution of each member to the final prediction is weighted by the performance of the model.
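As a toy illustration with invented numbers: if two models predict defect probabilities 0.70 and 0.40 with weights 0.83 and 0.77, the weighted average is (0.83 × 0.70 + 0.77 × 0.40) / (0.83 + 0.77) ≈ 0.56, so the ensemble would vote "defective".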
First, all models were loaded and evaluated on the test set in order to get their weights and build the Weighted Ensemble from them.
```python
import pickle
from sklearn.metrics import accuracy_score

# load models with pickle
bnb = pickle.load(open('models/bnb.sav', 'rb'))
gb = pickle.load(open('models/gb.sav', 'rb'))
lr = pickle.load(open('models/lr.sav', 'rb'))
mnb = pickle.load(open('models/mnb.sav', 'rb'))
rf = pickle.load(open('models/rf.sav', 'rb'))
svc = pickle.load(open('models/svc.sav', 'rb'))
sgd = pickle.load(open('models/sgd.sav', 'rb'))

# collect them into a list of (name, model) tuples
def get_models():
    models = list()
    models.append(('bnb', bnb))
    models.append(('gb', gb))
    models.append(('lr', lr))
    models.append(('mnb', mnb))
    models.append(('rf', rf))
    models.append(('sgd', sgd))
    models.append(('svc', svc))
    return models

# evaluate each base model on the test set
def evaluate_models(models, X_test, y_test):
    scores = list()
    for name, model in models:
        # predict the test set
        yhat = model.predict(X_test)
        # find the accuracy
        acc = accuracy_score(y_test, yhat)
        # store the performance
        scores.append(acc)
    # report model performance
    return scores

# get models
models = get_models()
# get model weights
scores = evaluate_models(models, X_test, y_test)
```
The `scores` variable contains each model's performance on the test set, which will play a key part in the voting later on.
After that, a `VotingClassifier` was created whose estimators were the previously trained models and whose weights were the scores computed above.
```python
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(estimators=models, voting='soft', weights=scores)
```
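The ensemble still needs to be fit before it can predict; `VotingClassifier` refits its estimators on `fit`, and soft voting requires every member to support `predict_proba` (hence `probability=True` for the SVC). A minimal usage sketch, assuming `X_train`/`y_train` is the same training split used earlier:

```python
# Fit the weighted ensemble, then score it on the held-out test set
# (X_train/y_train are assumed names for the training split)
ensemble.fit(X_train, y_train)
yhat = ensemble.predict(X_test)
print('APE test accuracy:', accuracy_score(y_test, yhat))
```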
The aim of this paper was to develop an AI system that helps in Software Defect Prediction by applying advanced techniques in order to achieve the best performance. The paper also compared the proposed APE against other ensemble models, namely W-SVM and Random Forest.
Ensemble Model | Test Accuracy |
---|---|
APE (proposed) | 82.81% |
Random Forest | 81.20% |
Weighted Support Vector Machines | 72.89% |
The paper was presented by Dr. Lahouari Ghouti, Associate Professor of Computer Science at Prince Sultan University. Dr. Lahouari presented the paper to his students in the Software Analytics course (SE480), and the paper was later implemented by one of his students, Mohammed Abed.