Making the most of small Software Engineering datasets with modern machine learning

Using the pre-trained models in your own script

StackOBERTflow

We released the model and uploaded it to Hugging Face's model hub (StackOBERTflow-comments-small-v1). The model can thus be used directly through the transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM
  
tokenizer = AutoTokenizer.from_pretrained("giganticode/StackOBERTflow-comments-small-v1")
model = AutoModelForMaskedLM.from_pretrained("giganticode/StackOBERTflow-comments-small-v1")

Using pipeline

from transformers import pipeline
from pprint import pprint

COMMENT = "You really should not do it this way, I would use <mask> instead."

fill_mask = pipeline(
    "fill-mask",
    model="giganticode/StackOBERTflow-comments-small-v1",
    tokenizer="giganticode/StackOBERTflow-comments-small-v1"
)

pprint(fill_mask(COMMENT))
# [{'score': 0.019997311756014824,
#   'sequence': '<s> You really should not do it this way, I would use jQuery instead.</s>',
#   'token': 1738},
#  {'score': 0.01693696901202202,
#   'sequence': '<s> You really should not do it this way, I would use arrays instead.</s>',
#   'token': 2844},
#  {'score': 0.013411642983555794,
#   'sequence': '<s> You really should not do it this way, I would use CSS instead.</s>',
#   'token': 2254},
#  {'score': 0.013224546797573566,
#   'sequence': '<s> You really should not do it this way, I would use it instead.</s>',
#   'token': 300},
#  {'score': 0.011984303593635559,
#   'sequence': '<s> You really should not do it this way, I would use classes instead.</s>',
#   'token': 1779}]
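The pipeline returns a list of dictionaries with the keys `score`, `sequence` and `token`, as shown in the output above. The following is a minimal sketch of how such a list could be post-processed; the helper names `strip_special_tokens` and `rank_predictions` are hypothetical and not part of transformers, and the two sample entries are copied from the example output rather than produced by a fresh run.

```python
from typing import Dict, List, Tuple

# Two entries copied from the example output above. In a real script you
# would pass the list returned by fill_mask(COMMENT) instead.
predictions: List[Dict[str, object]] = [
    {"score": 0.01693696901202202,
     "sequence": "<s> You really should not do it this way, I would use arrays instead.</s>",
     "token": 2844},
    {"score": 0.019997311756014824,
     "sequence": "<s> You really should not do it this way, I would use jQuery instead.</s>",
     "token": 1738},
]


def strip_special_tokens(sequence: str) -> str:
    """Remove the <s> and </s> markers that wrap a predicted sequence."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


def rank_predictions(preds: List[Dict[str, object]]) -> List[Tuple[str, float]]:
    """Return (cleaned sequence, score) pairs, highest score first."""
    cleaned = [(strip_special_tokens(str(p["sequence"])), float(p["score"])) for p in preds]
    return sorted(cleaned, key=lambda pair: pair[1], reverse=True)


for text, score in rank_predictions(predictions):
    print(f"{score:.4f}  {text}")
```

Sorting explicitly keeps the helper independent of the order in which predictions arrive.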

Fine-tuned models

Download the zipped model from the release page and extract it to model/path. The following models are available:

from transformers import AutoTokenizer, AutoModelForMaskedLM
  
tokenizer = AutoTokenizer.from_pretrained("model/path")
model = AutoModelForMaskedLM.from_pretrained("model/path")

Tasks

The datasets cover 9 tasks, briefly described in the following.

Sentiment Classification

Classify the sentiment of Software Engineering artifacts (e.g., Stack Overflow posts, app reviews, bug report comments). Each example is classified as having positive, negative or neutral sentiment.

We use the Senti4SD and SentiData datasets.

Examples

Example	Label
I want them to resize based on the length of the data they’re showing.	neutral
When I run my client, it throws the following exception.	negative
This is always a really bad way to design software.	negative
amazing! a must have app	positive

App Review Classification

Classify app reviews into various categories (e.g., rating, feature request, bug report etc.) or detect whether reviews are informative or not. We use the datasets from ARMINER, MAST and CLAP.

Examples

Example	Label
not able to download any pictures please fix these bugs immediately	informative
Best game I’ve played on Android	rating
good but... it has ads...please remove ads from this...	usability

Self-admitted technical debt detection

Detect whether a comment contains self-admitted technical debt (often indicated by FIXME or TODO). We use the dataset by Maldonado et al.

Examples

Example	Label
// FIXME: Is "No Namespace is Empty Namespace" really OK?	positive
// Can return null to represent the bootstrap class loader.	negative

Comment classification

Classify comments according to a pre-defined taxonomy (e.g., usage, license, deprecation, ownership). We rely on the dataset by Pascarella et al.

Examples

Example	Label
// @return a string for throwing	usage
// New button,purpose	summary
// Caller of this method must hold that lock.	rationale

Code-Comment Coherence

Determine whether there is "coherence" between a given method and its corresponding lead comment, that is, whether the comment is descriptive of the method. We use the dataset by Corazza et al.

Example

/**
* Returns the current number of milk units in
* the inventory.
* @return int
*/
public int getMilk() {
  return milk;
}

Label: positive (coherent)

/**
   * Check inventory user interface that processes input.
   */
public static void checkInventory() {
  System.out.println(coffeeMaker.checkInventory());
  mainMenu();
}

Label: negative (incoherent)

Linguistic Smell Detection

Detect linguistic smells in code, that is, misleading identifier names or violations of common naming conventions. Our work is based on the dataset by Fakhoury et al.

Example

public void ToSource(StringBuilder sb) {
  sb.append(";");
  this.NewLine(sb);
}

Label: smelly (transform method does not return)

Code complexity classification

Classify the algorithmic complexity of various algorithm implementations (e.g., O(1), O(n*log(n))). We use the dataset by Sikka et al.

Example

class GFG {
  static int minJumps(int arr[], int n) {
    int[] jumps = new int[n];
    int min;
    jumps[n - 1] = 0;
    for (int i = n - 2; i >= 0; i--) {
      if (arr[i] == 0)
        jumps[i] = Integer.MAX_VALUE;
      else if (arr[i] >= n - i - 1) 
        jumps[i] = 1;
      else { ... }
    }
    return jumps[0];
  }
  public static void main(String[] args) {...}
}

Label: O(n log n)

Code readability classification

Given a piece of code, classify it as either "readable" or "not readable". Our work relies on the dataset by Scalabrino et al.

Example

@Override
public void configure(Configuration cfg) {
  super.configure(cfg);
  cfg.setProperty(Environment.USE_SECOND_LEVEL_CACHE, ...);
  cfg.setProperty(Environment.GENERATE_STATISTICS, ...);
  cfg.setProperty(Environment.USE_QUERY_CACHE, "false" );
  ... // more cfg.setProperty calls
}

Label: readable

Reproducing results

There is a module for each task/dataset. Currently, these are:

Module	Description	Dataset URL
`ar_miner`	Informative app reviews	https://github.com/jinyyy666/AR_Miner/tree/master/datasets
`coherence`	Code-comment coherence	http://www2.unibas.it/gscanniello/coherence/
`comment_classification`	Comment classification	https://zenodo.org/record/2628361
`corcod`	Runtime complexity classification	https://github.com/midas-research/corcod-dataset
`readability`	Code readability classification	https://dibt.unimol.it/report/readability/
`review_classification`	Review classification	https://mast.informatik.uni-hamburg.de/wp-content/uploads/2014/03/REJ_data.zip
`satd`	Self-admitted debt detection	https://github.com/maldonado/tse.satd.data
`senti4sd`	Sentiment analysis on Stack Overflow comments	https://github.com/collab-uniba/Senti4SD
`smell_detection`	Linguistic smell detection	https://github.com/Smfakhoury/SANER-2018-KeepItSimple-

Some of the datasets used (e.g., CLAP) are not publicly available. The datasets were preprocessed and brought into a standard format. If you would like to rerun one of the experiments, please contact one of the authors for the dataset in the correct format. Datasets must be placed in /data/<module>/. Training parameters can be set in /dl4se/config/<module>; dataset loading is handled in /dl4se/datasets/<module>.
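As a rough sketch of the layout described above, the snippet below lists the locations for a module and reports which modules have no data directory yet. It assumes that the leading `/` in the paths refers to the repository root and that each `<module>` entry under `data` is a directory; the helper names `expected_paths` and `missing_data_dirs` are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Dict, List

# Module names taken from the table above.
MODULES = [
    "ar_miner", "coherence", "comment_classification", "corcod",
    "readability", "review_classification", "satd", "senti4sd",
    "smell_detection",
]


def expected_paths(module: str, root: Path = Path(".")) -> Dict[str, Path]:
    """Map each role to the location the README names for a module.

    Assumption: paths are relative to the repository root.
    """
    return {
        "data": root / "data" / module,
        "config": root / "dl4se" / "config" / module,
        "datasets": root / "dl4se" / "datasets" / module,
    }


def missing_data_dirs(root: Path = Path(".")) -> List[str]:
    """Return the modules whose data directory does not exist under root."""
    return [m for m in MODULES if not expected_paths(m, root)["data"].is_dir()]


if __name__ == "__main__":
    for name in missing_data_dirs():
        print(f"no dataset found for module: {name}")
```

Running this from the repository root before starting an experiment shows which datasets still need to be requested from the authors.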

Additional configuration parameters can be passed on the command line. See the config.py file of the corresponding module for a list of possible command line options.

To run an experiment execute the following:

python -m dl4se.experiments.<module>.default --seeds 100 200 300 400 500 --out_file=result_file.csv
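To run several modules in a row, the command line can be assembled programmatically. The sketch below only builds the argument list and uses only the `--seeds` and `--out_file` options shown in the command; the function name `build_command` and the output file name are illustrative assumptions.

```python
import sys
from typing import List, Sequence


def build_command(module: str, seeds: Sequence[int], out_file: str) -> List[str]:
    """Build the argument list for one experiment run.

    Only the options shown in the README command are used; other options
    are listed in the module's config.py.
    """
    return [
        sys.executable, "-m", f"dl4se.experiments.{module}.default",
        "--seeds", *[str(s) for s in seeds],
        f"--out_file={out_file}",
    ]


cmd = build_command("satd", [100, 200, 300, 400, 500], "satd_results.csv")
print(" ".join(cmd))

# To launch the run from the repository root (assumes the dataset is in place):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the output path contains spaces.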
