This dataset, the PSC dataset, accompanies the paper "An improved CNN model for within-project software defect prediction". It targets defect prediction settings in which both source code and labeled data are needed, so that new features can be calculated from the source code. Feel free to use our dataset; contributions to it are welcome.

Instructions:

  1. source code.zip: This file contains the source code of 41 versions of 12 projects.

  2. labeled data.zip: This file contains the original labeled data from the PROMISE repository. Each CSV file lists 20 traditional features.

  3. embedded data.zip: This file contains the AST sequences of the source files, extracted using the strategies described in the paper (.restrictedcontent), and the corresponding embedded integer sequences, padded with 0 (.embed).

  4. mapped data.zip: This file contains the path of each source file (currently an absolute path, which is to be fixed in the future) and the defect information for that file.
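
To illustrate what "embedded integer sequences padded with 0" means for item 3, here is a minimal Python sketch. It is not the code used to build the dataset: the token names, the helper functions `build_vocab` and `embed_and_pad`, and the choice to pad at the end of each sequence are assumptions made for illustration. The actual extraction strategy is described in the paper, and the on-disk layout of the .embed files is not shown here.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: index 0 is reserved for padding


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Give each distinct token an integer ID starting at 1 (0 is padding)."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for token in seq:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def embed_and_pad(sequences: List[List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Map tokens to IDs and right-pad every sequence to the longest length."""
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        ids = [vocab[token] for token in seq]
        padded.append(ids + [PAD_ID] * (max_len - len(ids)))
    return padded


# Hypothetical AST node sequences for two source files.
example = [
    ["MethodDeclaration", "IfStatement", "MethodInvocation"],
    ["MethodDeclaration", "ReturnStatement"],
]
vocab = build_vocab(example)
padded = embed_and_pad(example, vocab)
print(padded)  # [[1, 2, 3], [1, 4, 0]]
```

Reserving 0 for padding keeps it distinct from every real token ID, so a model can mask or ignore the padded positions.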

Download via Dropbox:

  1. source file.zip: https://www.dropbox.com/s/ne9pd2mq4xz93rq/source%20file.zip?dl=0

  2. embedded data.zip: https://www.dropbox.com/s/b429d9icad0xc3j/embedded%20data.zip?dl=0

  3. mapped data.zip: https://www.dropbox.com/s/ez3rljc77ej05w9/mapped%20data.zip?dl=0

  4. labeled data.zip: https://www.dropbox.com/s/kxwqy3u678lz1x2/labeled%20data.zip?dl=0