====================
The aim of this project was to build a model that predicts whether a question posted on [stackoverflow.com](http://stackoverflow.com/) will be closed.

Stack Overflow is a website where users can ask questions on various topics in computer programming, answer other users’ questions, and earn points and badges by actively participating in the community. To keep out low-quality questions, Stack Overflow has used a question-closing mechanism since 2013, which allows experienced community members to mark a question as closed if they judge it unfit for the website. A question can be closed for five reasons:
- duplicate,
- off-topic,
- unclear-what-you’re-asking,
- too broad and
- primarily-opinion-based.
The project workflow consisted of the following steps:
- Collecting the data relevant for the project using the [Stack Exchange API](https://api.stackexchange.com/)
- Processing the collected data and adding features for classification
- Creating the dataset
- Applying machine learning techniques for classification
- Evaluating classification results

Collecting data
===============

For the purposes of the project, 500 closed and 500 not closed questions were collected through the Stack Exchange API. The questions were collected using the /search method with the following parameters:
- fromDate: 1404172800 (1 July 2014)
- toDate: 1419984000 (31 December 2014)
- closed: true for closed questions, false for not closed questions
- filter: withBody, in order to get the bodies of the questions
- accessToken and key, obtained by registering with the API, in order to increase the daily request quota.

The results were saved to files closedQuestions.json and notClosedQuestions.json.
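
The request parameters above can be sketched in Java as follows. This is a minimal illustration, not the project's actual code: the lowercase parameter spellings (`fromdate`, `todate`, `withbody`) and the `site` parameter are assumptions about how the API expects them, the class and method names are hypothetical, and the credentials are left out.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class SearchRequestSketch {

    // Converts a calendar date to Unix epoch seconds at 00:00 UTC.
    static long toEpochSeconds(int year, int month, int day) {
        return LocalDate.of(year, month, day).atStartOfDay(ZoneOffset.UTC).toEpochSecond();
    }

    // Builds the query string for one request; 'closed' selects which group of questions to fetch.
    static String buildQuery(boolean closed) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("fromdate", String.valueOf(toEpochSeconds(2014, 7, 1)));
        params.put("todate", String.valueOf(toEpochSeconds(2014, 12, 31)));
        params.put("closed", String.valueOf(closed));
        params.put("filter", "withbody");
        params.put("site", "stackoverflow"); // assumption: not listed in the text above

        // All values are URL-safe here, so no percent-encoding is needed.
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(entry.getKey() + "=" + entry.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(true));  // query for closed questions
        System.out.println(buildQuery(false)); // query for not closed questions
    }
}
```

The date conversion reproduces the two timestamps listed above, which confirms that they denote midnight UTC on 1 July 2014 and 31 December 2014.
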

Adding features
===============

In the next step, features for classification were added to each question. The features can be divided into four groups:

Group | Name | Features |
---|---|---|
A | User Profile | age_of_account, badge_score, posts_with_negative_score |
B | Community Process | post_score, accepted_answer_score, comment_score |
C | Question Content | number_of_urls, number_of_stackoverflow_urls |
D | Textual Style | title_length, body_length, number_of_tags, number_of_punctuation_marks, number_of_short_words, number_of_special_characters, number_of_lower_case_characters, number_of_upper_case_characters, code_snippet_length |
Features of group A are related to user’s profile and participation activities in the community, whereas features of group B are based on contributions to the community in the form of votes, answers, etc. Group C contains features related to question content, and features of group D describe the textual style of the question title and body. Most of the features are self-describing, although some of them require further explanation:
- Badge score

  Let {b1, …, bn} be the badges earned by the user. Then:
- Post score

  Let {q1, …, qn} be the set of questions asked by the user, and {a1, …, am} the set of answers posted by the user. Then:
- Comment score

  Let {c1, …, cn} be the comments posted by the user. Then:
- Accepted answer score

  Let {a1, …, an} be the set of answers posted by the user which have been accepted. Each accepted answer has a score of 15, therefore the accepted answer score is 15 · n.

The following API methods were used to collect the necessary data:
- [/users/{ids}](http://api.stackexchange.com/docs/users-by-ids) – returns data about the user with the requested id
- /users/{ids}/badges – returns the badges owned by the user with the requested id
- /badges/{ids} – returns data about badge with the requested id
- /users/{ids}/questions – returns the questions that the requested user posted
- /users/{ids}/answers – returns the answers that the requested user posted
- /users/{ids}/comments – returns the comments that the requested user posted
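
As a small illustration of how these paths are formed, the sketch below builds the user-related paths for one or more ids. The class and method names are hypothetical, and joining several ids with semicolons is an assumption about the API's `{ids}` convention rather than something stated in this document.

```java
import java.util.ArrayList;
import java.util.List;

class UserEndpointsSketch {

    // Joins several ids into the {ids} path segment (assumed to be semicolon-delimited).
    static String joinIds(long... ids) {
        List<String> parts = new ArrayList<>();
        for (long id : ids) {
            parts.add(String.valueOf(id));
        }
        return String.join(";", parts);
    }

    // Returns the relative paths of the user-related methods listed above.
    static List<String> pathsForUsers(long... ids) {
        String idSegment = joinIds(ids);
        String[] suffixes = {"", "/badges", "/questions", "/answers", "/comments"};
        List<String> paths = new ArrayList<>();
        for (String suffix : suffixes) {
            paths.add("/users/" + idSegment + suffix);
        }
        return paths;
    }

    public static void main(String[] args) {
        // 42 and 43 are placeholder user ids.
        for (String path : pathsForUsers(42L, 43L)) {
            System.out.println(path);
        }
    }
}
```
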
After adding the features, the questions were saved to files closedQuestionsWithFeatures.json and notClosedQuestionsWithFeatures.json.
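
Several of the group C and D features are simple counts over the question text. The sketch below shows one possible way to compute a few of them. It is not the project's code: the URL pattern, the definition of a short word (three characters or fewer) and measuring lengths in characters are all assumptions, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextFeaturesSketch {

    // Assumption: a URL is anything starting with http:// or https:// up to the next whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+");
    // Assumption: a "short word" has at most three characters.
    private static final int SHORT_WORD_MAX_LENGTH = 3;

    static Map<String, Integer> extract(String title, String body) {
        Map<String, Integer> features = new LinkedHashMap<>();
        features.put("title_length", title.length());
        features.put("body_length", body.length());

        // Count all URLs and, separately, those pointing to stackoverflow.com.
        int urls = 0;
        int stackoverflowUrls = 0;
        Matcher matcher = URL.matcher(body);
        while (matcher.find()) {
            urls++;
            if (matcher.group().contains("stackoverflow.com")) {
                stackoverflowUrls++;
            }
        }
        features.put("number_of_urls", urls);
        features.put("number_of_stackoverflow_urls", stackoverflowUrls);

        // Count lower-case and upper-case characters.
        int lower = 0;
        int upper = 0;
        for (char c : body.toCharArray()) {
            if (Character.isLowerCase(c)) {
                lower++;
            } else if (Character.isUpperCase(c)) {
                upper++;
            }
        }
        features.put("number_of_lower_case_characters", lower);
        features.put("number_of_upper_case_characters", upper);

        // Count whitespace-separated tokens that qualify as short words.
        int shortWords = 0;
        for (String word : body.trim().split("\\s+")) {
            if (!word.isEmpty() && word.length() <= SHORT_WORD_MAX_LENGTH) {
                shortWords++;
            }
        }
        features.put("number_of_short_words", shortWords);
        return features;
    }

    public static void main(String[] args) {
        String body = "Why is X so SLOW? See http://stackoverflow.com/q/1 and http://example.com/a";
        System.out.println(extract("Slow loop", body));
    }
}
```
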

Creating the dataset
====================

The next step was to create the dataset from the collected questions using the Weka API. The dataset contains 18 attributes: 17 are numeric (the features), and the 18th is the class attribute, with possible values closed or not_closed, whose value the program aims to predict. The dataset was saved to the file dataSet.arff, and later divided into two datasets: one for training, with 80% of the data (trainingSet.arff), and the other for testing, with 20% of the data (testSet.arff).
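
To make the file layout and the 80/20 division concrete, here is a plain-Java sketch that writes a minimal ARFF document and splits rows into training and test parts. The project itself used the Weka API for this step; the class name, method names and sample rows below are hypothetical, and splitting each class separately is an assumption, made because the confusion matrices in the evaluation section show 100 test instances per class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class DatasetSketch {

    // Renders a minimal ARFF document: numeric feature attributes plus a nominal class attribute.
    static String toArff(String relation, List<String> featureNames, List<String> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : featureNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        sb.append("@attribute class {closed,not_closed}\n\n@data\n");
        for (String row : rows) {
            sb.append(row).append('\n');
        }
        return sb.toString();
    }

    // Shuffles a copy of the rows with a fixed seed and returns [training, test].
    static List<List<String>> split(List<String> rows, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("title_length", "body_length");
        List<String> closed = Arrays.asList("12,340,closed", "8,95,closed", "30,60,closed", "15,20,closed", "9,400,closed");
        List<String> open = Arrays.asList("25,800,not_closed", "40,1200,not_closed", "18,650,not_closed", "22,900,not_closed", "35,700,not_closed");

        // Split each class separately so both parts keep the same class balance.
        List<List<String>> closedParts = split(closed, 0.8, 1L);
        List<List<String>> openParts = split(open, 0.8, 1L);

        List<String> training = new ArrayList<>(closedParts.get(0));
        training.addAll(openParts.get(0));
        List<String> test = new ArrayList<>(closedParts.get(1));
        test.addAll(openParts.get(1));

        System.out.println(toArff("training", features, training));
        System.out.println(toArff("test", features, test));
    }
}
```
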

Applying machine learning techniques for classification
=======================================================

The dataset was first loaded from the .arff file. Because it contained numeric attributes, it had to be discretized, which was done using the Weka Discretize filter. A FilteredClassifier was then built from the Discretize filter and one of the classifiers. Three classifiers were used for classification:
- Naive Bayes
- Support Vector Machines
- Logistic Regression
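
Discretization replaces each numeric value with the index of the interval it falls into. The text does not state which variant or settings of the Weka Discretize filter were used, so the sketch below only illustrates the general idea with equal-width binning, one common unsupervised strategy. It is plain Java with a hypothetical class name, not Weka's implementation.

```java
import java.util.Arrays;

class EqualWidthDiscretizerSketch {

    // Returns, for each value, the index (0..bins-1) of the equal-width interval it falls into.
    static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            if (width == 0) {
                result[i] = 0; // all values are identical, so there is a single interval
                continue;
            }
            int bin = (int) ((values[i] - min) / width);
            result[i] = Math.min(bin, bins - 1); // the maximum value belongs to the last interval
        }
        return result;
    }

    public static void main(String[] args) {
        double[] bodyLengths = {0, 5, 10, 20}; // placeholder feature values
        System.out.println(Arrays.toString(discretize(bodyLengths, 4)));
    }
}
```
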

Evaluation of the results
=========================

All the classifiers were evaluated first on the training dataset and then on the test dataset. Their results were as follows; each confusion matrix covers the 200 instances of the test dataset.

Naive Bayes
DataSet | Correctly classified instances % | Precision | Recall | F1 |
---|---|---|---|---|
Training | 82.875 | 0.829 | 0.829 | 0.829 |
Test | 77 | 0.771 | 0.77 | 0.77 |
Confusion matrix:
a | b | <-- classified as |
---|---|---|
74 | 26 | a (closed) |
20 | 80 | b (not_closed) |
Support Vector Machines
DataSet | Correctly classified instances % | Precision | Recall | F1 |
---|---|---|---|---|
Training | 96.875 | 0.969 | 0.969 | 0.969 |
Test | 86.5 | 0.869 | 0.865 | 0.865 |
Confusion matrix:
a | b | <-- classified as |
---|---|---|
81 | 19 | a (closed) |
8 | 92 | b (not_closed) |
Logistic Regression
DataSet | Correctly classified instances % | Precision | Recall | F1 |
---|---|---|---|---|
Training | 100 | 1 | 1 | 1 |
Test | 82.5 | 0.825 | 0.825 | 0.825 |
Confusion matrix:
a | b | <-- classified as |
---|---|---|
82 | 18 | a (closed) |
17 | 83 | b (not_closed) |

The Logistic Regression classifier had the best results on the training data, with 100% correctly classified instances. On the test dataset, Support Vector Machines performed best, with 86.5% correctly classified instances.
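
The reported test metrics can be recomputed from the confusion matrices. The sketch below does this for the Naive Bayes matrix, treating rows as actual classes and columns as predicted classes, and averaging the per-class values weighted by the number of actual instances of each class. The class and method names are hypothetical, and the weighted averaging is an assumption about how the summary values were produced.

```java
class ConfusionMatrixMetrics {

    // Share of instances on the diagonal (matrix[actual][predicted]).
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    // Precision of class c: correct predictions of c divided by all predictions of c (column sum).
    static double precision(int[][] m, int c) {
        int predicted = 0;
        for (int[] row : m) {
            predicted += row[c];
        }
        return predicted == 0 ? 0.0 : (double) m[c][c] / predicted;
    }

    // Recall of class c: correct predictions of c divided by all actual instances of c (row sum).
    static double recall(int[][] m, int c) {
        int actual = 0;
        for (int v : m[c]) {
            actual += v;
        }
        return actual == 0 ? 0.0 : (double) m[c][c] / actual;
    }

    // Harmonic mean of precision and recall for class c.
    static double f1(int[][] m, int c) {
        double p = precision(m, c);
        double r = recall(m, c);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Returns {precision, recall, F1}, each averaged over classes weighted by class size.
    static double[] weightedAverages(int[][] m) {
        double total = 0;
        double[] sums = new double[3];
        for (int c = 0; c < m.length; c++) {
            int support = 0;
            for (int v : m[c]) {
                support += v;
            }
            total += support;
            sums[0] += support * precision(m, c);
            sums[1] += support * recall(m, c);
            sums[2] += support * f1(m, c);
        }
        return new double[] {sums[0] / total, sums[1] / total, sums[2] / total};
    }

    public static void main(String[] args) {
        int[][] naiveBayes = {{74, 26}, {20, 80}}; // test-set matrix from the table above
        double[] avg = weightedAverages(naiveBayes);
        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f%n",
                accuracy(naiveBayes), avg[0], avg[1], avg[2]);
    }
}
```

Applied to the three matrices, this calculation gives accuracies of 77%, 86.5% and 82.5%, which agree with the Test rows of the tables.
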

Technical realisation
=====================

The application was written in the Java programming language, using the Eclipse Juno IDE. The following libraries were used:
- gson-2.2.4.jar – Java library used to convert Java objects into their JSON representation, and vice versa.
- weka-3.7.3.jar – Java library containing Weka's collection of machine learning algorithms, used to create the predictive classification model.
- [httpclient-4.5](https://hc.apache.org/downloads.cgi) – Apache Java library for sending HTTP requests.

Acknowledgements
================

The project was developed as part of the project assignment for the course Intelligent Systems at the Faculty of Organizational Sciences, University of Belgrade, Serbia. Ideas and guidelines for the project were found in the paper [Fit or Unfit: Analysis and Prediction of ‘Closed Questions’](http://arxiv.org/abs/1307.7291).