StackExchangePredicativeModel

About the project
=================

The aim of this project was to build a model that predicts whether a question posted on [stackoverflow.com](http://stackoverflow.com/) will be closed.

Stack Overflow is a website where users can ask questions on various topics in computer programming, answer other users’ questions, and earn points and badges by actively participating in the community. To keep out low-quality questions, Stack Overflow has used a question-closing mechanism since 2013, which allows experienced community members to mark a question as closed if they judge it unfit for the website. A question can be closed for one of five reasons:

- duplicate
- off-topic
- unclear-what-you’re-asking
- too broad
- primarily-opinion-based

The project workflow consisted of the following steps:

1. Collecting the data relevant for the project using the [StackExchangeAPI](https://api.stackexchange.com/)
2. Processing the collected data and adding features for classification
3. Creating the dataset
4. Applying machine learning techniques for classification
5. Evaluating the classification results

Collecting data
===============

For the purposes of the project, 500 closed questions and 500 not closed questions were collected through the StackExchangeAPI. The questions were collected using the /search method with the following parameters:

- fromDate: 1404172800 (1 July 2014, 00:00 UTC)
- toDate: 1419984000 (31 December 2014, 00:00 UTC)
- closed: true for closed questions, false for not closed questions
- filter: withBody, to include the question bodies in the response
- accessToken and key: obtained by registering with the API, to increase the daily request quota
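The date parameters are Unix timestamps in seconds. As a minimal sketch of how such values can be derived, the following Java snippet converts a calendar date to epoch seconds at midnight UTC. The class and method names are illustrative and not taken from the project's code.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

public class EpochDates {
    // Converts a calendar date to Unix epoch seconds at 00:00:00 UTC.
    static long toEpochSeconds(int year, int month, int day) {
        return LocalDate.of(year, month, day)
                .atStartOfDay(ZoneOffset.UTC)
                .toEpochSecond();
    }

    public static void main(String[] args) {
        System.out.println(toEpochSeconds(2014, 7, 1));   // prints 1404172800
        System.out.println(toEpochSeconds(2014, 12, 31)); // prints 1419984000
    }
}
```

Both results match the values listed for the two date parameters.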

The results were saved to files closedQuestions.json and notClosedQuestions.json.

Adding features
===============

In the next step, classification features were added to each question. The features can be divided into four groups:

| Group | Name | Features |
| --- | --- | --- |
| A | User Profile | age_of_account, badge_score, posts_with_negative_score |
| B | Community Process | post_score, accepted_answer_score, comment_score |
| C | Question Content | number_of_urls, number_of_stackoverflow_urls |
| D | Textual Style | title_length, body_length, number_of_tags, number_of_punctuation_marks, number_of_short_words, number_of_special_characters, number_of_lower_case_characters, number_of_upper_case_characters, code_snippet_length |

Features of group A relate to the user’s profile and participation in the community, whereas features of group B are based on the user’s contributions to the community in the form of votes, answers, etc. Group C contains features related to the question content, and features of group D describe the textual style of the question title and body. Most of the features are self-explanatory, although some of them require further explanation:

Badge score

Let {b1 , … , bn} be the badges earned by the user. Then:

Post score

Let {q1 , … , qn} be the set of questions asked by the user, and {a1 , … , am} the set of answers posted by the user. Then:

Comment score

Let {c1 , … , cn} be the comments posted by the user. Then:

Accepted answer score

Let {a1 , … , an} be the set of answers posted by the user which have been accepted. Each accepted answer has a score of 15, therefore:

`accepted_answer_score = 15 · n`

The following API methods were used to collect the necessary data:

- [/users/{ids}](http://api.stackexchange.com/docs/users-by-ids) – returns data about the user with the requested id
- /users/{ids}/badges – returns the badges owned by the user with the requested id
- /badges/{ids} – returns data about the badge with the requested id
- /users/{ids}/questions – returns the questions that the requested user posted
- /users/{ids}/answers – returns the answers that the requested user posted
- /users/{ids}/comments – returns the comments that the requested user posted

After adding the features, the questions were saved to files closedQuestionsWithFeatures.json and notClosedQuestionsWithFeatures.json.
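As an illustration of how a few of the group C and D features could be computed from a question's text, here is a minimal Java sketch. The URL pattern, the class name and the method names are assumptions and are not taken from the project's code; the project may define these features differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextFeatures {
    // Assumed URL pattern: http:// or https:// followed by any run of
    // characters other than whitespace, quotes or angle brackets.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // number_of_urls: how many URLs appear in the text.
    static int countUrls(String text) {
        Matcher m = URL.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // number_of_stackoverflow_urls: URLs that mention stackoverflow.com.
    static int countStackOverflowUrls(String text) {
        Matcher m = URL.matcher(text);
        int n = 0;
        while (m.find()) {
            if (m.group().contains("stackoverflow.com")) {
                n++;
            }
        }
        return n;
    }

    // number_of_upper_case_characters
    static int countUpperCase(String text) {
        return (int) text.chars().filter(Character::isUpperCase).count();
    }

    // number_of_lower_case_characters
    static int countLowerCase(String text) {
        return (int) text.chars().filter(Character::isLowerCase).count();
    }

    public static void main(String[] args) {
        String body = "See http://stackoverflow.com/q/1 and https://example.org for details.";
        System.out.println("number_of_urls = " + countUrls(body));
        System.out.println("number_of_stackoverflow_urls = " + countStackOverflowUrls(body));
        System.out.println("number_of_upper_case_characters = " + countUpperCase(body));
    }
}
```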

Creating the dataset
====================

The next step was to create the dataset from the collected questions using the Weka API. The dataset contains 18 attributes: 17 numeric attributes (the features) and the class attribute, with possible values closed and not_closed, which the program aims to predict. The dataset was saved to the file dataSet.arff and later split into two datasets: one for training, with 80% of the data (trainingSet.arff), and one for testing, with 20% of the data (testSet.arff).
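To make the dataset layout concrete, the sketch below builds an ARFF header with 17 numeric attributes and a nominal class attribute, and computes the size of an 80/20 split. It uses only the Java standard library rather than the Weka API the project used. The relation name, the attribute order, the class attribute name `class` and all Java identifiers are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ArffSketch {
    // The 17 numeric features from the feature table (order is an assumption).
    static final List<String> FEATURES = Arrays.asList(
        "age_of_account", "badge_score", "posts_with_negative_score",
        "post_score", "accepted_answer_score", "comment_score",
        "number_of_urls", "number_of_stackoverflow_urls",
        "title_length", "body_length", "number_of_tags",
        "number_of_punctuation_marks", "number_of_short_words",
        "number_of_special_characters", "number_of_lower_case_characters",
        "number_of_upper_case_characters", "code_snippet_length");

    // Builds the ARFF header: relation name, numeric attributes, nominal class.
    static String buildHeader(String relationName) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relationName).append("\n\n");
        for (String feature : FEATURES) {
            sb.append("@attribute ").append(feature).append(" numeric\n");
        }
        sb.append("@attribute class {closed,not_closed}\n\n");
        sb.append("@data\n");
        return sb.toString();
    }

    // Number of instances assigned to the training set in an 80/20 split.
    static int trainingSize(int totalInstances) {
        return (int) Math.round(totalInstances * 0.8);
    }

    public static void main(String[] args) {
        System.out.print(buildHeader("questions"));
        int total = 1000; // 500 closed + 500 not closed questions
        int train = trainingSize(total);
        System.out.println("% training: " + train + ", test: " + (total - train));
    }
}
```

With 1000 collected questions, this split gives 800 training and 200 test instances.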

Applying machine learning techniques for classification
=======================================================

The dataset was first loaded from the .arff file. Because it contained numeric attributes, it needed to be discretized, which was done using the Weka Discretize filter. A FilteredClassifier was then built from the Discretize filter and one of the classifiers. Three classifiers were used for classification:

- Naive Bayes
- Support Vector Machines
- Logistic Regression

Evaluation of the results
=========================

All the classifiers were evaluated first on the training dataset and then on the test dataset. Their results were as follows:

Naive Bayes

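| Dataset | Correctly classified instances (%) | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Training | 82.875 | 0.829 | 0.829 | 0.829 |
| Test | 77 | 0.771 | 0.77 | 0.77 |

Confusion matrix:

| Actual class | Classified as a | Classified as b |
| --- | --- | --- |
| a (closed) | 74 | 26 |
| b (not_closed) | 20 | 80 |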

Support Vector Machines

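| Dataset | Correctly classified instances (%) | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Training | 96.875 | 0.969 | 0.969 | 0.969 |
| Test | 86.5 | 0.869 | 0.865 | 0.865 |

Confusion matrix:

| Actual class | Classified as a | Classified as b |
| --- | --- | --- |
| a (closed) | 81 | 19 |
| b (not_closed) | 8 | 92 |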

Logistic Regression

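| Dataset | Correctly classified instances (%) | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Training | 100 | 1 | 1 | 1 |
| Test | 82.5 | 0.825 | 0.825 | 0.825 |

Confusion matrix:

| Actual class | Classified as a | Classified as b |
| --- | --- | --- |
| a (closed) | 82 | 18 |
| b (not_closed) | 17 | 83 |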

The Logistic Regression classifier had the best results on the training data, with 100% correctly classified instances. On the test dataset, Support Vector Machines performed best, with 86.5% correctly classified instances.
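The reported test-set figures can be re-derived from the confusion matrices above, each of which sums to 200 instances (20% of the 1000 questions). The sketch below assumes that the reported precision, recall and F1 are averages over the two classes weighted by class size, which is how Weka's summary output usually presents them. The class and method names are illustrative only.

```java
import java.util.Locale;
import java.util.function.IntToDoubleFunction;

public class ConfusionMetrics {
    // m[i][j] = number of instances of actual class i classified as class j.

    static double accuracy(int[][] m) {
        int correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return (double) correct / total;
    }

    // Precision of class c: correct predictions of c / all predictions of c.
    static double precision(int[][] m, int c) {
        int predicted = 0;
        for (int[] row : m) predicted += row[c];
        return (double) m[c][c] / predicted;
    }

    // Recall of class c: correct predictions of c / all actual instances of c.
    static double recall(int[][] m, int c) {
        int actual = 0;
        for (int v : m[c]) actual += v;
        return (double) m[c][c] / actual;
    }

    static double f1(int[][] m, int c) {
        double p = precision(m, c), r = recall(m, c);
        return 2 * p * r / (p + r);
    }

    // Averages a per-class metric, weighting each class by its actual size.
    static double weighted(int[][] m, IntToDoubleFunction perClass) {
        int total = 0;
        double sum = 0;
        for (int c = 0; c < m.length; c++) {
            int actual = 0;
            for (int v : m[c]) actual += v;
            total += actual;
            sum += actual * perClass.applyAsDouble(c);
        }
        return sum / total;
    }

    public static void main(String[] args) {
        int[][] naiveBayes = {{74, 26}, {20, 80}};
        System.out.println(String.format(Locale.ROOT,
            "accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f",
            accuracy(naiveBayes),
            weighted(naiveBayes, c -> precision(naiveBayes, c)),
            weighted(naiveBayes, c -> recall(naiveBayes, c)),
            weighted(naiveBayes, c -> f1(naiveBayes, c))));
        // prints: accuracy=0.770 precision=0.771 recall=0.770 f1=0.770
    }
}
```

For the Naive Bayes matrix this reproduces the Test row of the corresponding table: 77% correctly classified, precision 0.771, recall 0.77 and F1 0.77.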

Technical realisation
=====================

The application was written in Java, using the Eclipse Juno IDE. The following libraries were used:

- gson-2.2.4.jar – Java library used to convert Java objects into their JSON representation, and vice versa
- weka-3.7.3.jar – Java library with the collection of machine learning algorithms included in Weka, used to create the predictive classification model
- [httpclient-4.5](https://hc.apache.org/downloads.cgi)

Acknowledgements
================

The project was developed as a project assignment for the course Intelligent Systems at the Faculty of Organizational Sciences, University of Belgrade, Serbia. Ideas and guidelines for the project were taken from the paper [Fit or Unfit : Analysis and Prediction of ‘Closed Questions’](http://arxiv.org/abs/1307.7291).
