Resolverflow

Stackoverflow, a programmer's best friend. Well, if you get an answer, and if it is a useful one.

We will perform a data analysis on the StackOverflow dataset to find out how you can best formulate your question. We aim to find features that will help you get a resolution as quick as possible. Features will be ordinal and categorical, taken from the literal dataset values and but also from some custom NLP. Let's make some Stackoverflow clickbait 😎

Dataset

The https://archive.org/download/stackexchange dataset is uploaded to an HDFS. convert_dataset.py converts the xml dataset into a .parquet file that is already partitioned. This way, loading is a lot faster. Parquet structures of all generates file can be found at https://github.com/WeersProductions/resolverflow/blob/master/dataFramePreviews.md .

Project overview

The project is divided into three folders:

features
analysis
analysis/local
util

Features

Responsible for collecting features from the big data set of StackOverflow. Uses spark to fetch the features. Each file contains a group of features and can be spark-submitted on its own to gather these features. However, to run all features at once and combine them into one resulting dataset, run_all.py can be used. Users can define what feature groups should be extracted and it will combine those automatically.

To add a feature, create a new file and add your function definition. It should receive a spark context that can be used to interact with the Spark cluster. This method should return a dataframe at least one column: _Id. _Id is the Id of the post. Note: if you are using PostHistory.parquet as a source for data, be sure to use _PostId and rename the column to _Id.

Analysis

Responsible for analyzing the features after feature collection has been done. This reads from a output_stackoverflow.parquet file which contain the extracted features.

correlation.py
Calculates the correlation between a feature and the label.
decision_tree.py
Contains code to train and evaluate a decision tree (whether as classifier or regressor). Features that should be used can be selected.
swashbuckler.py
Bucketizes the input to be used for graphs.
vif.py
Used to remove features that have a too high VIF. Calculates the vif of pairs of features and also calculates the VIF while removing a single feature from all features.

Analaysis/local

To be run on a local machine. This uses .pickle files (small data) and can generate plots.

qq_plots_plot.py Generates qq plots for features. Different distributions can be plotted against a feature.
swashbuckler_plot.py Generates histograms of a feature for both resolved and unresolved questions.

Util

Utility scripts. Used to e.g. convert parquet files to pickle files, or to join several .parquet files together into a single .parquet file.