Big Data project that analyzes forum questions on Stack Exchange and predicts whether a question will receive an accepted answer, using the Stack Exchange Data Dump from archive.org.

Forum Question Analyzer

This project uses Big Data techniques to analyze questions from a Stack Exchange forum. Using the Stack Exchange Data Dump from archive.org, it predicts whether a question will receive an accepted answer.

Data Source

The data comes from the Stack Exchange Data Dump, which is hosted on archive.org. Our project uses the data for the TeX forum (tex.stackexchange.com).
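
As a minimal sketch of how the prediction target could be derived from the dump, the snippet below streams a Posts.xml file and labels each question by whether it carries an AcceptedAnswerId attribute. It assumes the usual dump layout, where every post is a `<row>` element and questions have PostTypeId="1". The function name `label_questions` is illustrative, and the notebooks may build the label differently (for example with Spark).

```python
import xml.etree.ElementTree as ET


def label_questions(posts_xml):
    """Yield (question_id, has_accepted_answer) for each question row.

    Assumes the Stack Exchange dump layout: one <row .../> per post,
    PostTypeId="1" for questions, and AcceptedAnswerId present only
    when an answer was accepted. `posts_xml` is a path or a binary
    file-like object.
    """
    # iterparse streams the file, so a large Posts.xml is not read in one go.
    for _event, elem in ET.iterparse(posts_xml, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            has_accepted = elem.get("AcceptedAnswerId") is not None
            yield int(elem.get("Id")), has_accepted
        elem.clear()  # drop the attributes of rows we have already handled


# Hypothetical usage, following the folder name from the Setup section:
# labels = dict(label_questions("tex.stackexchange.com/Posts.xml"))
```

Streaming with the standard library keeps the example dependency-free; for the full dump a distributed reader is likely the better fit.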

Setup

  1. Download the Stack Exchange Data Dump from archive.org.
  2. Extract the data dump into the tex.stackexchange.com folder.
  3. Install Python 3.8 or higher.
  4. Install Spark (3.5.0 recommended).
  5. Install the required dependencies: pip install -r requirements.txt.
  6. Run jupyter notebook and open analysis.ipynb, features.ipynb or statistics.ipynb to see the results of our analysis.
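
Before opening the notebooks, it can help to confirm that step 2 put the dump files where they are expected. The sketch below is one way to do that. The file names listed are ones a Stack Exchange site dump normally contains; which of them the notebooks actually read is an assumption, and `missing_dump_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# File names that a Stack Exchange site dump normally contains.
# Assumption: the notebooks need some subset of these.
EXPECTED_FILES = ["Posts.xml", "Users.xml", "Comments.xml", "Votes.xml", "Tags.xml"]


def missing_dump_files(folder="tex.stackexchange.com", expected=EXPECTED_FILES):
    """Return the names from `expected` that are not regular files in `folder`."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


# Hypothetical usage from the project root:
# print(missing_dump_files())  # an empty list means every expected file was found
```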

Results

Our model achieved an accuracy of 70.92% in predicting accepted answers.
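
For reference, accuracy is the fraction of questions whose predicted label matches the true label. The sketch below computes it on made-up toy labels (not project data). It also reports the majority-class baseline, a common reference point for a binary task; whether the project compared its model against such a baseline is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the more frequent class (labels are 0/1)."""
    if not y_true:
        raise ValueError("y_true must be non-empty")
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)


# Toy labels for illustration only: 1 = question has an accepted answer.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

print(accuracy(y_true, y_pred))   # 0.75
print(majority_baseline(y_true))  # 0.625
```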

Contributors

  • Krzysztof Mizgała
  • Julia Czerniecka
  • Wiktoria Gałdusińska
  • Jerzy Grunwald
  • Maciej Kosierb

Feel free to contribute and improve our project!