AITA Predictor

This project is part of SFU's CMPT 353 (Computational Data Science), Summer 2024.

Project Objective

In this project, we aimed to develop a machine learning model capable of categorizing posts from r/AmItheAsshole, where users submit a story and the community tells them whether or not they are the 'A-hole'. We collected data from 2022-2023 and narrowed it down to the two most common post flair categories: “YTA” (You’re The Asshole) and “NTA” (Not The Asshole). Our goal is to reliably predict the community's consensus from the content of each submission by classifying it into one of those two categories.

Methods Used

  • Text Embeddings
  • Machine Learning
  • Data Visualization
  • Predictive Modeling

Technologies

  • Languages: Python (3.11)
  • Libraries/Frameworks: scikit-learn, torch, PySpark, pandas, pickle, Streamlit, OpenAI
  • Tools: Apache Hadoop, Jupyter Notebook, GitHub

Getting Started

  1. From a terminal shell, clone this repository then navigate to the project's root directory.

    git clone git@github.sfu.ca:mgl11/AITA-Predictor.git 
    cd AITA-Predictor
  2. Create and activate a virtual environment. Then install the required packages:

    • Using venv:

      python -m venv env
      source env/bin/activate  # On Windows use: env\Scripts\activate
      pip install -r requirements.txt
    • Using conda:

      conda create --name aita_predictor python=3.11
      conda activate aita_predictor
      pip install -r requirements.txt
  3. Two of the early files in the pipeline, 0-get-reddit-data.py and 2-convert-openai-embeddings.py, are lengthy and cost-incurring, and they do not need to be executed more than once. If you would rather not run one or both of them, you can download the data they generate from this Google Drive link. The table below shows, for each Python file, the data needed to execute it. If you decide to skip steps 0-2 of the pipeline, you only need to download openai_embedded_large.pkl. Place any data you download in the output/ directory. The openai_embedded_large.pkl on the Drive is a larger data set than the one currently in the Git repo; when prompted, replace the existing one.

    Python File                    | reddit-subset/ | filtered_not_balanced.json.gz | filtered_and_balanced.json.gz | openai_embedded_large.pkl
    -------------------------------|----------------|-------------------------------|-------------------------------|--------------------------
    0-get-reddit-data.py           |                |                               |                               |
    1-unload-data.py               | ✓              |                               |                               |
    1.5-balance-data.py            |                | ✓                             |                               |
    2-convert-openai-embeddings.py |                |                               | ✓                             |
    3-predict.ipynb                |                |                               |                               | ✓
    4-model-ui.py                  |                |                               |                               | ✓
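
    If you download openai_embedded_large.pkl, you can sanity-check it before running the later steps. A minimal sketch, assuming the file is a pickled pandas DataFrame; nothing below is part of the pipeline itself:

      # verify_download.py -- hypothetical helper, not part of the pipeline.
      # Assumes output/openai_embedded_large.pkl is a pickled pandas DataFrame.
      import pandas as pd

      df = pd.read_pickle("output/openai_embedded_large.pkl")
      print(df.shape)    # number of posts and columns
      print(df.columns)  # post text, flair label, embedding vector, etc.
      print(df.head())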
  4. If you want to run 2-convert-openai-embeddings.py, you will need to obtain an OpenAI API key and set the OPENAI_KEY variable in a .env file. You can do this from the terminal in the root directory:

    echo "OPENAI_KEY=your_openai_key_here" > .env

Running the Project

This section provides detailed instructions on how to execute each file. As mentioned earlier, pipeline steps 0-2 are lengthy and cost-incurring and can be skipped entirely by downloading the openai_embedded_large.pkl file and placing it in the output/ directory.

  1. Executing 0-get-reddit-data.py:

    This script requires connecting to this course's compute cluster over SSH. Once you have connected, execute the following command to retrieve the output/submissions/ folder of zipped JSON files of Reddit data:

    spark-submit 0-get-reddit-data.py

    To also collect the 2022 data, modify the code in 0-get-reddit-data.py by changing occurrences of '2023' to '2022' and repeat the step above.
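
    For reference, a minimal sketch of the kind of Spark job this step runs; the cluster path, column names, and filtering details are assumptions, not the exact contents of 0-get-reddit-data.py:

      # Sketch only -- paths and column names are illustrative assumptions.
      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.appName("get reddit data").getOrCreate()

      # Hypothetical location of the Reddit submissions on the course cluster.
      submissions = spark.read.json("/courses/datasets/reddit-subset/submissions/")

      aita = submissions.filter(
          (F.col("subreddit") == "AmItheAsshole") & (F.col("year") == 2023)  # change to 2022 for the earlier year
      )
      aita.write.json("output/submissions", mode="overwrite", compression="gzip")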

  2. Executing 1-unload-data.py: This script processes the raw Reddit data, filters out posts that were deleted, had fewer than 10 comments, or did not have an "NTA" or "YTA" flair, and writes the resulting DataFrame to a single zipped JSON file, output/filtered_not_balanced.json.gz.

    spark-submit 1-unload-data.py

    1.5. Executing 1.5-balance-data.py: This script balances the two flair classes in the data from the previous step by random selection, for better model performance, and saves the balanced data to a new zipped JSON file, output/filtered_and_balanced.json.gz.

    python 1.5-balance-data.py
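
    A minimal sketch of what the balancing in step 1.5 amounts to; the column name and sampling details are assumptions, not the exact contents of 1.5-balance-data.py:

      # Sketch only -- "link_flair_text" and line-delimited JSON are assumptions.
      import pandas as pd

      df = pd.read_json("output/filtered_not_balanced.json.gz", lines=True)

      # Randomly downsample so "NTA" and "YTA" are equally represented.
      n = df["link_flair_text"].value_counts().min()
      balanced = (
          df.groupby("link_flair_text", group_keys=False)
            .apply(lambda g: g.sample(n=n, random_state=42))
      )
      balanced.to_json("output/filtered_and_balanced.json.gz",
                       orient="records", lines=True, compression="gzip")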
  3. Executing 2-convert-openai-embeddings.py:

    Running this step requires an OpenAI API key in a .env file, and it can incur a small cost. The script converts the post text into vector embeddings using OpenAI's API, specifically the text-embedding-3-large model.

    If you want to run it, ensure your OpenAI API key is set up in a .env file and execute the command below. The script outputs the entire dataset with vector embeddings to a .pkl file, output/openai_embedded_large.pkl. A minimal sketch of the embedding call appears after the alternative below.

    python 2-convert-openai-embeddings.py
    • Alternatively, you can run 2-convert-embedding.py to obtain the text embedding vectors without using the OpenAI API. This file contains the method we initially wrote to chunk text blocks by sentence, calculate a vector embedding for each sentence, and use some form of aggregation on the sentence vectors to obtain one vector per data point. Creating the embeddings this way will decrease the model's accuracy score. This program will output the entire dataset with vector embeddings to a .pkl file, output/paraphrase_mini_l6_embedded_averaged.pkl.
      python 2-convert-embedding.py
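
    Following the note above, a minimal sketch of calling the text-embedding-3-large model; batching, retries, and column names are simplified assumptions, not the exact contents of 2-convert-openai-embeddings.py:

      # Sketch only -- "selftext" and line-delimited JSON are assumptions.
      import os
      import pandas as pd
      from dotenv import load_dotenv
      from openai import OpenAI

      load_dotenv()
      client = OpenAI(api_key=os.environ["OPENAI_KEY"])

      df = pd.read_json("output/filtered_and_balanced.json.gz", lines=True)

      def embed(text: str) -> list[float]:
          resp = client.embeddings.create(model="text-embedding-3-large", input=text)
          return resp.data[0].embedding

      df["embedding"] = df["selftext"].apply(embed)
      df.to_pickle("output/openai_embedded_large.pkl")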
  4. Executing 3-predict.ipynb:

    Open the Jupyter Notebook and run all cells. This notebook handles all of our model definition, training, and validation.
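
    The notebook's internals are not repeated here, but training a classifier on the embeddings has roughly the following shape; the model choice and column names are assumptions, not the notebook's actual code:

      # Sketch only -- the actual notebook may use different models and columns.
      import numpy as np
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LogisticRegression

      df = pd.read_pickle("output/openai_embedded_large.pkl")
      X = np.vstack(df["embedding"])  # one embedding vector per post (assumed column)
      y = df["link_flair_text"]       # "NTA" / "YTA" labels (assumed column)

      X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)
      model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
      print("validation accuracy:", model.score(X_valid, y_valid))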

  5. Executing 4-model-ui.py: The Streamlit app you would run in this step is hosted here, so you can skip this step by visiting the site. If you run into any issues accessing the site, contact us and we will provide an OpenAI API key so that you can run the app locally. Like step 3, this step requires an OpenAI API key in a .env file and can incur small costs. The script runs a Streamlit app that provides a user interface for making predictions with the model. To run the app locally, use the following command:

    streamlit run 4-model-ui.py

    Once the app is running, you can navigate to http://localhost:8501 in your web browser to interact with the application.
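
    The app's flow is roughly the following; the trained-model path, key handling, and widget layout are assumptions, not the exact contents of 4-model-ui.py:

      # Sketch only -- the model file path and column handling are assumptions.
      import os
      import pickle
      import streamlit as st
      from dotenv import load_dotenv
      from openai import OpenAI

      load_dotenv()
      client = OpenAI(api_key=os.environ["OPENAI_KEY"])

      st.title("AITA Predictor")
      story = st.text_area("Paste an r/AmItheAsshole post:")

      if st.button("Predict") and story:
          # Embed the story with the same OpenAI model used in training.
          emb = client.embeddings.create(
              model="text-embedding-3-large", input=story
          ).data[0].embedding
          with open("output/model.pkl", "rb") as f:  # hypothetical path to the trained classifier
              model = pickle.load(f)
          st.write("Predicted community verdict:", model.predict([emb])[0])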

Project Report

Group Members

  • Marco Lanfranchi
  • Nima Seifi
  • Paul Atwal

Contact

  • If there are any questions or issues running code, please feel free to contact any of the group members.