This project is a part of SFU's CMPT 353 Summer 2024, Computational Data Science.
In this project, we aimed to develop a machine learning model capable of categorizing posts from r/AmItheAsshole, where users submit a story and are told by the community whether or not they are the 'A-hole'. We collected data from 2022-2023 and narrowed it down to the two most common post flair categories: “YTA” (You’re The Asshole) and “NTA” (Not The Asshole). Our goal is to reliably predict the community's consensus from the content of each submission by classifying it into one of those two categories.
- Text Embeddings
- Machine Learning
- Data Visualization
- Predictive Modeling
- Languages: Python (3.11)
- Libraries/Frameworks: scikit-learn, torch, PySpark, pandas, pickle, Streamlit, OpenAI
- Tools: Apache Hadoop, Jupyter Notebook, GitHub
- From a terminal shell, clone this repository, then navigate to the project's root directory:

      git clone git@github.sfu.ca:mgl11/AITA-Predictor.git
      cd AITA-Predictor

- Create and activate a virtual environment, then install the required packages:
  - Using venv:

        python -m venv env
        source env/bin/activate  # On Windows use: env\Scripts\activate
        pip install -r requirements.txt

  - Using conda:

        conda create --name aita_predictor python=3.11
        conda activate aita_predictor
        pip install -r requirements.txt

- Two of the early files in the pipeline, 0-get-reddit-data.py and 2-convert-openai-embeddings.py, are lengthy and cost-incurring, and they do not need to be executed more than once. If you do not want to run one or both of them, you can download the data generated by these scripts from this Google Drive link. The table below shows, for each Python file, the data needed to execute it. If you decide to skip steps 0-2 in the pipeline, you only need to download openai_embedded_large.pkl. Place any data you download in the output/ directory. The openai_embedded_large.pkl in the drive is a larger data set than the one currently in the git repo; when prompted, replace the existing one.

  | Python File | reddit-subset/ | filtered_not_balanced.json.gz | filtered_and_balanced.json.gz | openai_embedded_large.pkl |
  | --- | --- | --- | --- | --- |
  | 0-get-reddit-data.py | ❌ | ❌ | ❌ | ❌ |
  | 1-unload-data.py | ✅ | ❌ | ❌ | ❌ |
  | 1.5-balance-data.py | ❌ | ✅ | ❌ | ❌ |
  | 2-convert-openai-embeddings.py | ❌ | ❌ | ✅ | ❌ |
  | 3-predict.ipynb | ❌ | ❌ | ❌ | ✅ |
  | 4-model-ui.py | ❌ | ❌ | ❌ | ✅ |

- If you want to run 2-convert-openai-embeddings.py, you will need to obtain an OpenAI API key and set the .env variable OPENAI_KEY. You can do this in the terminal from the root directory:

      echo "OPENAI_KEY=your_openai_key_here" > .env

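To confirm that the key is picked up before running a cost-incurring step, a small check like the one below can help. This is only a sketch using the standard library: how the project's scripts actually read the .env file is not shown here, and the helper names `load_env_file` and `get_openai_key` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openai_key(path=".env"):
    """Return OPENAI_KEY from the environment or the .env file, else raise."""
    key = os.environ.get("OPENAI_KEY") or load_env_file(path).get("OPENAI_KEY")
    if not key:
        raise RuntimeError("OPENAI_KEY is not set; add it to .env first.")
    return key
```

Calling `get_openai_key()` from the project root should return the value written by the `echo` command, or raise a clear error if the file is missing.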
This section provides detailed instructions on how to execute each file. As mentioned earlier, pipeline steps 0-2 are lengthy and cost-incurring and can be skipped entirely by downloading the openai_embedded_large.pkl file and placing it in the output/ directory.
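If you are unsure which steps you can run with the data you have, the requirements from the data table can be checked programmatically. The sketch below assumes every input, including reddit-subset/, lives directly under output/; the names `REQUIRED_INPUTS`, `missing_inputs`, and `runnable_steps` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Inputs each pipeline file needs, transcribed from the data table.
REQUIRED_INPUTS = {
    "0-get-reddit-data.py": [],
    "1-unload-data.py": ["reddit-subset/"],
    "1.5-balance-data.py": ["filtered_not_balanced.json.gz"],
    "2-convert-openai-embeddings.py": ["filtered_and_balanced.json.gz"],
    "3-predict.ipynb": ["openai_embedded_large.pkl"],
    "4-model-ui.py": ["openai_embedded_large.pkl"],
}


def missing_inputs(script, data_dir="output"):
    """Return the required inputs for `script` that are absent from data_dir."""
    base = Path(data_dir)  # assumption: all inputs sit directly in this folder
    return [name for name in REQUIRED_INPUTS[script] if not (base / name).exists()]


def runnable_steps(data_dir="output"):
    """Return the pipeline files whose required inputs are all present."""
    return [script for script in REQUIRED_INPUTS if not missing_inputs(script, data_dir)]
```

For example, with only openai_embedded_large.pkl downloaded, `runnable_steps()` would list 3-predict.ipynb and 4-model-ui.py, plus 0-get-reddit-data.py, which needs no local data.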
Executing 0-get-reddit-data.py: This script requires connecting to this course's compute cluster by SSH. Once you have connected, execute the following command to retrieve the output/submissions/ folder of zipped JSON files of Reddit data:

    spark-submit 0-get-reddit-data.py

Then modify the code in 0-get-reddit-data.py by changing occurrences of '2023' to '2022' and repeat the step above.

Executing 1-unload-data.py: This script processes the raw Reddit data, removes posts that were deleted, had fewer than 10 comments, or did not have an "NTA" or "YTA" flair, and writes the resulting DataFrame to a single zipped JSON file, output/filtered_not_balanced.json.gz.

    spark-submit 1-unload-data.py

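The three filters described for 1-unload-data.py can be illustrated in plain Python. The real script runs on Spark, so this is only a sketch of the logic; the field names (`selftext`, `num_comments`, `link_flair_text`) and the deleted-post markers are assumptions about the Reddit data, not taken from the script.

```python
VALID_FLAIRS = {"NTA", "YTA"}
DELETED_MARKERS = {"[deleted]", "[removed]"}  # assumed markers for deleted posts
MIN_COMMENTS = 10


def keep_post(post):
    """Return True if a post passes all three filters."""
    body = (post.get("selftext") or "").strip()
    if body in DELETED_MARKERS:
        return False  # the post was deleted
    if post.get("num_comments", 0) < MIN_COMMENTS:
        return False  # fewer than 10 comments
    return post.get("link_flair_text") in VALID_FLAIRS  # flair must be NTA or YTA


def filter_posts(posts):
    """Keep only the posts that satisfy keep_post, preserving order."""
    return [post for post in posts if keep_post(post)]
```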
Executing 1.5-balance-data.py: This script balances the data from the previous step by random selection, for better model performance, and saves the balanced data to a new zipped JSON file, output/filtered_and_balanced.json.gz.

    python 1.5-balance-data.py

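One common way to balance classes by random selection is to downsample every class to the size of the smallest one. The sketch below shows that approach under the assumption that this is what the script does; the function name `balance_by_label`, the `label` key, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_label(rows, label_key="label", seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    if not groups:
        return []
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced
```

Downsampling discards data from the larger class, which is the trade-off for giving the model an equal number of "NTA" and "YTA" examples.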
Executing 2-convert-openai-embedding.py: Running this step requires an OpenAI API key in a .env file and can incur a small cost. This script converts the text data into vector embeddings using OpenAI's API, specifically their text-embedding-3-large model. It writes the entire dataset with vector embeddings to a .pkl file, output/openai_embedded_large.pkl. To run it, make sure your OpenAI API key is set in the .env file, then execute:

    python 2-convert-openai-embedding.py

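For readers curious what such a conversion involves, here is a minimal sketch of batched embedding requests. It is not the project's script: the batch size, the helper names, and the assumption that OPENAI_KEY has already been exported into the process environment are all illustrative. The OpenAI call is isolated in `openai_embed_batch` so the batching logic can be exercised without network access or cost.

```python
import os


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_texts(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch, preserving input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def openai_embed_batch(batch, model="text-embedding-3-large"):
    """Embed one batch with the OpenAI API (network call, incurs cost)."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    # Assumption: OPENAI_KEY is already present in the environment.
    client = OpenAI(api_key=os.environ["OPENAI_KEY"])
    response = client.embeddings.create(model=model, input=batch)
    # Sort by index so each vector lines up with its input text.
    return [item.embedding for item in sorted(response.data, key=lambda d: d.index)]
```

A call such as `embed_texts(posts, openai_embed_batch)` would then return one vector per post, in the same order as the input list.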
Alternatively, you can run 2-convert-embedding.py to obtain the text embedding vectors without using the OpenAI API. This file contains the method we initially wrote: it chunks each text block by sentence, calculates a vector embedding for each sentence, and aggregates the sentence vectors into one vector per data point. Creating the embeddings this way will decrease the model's accuracy score. This program writes the entire dataset with vector embeddings to a .pkl file, output/paraphrase_mini_l6_embedded_averaged.pkl.

    python 2-convert-embedding.py

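The sentence-level approach can be sketched as follows. This assumes the aggregation is a plain element-wise mean (the output file name suggests averaging, but the script may differ) and uses a naive regex sentence splitter. The sentence encoder is passed in as `embed_sentence`, a hypothetical stand-in for whatever model the script loads.

```python
import re


def split_sentences(text):
    """Naively split text into sentences on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_document(text, embed_sentence):
    """Embed each sentence, then average the sentence vectors into one vector."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text contains no sentences")
    return mean_vector([embed_sentence(sentence) for sentence in sentences])
```

Averaging gives every sentence equal weight regardless of its length or importance, which may be one reason this approach scores lower than embedding the whole post at once.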
Executing 3-predict.ipynb: Open the Jupyter notebook and run all cells. This notebook handles all of our model definition, training, and validation.
Executing 4-model-ui.py: The Streamlit app you run in this step is hosted here, so you can skip this step by visiting the site. If there are any issues accessing the site, you can contact us and we will provide an OpenAI API key so that you can run the app locally. Running it locally also requires an OpenAI API key in a .env file and can incur small costs. The script runs a Streamlit app that provides a user interface for making predictions with the model. To run the app locally, use the following command:

    streamlit run 4-model-ui.py

Once the app is running, you can navigate to http://localhost:8501 in your web browser to interact with the application.
- Marco Lanfranchi
- Nima Seifi
- Paul Atwal
If you have any questions or issues running the code, please feel free to contact any of the group members.