Project Pensive is open source software that leverages the latest AI research to benefit those offering recommendations. The goal of Project Pensive is to build better Recommender Engines that incorporate content-conscious filters to improve content quality and user experience.
See also the project charter and technical agenda.
To get started:

- Clone this repo:

  ```
  git clone https://github.com/VectorInstitute/projectpensive.git
  ```

- Install code packages:

  ```
  python setup.py install
  ```

- Install package dependencies:

  ```
  pip install -r requirements.txt
  ```
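Taken together, and assuming a Python 3.7+ environment is already active, the full setup is:

```
git clone https://github.com/VectorInstitute/projectpensive.git
cd projectpensive
python setup.py install
pip install -r requirements.txt
```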
Notes:

- This is a one-off demo; the code repository will not be maintained.
- This code was written against a Python 3.7 interpreter. Python 3.7 or newer is recommended.
To run any training programs or demos, you will need to download the data. Assuming you have a Kaggle account, go to the dataset and download the zip file. Unzip it and move the csv file to `projectpensive/civility/recommender/train-balanced-sarcasm.csv`. Once this is done, traverse to `projectpensive/civility/recommender` and run `python preprocess_data.py`. If done correctly, you will have all the data you need to run experiments and the demo.

Another dataset that needs to be downloaded can be found here. Unzip this file, create a `datasets` folder under `diversity`, and add the file there as `projectpensive/diversity/datasets/reddit_user_data_count.csv`. Follow the instructions under the Diversity Filter section below to correctly preprocess the data. A possible command sequence for these steps is sketched below.
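A sketch of the full sequence, assuming the Kaggle CLI is installed and authenticated; `<dataset-slug>` is a placeholder for the Kaggle dataset identifier, which is not reproduced here:

```
# Placeholder slug: substitute the actual Kaggle dataset identifier.
kaggle datasets download -d <dataset-slug> --unzip
mv train-balanced-sarcasm.csv projectpensive/civility/recommender/

# Second dataset (downloaded manually): place it where the code expects it.
mkdir -p projectpensive/diversity/datasets
mv reddit_user_data_count.csv projectpensive/diversity/datasets/

# Preprocess the recommender data.
cd projectpensive/civility/recommender
python preprocess_data.py
```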
If you want to run the demo, you will need model checkpoints for the civility filter and the recommender, as well as precomputed embeddings for the diversity filter. You will need to train the filters yourself (see the Civility Filter and Recommender Engine sections below). To acquire the precomputed embeddings for the diversity filter, see the Diversity Filter section.
## Civility Filter

Code for the civility filter can be found under `projectpensive/civility/classifier`. To train a model, run `python train.py`.
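Once trained, the resulting checkpoint can be loaded for inference. A minimal sketch, assuming the checkpoint is saved in a Hugging Face `transformers`-compatible format (the actual format and path used by this repo may differ):

```python
# Hypothetical inference sketch: the checkpoint path and format below are
# assumptions, not the repo's confirmed layout.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="civility/classifier/checkpoint",  # placeholder path
)

# Score a single comment; a filter could drop comments the model flags as uncivil.
print(classifier("You make a fair point, thanks for explaining."))
```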
## Diversity Filter

Code for the diversity filter can be found under `projectpensive/diversity`. To generate subreddit embeddings, go to `projectpensive/diversity/embeddings` and run the `subreddit2vec.ipynb` notebook. Once you have generated the `vectors.tsv` and `metadata.tsv` files and ensured that they are in the `datasets` folder, run `projectpensive/diversity/generate-subreddit-dataframe.py`. To generate comment embeddings, run the `generate-comment-embeddings.py` file.
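As a quick sanity check on the generated subreddit embeddings, you can load the two TSV files and compare subreddits by cosine similarity. A minimal sketch, assuming `vectors.tsv` holds one tab-separated embedding per row and `metadata.tsv` holds the matching subreddit names, one per row (the embedding-projector-style layout; the repo's exact columns may differ):

```python
# Sanity-check sketch for the generated embeddings; file layout is assumed.
import numpy as np

vectors = np.loadtxt("datasets/vectors.tsv", delimiter="\t")
with open("datasets/metadata.tsv") as f:
    names = [line.strip() for line in f]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity between the first two subreddits in the files.
print(names[0], names[1], cosine(vectors[0], vectors[1]))
```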
## Recommender Engine

Code for the recommender engine can be found under `projectpensive/civility/recommender`. To train a recommender, run `python train.py` from that folder (this is a separate script from the civility filter's `train.py`; make sure the preprocessing step from the dataset instructions above has been run first).
If you have access to the Vaughan server, you can run a shell script to download all necessary data and model checkpoints. To do so, run `sh download_data_and_checkpoints your_cluster_id`.
To run the demo, do the following:

- Make sure you have the necessary datasets and model checkpoints.
- Traverse to the demo folder:

  ```
  cd demo
  ```

- Run the streamlit app:

  ```
  streamlit run main.py
  ```
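Streamlit prints a local URL when it starts (by default http://localhost:8501); open it in a browser to interact with the demo.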
## Relevant Research

- Improving Recommendation Diversity (2001, Keith Bradley and Barry Smyth)
- Improving Recommendation Lists Through Topic Diversification (2005, Cai-Nicolas Ziegler et al.)
- Enhancing Diversity-Accuracy Technique on User-Based Top-N Recommendation Algorithms (2013, Wichian Premchaiswadi et al.)
There is no active support for this repository.
This was work done by Michael Laurent Nasello and Sheen Thusoo during their summer 2021 internship at the Vector Institute, with help from Ron Bodkin and Shems Saleh (during her time at Vector).
This repo has an MIT-style license, as found in the LICENSE file.