YelpRecommendation

Introduction

This project focuses on matching benchmark performance in recommendation systems using the Yelp 2018 dataset. The dataset includes detailed reviews, user profiles, and business metadata, which are crucial for personalized recommendation systems.

Data Overview

  • Reviews: Text reviews and ratings from users for various businesses.
  • Users: Demographic and preference information of users.
  • Businesses: Attributes of businesses including location, category, and operational hours.
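
All three dumps ship as newline-delimited JSON in the official Yelp download. The snippet below is a minimal loading sketch, not part of this repository: the file names follow recent Yelp Open Dataset releases (adjust if your download differs), and DATA_DIR is a placeholder for the directory configured in configs/data_preprocess.yaml.

import pandas as pd

DATA_DIR = "data/raw"  # placeholder; point this at the directory set in configs/data_preprocess.yaml

# Each dump is newline-delimited JSON, so lines=True is required.
businesses = pd.read_json(f"{DATA_DIR}/yelp_academic_dataset_business.json", lines=True)
users = pd.read_json(f"{DATA_DIR}/yelp_academic_dataset_user.json", lines=True)

# The review dump is the largest file; stream it in chunks and keep only the columns the models need.
chunks = pd.read_json(f"{DATA_DIR}/yelp_academic_dataset_review.json", lines=True, chunksize=100_000)
reviews = pd.concat(chunk[["user_id", "business_id", "stars", "date"]] for chunk in chunks)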

Models Implemented

  • Collaborative Filtering: Predicts user preferences based on user-item interactions.
    • Collaborative Denoising Auto-Encoders (CDAE, 2016) apply Denoising Auto-Encoders (DAE) to top-N recommendation, generalizing several collaborative filtering (CF) models. Unlike AutoRec (2015), CDAE adds a dedicated user node and is trained on corrupted input preferences; see the sketch after this list.
  • Matrix Factorization: Reduces the dimensionality of the interaction matrix to uncover latent features.
  • Deep Neural Networks: Leverages deep learning to enhance prediction accuracy using complex feature interactions.
  • Sequential Models: Predicts users' next item choice based on their past behaviors.
  • Graph-Convolution Models: Captures high-order interactions between users and items, enabling efficient batch-level computation.

Our goal is to provide a robust analysis of these models and evaluate their performance comprehensively.
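
For concreteness, the following is a minimal CDAE-style sketch in PyTorch. It is illustrative only (the class name, layer sizes, and corruption rate are assumptions, not the implementation in models/cdae.py), but it shows the two ingredients noted above: corrupted input preferences and a per-user embedding node added to the hidden layer.

import torch
import torch.nn as nn

class CDAESketch(nn.Module):
    """Denoising auto-encoder over a user's item-interaction vector, plus a user embedding node."""
    def __init__(self, num_users, num_items, hidden_dim=64, corruption=0.5):
        super().__init__()
        self.corrupt = nn.Dropout(p=corruption)                    # input corruption (the "denoising" part)
        self.encoder = nn.Linear(num_items, hidden_dim)            # encodes the (corrupted) interaction vector
        self.user_embedding = nn.Embedding(num_users, hidden_dim)  # user node that distinguishes CDAE from AutoRec
        self.decoder = nn.Linear(hidden_dim, num_items)            # reconstructs scores for every item

    def forward(self, user_ids, interactions):
        # interactions: (batch, num_items) binary implicit-feedback vectors
        hidden = torch.sigmoid(self.encoder(self.corrupt(interactions)) + self.user_embedding(user_ids))
        return self.decoder(hidden)  # raw scores; pair with nn.BCEWithLogitsLoss for top-N training

At inference time, items a user has already interacted with are typically masked out before taking the top-10 scores.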

Project Structure

.
├── README.md
├── __init__.py
├── configs
│   ├── cdae_sweep_config.yaml
│   ├── data_preprocess.yaml
│   ├── mf_sweep_config.yaml
│   ├── sweep_config.yaml
│   └── train_config.yaml
├── data
│   ├── __init__.py
│   ├── data_preprocess.py
│   ├── datasets
│   │   ├── __init__.py
│   │   ├── cdae_data_pipeline.py
│   │   ├── cdae_dataset.py
│   │   ├── data_pipeline.py
│   │   ├── dcn_data_pipeline.py
│   │   ├── dcn_dataset.py
│   │   ├── mf_data_pipeline.py
│   │   ├── mf_dataset.py
│   │   ├── ngcf_data_pipeline.py
│   │   ├── ngcf_dataset.py
│   │   ├── poprec_data_pipeline.py
│   │   ├── poprec_dataset.py
│   │   ├── s3rec_data_pipeline.py
│   │   └── s3rec_dataset.py
├── loss.py
├── metric.py
├── models
│   ├── base_model.py
│   ├── cdae.py
│   ├── dcn.py
│   ├── mf.py
│   ├── ngcf.py
│   ├── s3rec.py
│   └── wdn.py
├── poetry.lock
├── pyproject.toml
├── train.py
├── trainers
│   ├── __init__.py
│   ├── base_trainer.py
│   ├── cdae_trainer.py
│   ├── dcn_trainer.py
│   ├── mf_trainer.py
│   ├── ngcf_trainer.py
│   ├── poprec_trainer.py
│   └── s3rec_trainer.py
└── utils.py

Development Environment

To run this project, you will need:

  • Python 3.11+: Ensure Python version is up to date for compatibility.
  • Jupyter Notebook: For interactive data analysis and visualizations.
  • Required Libraries: pandas, numpy, scikit-learn, and PyTorch (managed with Poetry; see pyproject.toml).
  • Operating System: Compatible with Windows, macOS, and Linux.

Technology Stack

Python, PyTorch, Pandas, NumPy, Vim, Google Cloud

Model Performance Comparison

The following table shows the performance of different models used in the project. Each model was evaluated based on multiple metrics:

Model               MAP@10    Precision@10  Recall@10  NDCG@10  HIT@10  MRR
CDAE                0.02222   0.01538       0.0713     0.02198  -       -
DCN                 0.0004    0.0004        0.0016     0.0005   -       -
NGCF                0.0002    0.0001        0.0006     0.0002   -       -
S3Rec               -         -             -          0.1743   0.3134  0.1537
Multi-armed bandit  -         -             -          -        -       -

These results were obtained from the Yelp 2018 dataset under controlled test conditions.
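
For reference, the cut-off metrics above follow their standard binary-relevance definitions. The sketch below is a per-user illustration of Recall@K and NDCG@K; the repository's metric.py may compute them differently (e.g. in batched tensor form).

import numpy as np

def recall_at_k(recommended, relevant, k=10):
    # Fraction of the user's held-out relevant items that appear in the top-k list.
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k=10):
    # Binary-relevance NDCG: log-discounted gain of hits, normalized by the ideal ranking.
    relevant = set(relevant)
    dcg = sum(1.0 / np.log2(rank + 2) for rank, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Example: two of the four relevant businesses appear in the top-10 recommendations.
print(recall_at_k(["b1", "b2", "b3"], ["b2", "b3", "b7", "b9"]))  # 0.5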

How to Run

Prerequisites

  • Python >= 3.11
  • Poetry >= 1.8.2
  • PyTorch

# set up the environment
$ poetry install
$ poetry shell

# generate input data
# download the dataset from the Yelp official website (https://www.yelp.com/dataset/download) and set the data directory in the config
$ vi configs/data_preprocess.yaml
$ python data/data_preprocess.py

# train model
$ vi configs/train_config.yaml
$ python train.py

Contributors

이주연 조성홍