A pipeline that collects, preprocesses, and labels raw user-item interaction data to build a hybrid recommender system, combining collaborative filtering (SVD, ALS) with learning-to-rank (an XGBoost ranking objective) and evaluating results with NDCG and MAP.
End-to-end features:
- Data collection & labeling: simulated web scraping and synthetic data generation to mimic user-item interactions.
- Data preprocessing: techniques for handling missing values, duplicates and outliers.
- Collaborative filtering: implementation of matrix factorization approaches (SVD, ALS) using the Surprise library.
- Learning-to-rank: XGBoost with ranking objectives to refine recommendations.
- Evaluation metrics: calculation of ranking metrics including Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP).
- Visualization & analysis: tools for in-depth performance evaluation and visualization.
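The preprocessing step above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names (`user_id`, `item_id`, `rating`) and the 1.5 × IQR outlier rule are assumptions.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Handle duplicates, missing values, and outliers in interaction data."""
    # Drop exact duplicate user-item interactions, keeping the first.
    df = df.drop_duplicates(subset=["user_id", "item_id"])
    # Fill missing ratings with the per-user mean, falling back to the global mean.
    df["rating"] = df["rating"].fillna(df.groupby("user_id")["rating"].transform("mean"))
    df["rating"] = df["rating"].fillna(df["rating"].mean())
    # Clip outliers beyond 1.5 * IQR of the rating distribution.
    q1, q3 = df["rating"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["rating"] = df["rating"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df
```

Clipping (rather than dropping) outliers keeps the interaction count stable, which matters when the matrix is already sparse.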
Structure:
- data_pipeline.py contains modules for synthetic data generation, web scraping simulation and preprocessing.
- recommender.py implements collaborative filtering models (SVD, ALS) and the learning-to-rank model using XGBoost.
- metrics.py contains functions for calculating NDCG, MAP and other ranking metrics.
- evaluation.py provides evaluation functions and visualization routines to assess model performance.
- utils.py contains general utility functions, including logging, data splitting and configuration management.
- main.py is the main driver script that integrates the entire pipeline.
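The matrix-factorization idea behind recommender.py can be sketched with a plain-NumPy ALS loop (the actual code uses the Surprise library; the function name, shapes and hyperparameters here are illustrative assumptions).

```python
import numpy as np

def als(R, mask, k=8, n_iters=20, reg=0.1, seed=0):
    """Alternating least squares on an explicit ratings matrix.

    R: (n_users, n_items) ratings; mask: 1.0 where a rating is observed.
    Returns user factors U (n_users, k) and item factors V (n_items, k).
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    I = np.eye(k)
    for _ in range(n_iters):
        # Fix V, solve a ridge regression per user over observed items.
        for u in range(n_users):
            idx = mask[u] > 0
            Vo = V[idx]
            U[u] = np.linalg.solve(Vo.T @ Vo + reg * I, Vo.T @ R[u, idx])
        # Fix U, solve per item over observed users.
        for i in range(n_items):
            idx = mask[:, i] > 0
            Uo = U[idx]
            V[i] = np.linalg.solve(Uo.T @ Uo + reg * I, Uo.T @ R[idx, i])
    return U, V
```

Predicted scores are then `U @ V.T`; the regularization term keeps each least-squares solve well-posed even for users or items with few observations.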
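The ranking metrics computed in metrics.py follow their standard definitions; a self-contained sketch (function names are illustrative, the formulas are the textbook ones with a log2 position discount):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked relevance list, truncated at k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    if rel.size == 0:
        return 0.0
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..k -> log2(2..k+1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def average_precision(relevances):
    """Mean of precision@i over the positions i holding relevant items."""
    rel = np.asarray(relevances) > 0
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float(precisions[rel].mean())

def mean_average_precision(relevance_lists):
    """MAP: average precision averaged over queries (here, users)."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))
```

For example, `ndcg_at_k([3, 2, 3, 0, 1, 2], 6)` evaluates a ranking against its ideal reordering `[3, 3, 2, 2, 1, 0]` and comes out just above 0.96.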
Ensure you have Python 3.8+ installed. Install the required packages using:
pip install -r requirements.txt
To run the full pipeline:
python main.py
License: Apache 2.0