LTRS-scraper

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

A pipeline that collects, preprocesses and labels raw user-item interaction data to build a hybrid recommender system using collaborative filtering (SVD, ALS) and learning-to-rank (XGBoost ranking) methods, evaluated with NDCG and MAP metrics.

End-to-end features:

  • Data collection & labeling: simulated web scraping and synthetic data generation to mimic user-item interactions.
  • Data preprocessing: techniques for handling missing values, duplicates and outliers.
  • Collaborative filtering: implementation of matrix factorization approaches (SVD, ALS) using the Surprise library.
  • Learning-to-rank: XGBoost with ranking objectives to refine recommendations.
  • Evaluation metrics: calculation of ranking metrics, including Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP).
  • Visualization & analysis: tools for in-depth performance evaluation and visualization.
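As a rough illustration of the two ranking metrics named above, here is a minimal standard-library sketch. The function names are hypothetical and are not taken from metrics.py. It assumes the linear-gain form of DCG (some implementations use 2^rel - 1 instead), and it assumes that every relevant item appears somewhere in the ranked list when average precision is normalized.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same grades."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def average_precision(relevances):
    """Average precision for one ranked list of binary relevance flags (0 or 1)."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of average precision over several users' ranked lists."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)
```

Each input list holds the true relevance of the recommended items in the order the model ranked them, so a perfectly ordered list such as `[3, 2, 1, 0]` gives an NDCG of 1.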

Structure:

  • data_pipeline.py contains modules for synthetic data generation, web scraping simulation and preprocessing.
  • recommender.py implements collaborative filtering models (SVD, ALS) and the learning-to-rank model using XGBoost.
  • metrics.py contains functions for calculating NDCG, MAP and other ranking metrics.
  • evaluation.py provides evaluation functions and visualization routines to assess model performance.
  • utils.py provides general utility functions, including logging, data splitting and configuration management.
  • main.py is the main driver script that integrates the entire pipeline.
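The data steps described for data_pipeline.py can be sketched roughly as follows, using only the standard library. Everything here is an assumption made for illustration: the function names, the record layout (`user_id`, `item_id`, `rating`), the 1 to 5 rating scale, and the specific choices of median imputation, clipping for outliers, and keeping the last duplicate. The actual module may work differently.

```python
import random
import statistics


def generate_interactions(n_users, n_items, n_rows, seed=0):
    """Generate synthetic user-item ratings on a 1-5 scale (hypothetical layout)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    return [
        {
            "user_id": rng.randrange(n_users),
            "item_id": rng.randrange(n_items),
            "rating": float(rng.randint(1, 5)),
        }
        for _ in range(n_rows)
    ]


def preprocess(rows, low=1.0, high=5.0):
    """Impute missing ratings, clip outliers, and drop duplicate (user, item) pairs."""
    observed = [r["rating"] for r in rows if r["rating"] is not None]
    # Fall back to the scale midpoint if no rating was observed at all.
    fill = statistics.median(observed) if observed else (low + high) / 2
    cleaned = {}
    for r in rows:
        rating = fill if r["rating"] is None else r["rating"]
        rating = min(max(rating, low), high)  # clip values outside [low, high]
        # A later row for the same (user, item) pair overwrites the earlier one.
        cleaned[(r["user_id"], r["item_id"])] = rating
    return [
        {"user_id": u, "item_id": i, "rating": v}
        for (u, i), v in cleaned.items()
    ]
```

A driver script could then call `preprocess(generate_interactions(100, 50, 1000))` before handing the cleaned rows to the models.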

Installation and usage:

Ensure you have Python 3.8+ installed. Install the required packages using:

pip install -r requirements.txt

To run the full pipeline:

python main.py

License

Apache 2.0