A university project
Course: Information Retrieval & Search Engines
A content-based recommendation system for books (see dataset below). The program picks a random user and takes their 3 highest-rated books. Based on these 3 ratings, it recommends 10 books the user may like. The recommendations are based on 3 factors:
- Keyword similarity between the book titles (Jaccard similarity or Dice coefficient)
- Whether the books have the same author
- Difference in publication year
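The two title-similarity measures above can be sketched over keyword sets as follows (the example keyword sets are hypothetical, not taken from the dataset):

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of the two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """Dice coefficient: 2 * |intersection| / (|a| + |b|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

k1 = {"harri", "potter", "stone"}
k2 = {"harri", "potter", "chamber"}
jaccard(k1, k2)  # 2 shared / 4 total -> 0.5
dice(k1, k2)     # 2*2 / (3+3) -> ~0.667
```

Note that Dice always scores at least as high as Jaccard on the same pair, so the choice mainly affects how much partial title overlap is rewarded.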
pip install -r requirements.txt
The 3 initial CSV files must be in the same folder as preproc.py. When you run the file for the first time, uncomment lines 5 and 6:
# nltk.download('stopwords')
# nltk.download('punkt')
The BX-Whole.csv file must be in the same folder as main.py.
From the initial dataset, we remove:
- Books with fewer than 10 ratings
- Users who have rated fewer than 5 books
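A minimal pandas sketch of that filtering step. The column names (`User-ID`, `ISBN`, `Book-Rating`) follow the usual BX dataset schema but should be checked against the actual files, and the thresholds are lowered here so the toy frame survives the filter (the project uses 10 and 5):

```python
import pandas as pd

# Toy ratings frame standing in for BX-Book-Ratings.csv.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 2, 3],
    "ISBN": ["a", "b", "a", "b", "c", "a"],
    "Book-Rating": [5, 4, 3, 2, 1, 5],
})

MIN_BOOK_RATINGS, MIN_USER_RATINGS = 2, 2  # project values: 10 and 5

# Drop books with too few ratings, then users with too few remaining ratings.
book_counts = ratings.groupby("ISBN")["User-ID"].transform("size")
ratings = ratings[book_counts >= MIN_BOOK_RATINGS]
user_counts = ratings.groupby("User-ID")["ISBN"].transform("size")
ratings = ratings[user_counts >= MIN_USER_RATINGS]
```

Filtering books first and users second matters: removing sparse books can push a user below the threshold, so the user count is taken on the already-filtered frame.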
After these removals, we generate keywords for each book title by applying the following steps to it:
- Tokenization
- Stop word removal
- Stemming (Porter's algorithm by default; Snowball is also supported)
Every book now has a list of keywords, which is attached to the dataframe as a new column.
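The keyword pipeline can be sketched like this. It is simplified: the project uses NLTK's `word_tokenize` and its stopword corpus (hence the `nltk.download` lines above), while here a plain `split()` and a tiny hard-coded stopword list stand in so the example runs without the downloads:

```python
from nltk.stem import PorterStemmer

# Hard-coded mini stopword list; the project loads stopwords.words("english").
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for"}
stemmer = PorterStemmer()

def title_keywords(title):
    """Tokenize, drop stop words and non-alphabetic tokens, then stem."""
    tokens = title.lower().split()
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in STOPWORDS]

title_keywords("The History of Running")  # -> ['histori', 'run']
```

Stemming is what lets near-duplicate titles match: "Running" and "Runs" both reduce to "run", so the Jaccard/Dice comparison sees them as the same keyword.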
Finally, we join all tables (BX-Books, BX-Book-Ratings, BX-Users) into a new dataframe that contains all the information needed, and write it out as BX-Whole.csv, which is the input to our recommendation system.
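The join step could look roughly like this. The toy frames and column names mirror the usual BX schema but are assumptions; in preproc.py the real tables would come from `pd.read_csv`:

```python
import pandas as pd

# Hypothetical one-row stand-ins for the three BX tables.
books = pd.DataFrame({"ISBN": ["a"], "Book-Title": ["T"],
                      "Book-Author": ["A"], "Year-Of-Publication": [1999]})
ratings = pd.DataFrame({"User-ID": [1], "ISBN": ["a"], "Book-Rating": [8]})
users = pd.DataFrame({"User-ID": [1], "Location": ["athens, greece"]})

# Ratings drive the join: each rating row is enriched with book and user info.
whole = ratings.merge(books, on="ISBN").merge(users, on="User-ID")
whole.to_csv("BX-Whole.csv", index=False)
```

Starting from the ratings table keeps exactly one output row per (user, book) rating, which is the granularity the recommender needs.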
python preproc.py
python main.py