This repository contains code and data for building a movie recommendation system. I designed the system to recommend movies based on user preferences and movie attributes. In this README, I will provide an overview of the data preprocessing steps and the structure of the code.
The data used in this project consists of two main datasets: `credits.csv` and `movies.csv`. Here is some basic information about these datasets:
- `credits.csv`: Contains information about the cast and crew of each movie.
  - Shape: (4803, 4)
  - Columns: 'movie_id', 'title', 'cast', 'crew'
- `movies.csv`: Contains information about movies, including titles, overviews, genres, keywords, and original language.
  - Shape: (4803, 20)
  - Columns: 'movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew', and more.
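
As a rough illustration, the shapes and column names listed above could be checked with a sketch like the one below. It uses only Python's standard library, and the helper name `describe_csv` is hypothetical; this is an assumption made for illustration, not necessarily how the code in this repository loads the data. The demo runs on a tiny in-memory stand-in rather than the real files.

```python
import csv
import io


def describe_csv(handle):
    """Return ((n_rows, n_columns), column_names) for an open CSV file."""
    reader = csv.DictReader(handle)
    rows = list(reader)                # data rows, header excluded
    columns = reader.fieldnames or []  # header names, [] if the file is empty
    return (len(rows), len(columns)), columns


# Tiny in-memory stand-in with the same columns as credits.csv (made-up rows).
sample = io.StringIO(
    "movie_id,title,cast,crew\n"
    "1,Example Movie,[],[]\n"
    "2,Another Movie,[],[]\n"
)
shape, columns = describe_csv(sample)
print(shape, columns)
```

For the real data, the same helper could be called on a file opened with `open("credits.csv", newline="", encoding="utf-8")`; based on the figures above, it should report a shape of (4803, 4). With pandas, `pd.read_csv(path).shape` gives the equivalent information.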
I preprocessed the data to prepare it for building the recommendation system. Here are the key preprocessing steps:
- Handling Missing Values: I removed rows with missing values in the 'overview' column.
- Data Cleaning: I cleaned the text data in the 'overview' column by removing punctuation and converting text to lowercase.
- Feature Engineering: I extracted relevant features from the data, such as genres, keywords, cast, and crew, and transformed them into tags.
- Tag Generation: Tags were generated by combining information from different columns, such as the movie overview, genres, cast, crew, and keywords.
- Tag Normalization: All tags were converted to lowercase for consistency.
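
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the repository's actual implementation: each movie is represented as a dictionary, the 'genres', 'keywords', 'cast', and 'crew' fields are assumed to already be lists of strings, and the function names (`clean_overview`, `build_tags`, `preprocess`) are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_overview(text):
    """Data cleaning: strip ASCII punctuation and lowercase the overview."""
    return text.translate(_PUNCT_TABLE).lower()


def build_tags(movie):
    """Tag generation: join overview words with genres, keywords, cast, crew."""
    parts = clean_overview(movie["overview"]).split()
    for column in ("genres", "keywords", "cast", "crew"):
        # Assumption: each of these fields is already a list of strings.
        parts.extend(movie.get(column, []))
    # Tag normalization: lowercase the combined string.
    return " ".join(parts).lower()


def preprocess(movies):
    """Drop movies whose overview is missing or empty, then attach tags."""
    kept = [m for m in movies if m.get("overview")]
    return [
        {"movie_id": m["movie_id"], "title": m["title"], "tags": build_tags(m)}
        for m in kept
    ]


# Made-up sample rows for illustration only.
sample_movies = [
    {"movie_id": 1, "title": "Example One",
     "overview": "A hero's journey, retold!",
     "genres": ["Adventure"], "keywords": ["quest"],
     "cast": ["Jane Doe"], "crew": ["John Roe"]},
    {"movie_id": 2, "title": "Example Two", "overview": None,
     "genres": ["Drama"], "keywords": [], "cast": [], "crew": []},
]
processed = preprocess(sample_movies)
print(processed[0]["tags"])
# a heros journey retold adventure quest jane doe john roe
```

The second sample movie is dropped because its overview is missing, which mirrors the missing-value step. Keeping the final lowercase call in `build_tags` means tags stay consistent even when the list fields contain mixed-case names.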