giorgossideris/movie_plots_nlp

Natural Language Processing project based on movie plots.

movie_plots_nlp

Purpose

This is an NLP (Natural Language Processing) project focused on movie plots. Different kinds of models will be created, each with a different objective.

Tools and Packages

The code was run in Jupyter. It is known to work with Jupyter Notebook 7.0.3 and Python 3.11.5.

Additional packages required for the project to run are:

pandas
scikit-learn
NumPy
matplotlib
SciPy
seaborn
nltk
spaCy (the en_core_web_md model of spaCy must also be installed by running python -m spacy download en_core_web_md)
imbalanced-learn (imported as imblearn)
xgboost
TensorFlow

All the packages above can be installed with pip install.
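As a quick check that the environment is ready, the following minimal sketch reports which of the packages above cannot be found. The list of import names and the helper `missing_modules` are illustrative and not part of the project; note that scikit-learn is imported as `sklearn`.

```python
from importlib.util import find_spec

# Top-level import names for the packages listed above.
# scikit-learn is imported as "sklearn"; the others use their usual names.
REQUIRED_MODULES = [
    "pandas", "sklearn", "numpy", "matplotlib", "scipy", "seaborn",
    "nltk", "spacy", "imblearn", "xgboost", "tensorflow",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    # find_spec only locates a top-level module; it does not import it.
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only checks that the packages are installed. It does not verify versions, and it does not check that the spaCy language model has been downloaded.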

Data

The data were obtained from Kaggle, specifically from the Wikipedia Movie Plots dataset.
This dataset contains the plots and genres of about 35,000 movies from around the world, scraped from Wikipedia.
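To get a first look at the genre distribution, a minimal sketch using only the standard library is shown below. It assumes the CSV has a header row with a column named `Genre`; that column name and the helper `genre_counts` are assumptions for illustration, so adjust them to match the actual file.

```python
import csv
from collections import Counter


def genre_counts(csv_path, genre_column="Genre"):
    """Count how many rows fall under each value of the genre column.

    The column name "Genre" is an assumption about the CSV header.
    """
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Normalize whitespace and case so "Drama" and " drama " match.
            counts[row[genre_column].strip().lower()] += 1
    return counts
```

For example, `genre_counts("./data/wiki_movie_plots_deduped.csv").most_common(10)` would list the ten most frequent genre labels, which helps when judging how imbalanced the classes are.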

Contents

Notebooks

data_preparation: prepares the data for the various machine learning tasks.
genre_classification: this task focuses on performing classification of movies into genres based on their plots, with a specific emphasis on the drama and comedy categories.

Data (in the `./data` folder)

wiki_movie_plots_deduped.csv: the original data as downloaded from Kaggle.

Data for classification (in the `./data/classification` folder)

This folder includes the data prepared for the classification task.

genre_encoding.pickle: the mapping from genre ids (int) to genre names (string).
genres_encoded.npy: a NumPy array containing the encoded genres (in the same order as the corresponding plots).
plots.npy: a NumPy array containing the movie plots.
cleaned_plots.npy: the plots from plots.npy after cleaning was applied.
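The files above could be loaded together along the following lines. This is a sketch under assumptions: `genre_encoding.pickle` is taken to hold a dict mapping integer ids to genre names, and the helpers `load_classification_data` and `decode_genres` are hypothetical names, not functions from the notebooks. If the arrays were saved with an object dtype, `allow_pickle=True` is needed.

```python
import pickle
from pathlib import Path

import numpy as np


def load_classification_data(folder="./data/classification", allow_pickle=False):
    """Load the cleaned plots, encoded genres, and the id-to-name mapping."""
    folder = Path(folder)
    # Only unpickle files from a source you trust.
    with open(folder / "genre_encoding.pickle", "rb") as handle:
        genre_encoding = pickle.load(handle)
    genres = np.load(folder / "genres_encoded.npy", allow_pickle=allow_pickle)
    plots = np.load(folder / "cleaned_plots.npy", allow_pickle=allow_pickle)
    # Plots and genres are stored in the same order, so lengths must match.
    if len(plots) != len(genres):
        raise ValueError("plots and genres have different lengths")
    return plots, genres, genre_encoding


def decode_genres(genres, genre_encoding):
    """Translate encoded genre ids back to names (assumes a dict of int -> str)."""
    return [genre_encoding[int(genre_id)] for genre_id in genres]
```

Calling `plots, genres, encoding = load_classification_data()` from the repository root and then `decode_genres(genres[:5], encoding)` would show the genre names for the first five plots.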