Natural Language Processing project based on movie plots.

movie_plots_nlp

Purpose

  • This is an NLP (Natural Language Processing) project focused on movie plots. Different kinds of models will be built, each with a different objective.

Tools and Packages

The code was run on Jupyter and is confirmed to work with Jupyter Notebook 7.0.3 and Python 3.11.5.

Additional packages required for the project to run are:

All the packages above can be installed with the pip install command.

Data

  • The data were obtained from Kaggle, specifically from the Wikipedia Movie Plots dataset.
  • This dataset contains the Plots and Genres of about 35,000 movies from around the world, scraped from Wikipedia.

Contents

Notebooks

  • data_preparation: includes the preparation of data for the various Machine Learning tasks.
  • genre_classification: classifies movies into genres based on their plots, with a specific emphasis on the drama and comedy categories.
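
To make the drama vs. comedy task concrete, here is a minimal, dependency-free sketch of plot-based genre classification. It uses a from-scratch multinomial naive Bayes with add-one smoothing, chosen only to keep the example short; it is not necessarily the model used in the genre_classification notebook. The function names (tokenize, train_naive_bayes, predict) and the toy plots are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a plot and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(plots, genres):
    """Collect the counts needed by a multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)  # genre -> word -> count
    doc_counts = Counter(genres)        # genre -> number of plots
    vocabulary = set()
    for plot, genre in zip(plots, genres):
        tokens = tokenize(plot)
        word_counts[genre].update(tokens)
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict(model, plot):
    """Return the genre with the highest log-posterior for one plot."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for genre, n_docs in doc_counts.items():
        total_words = sum(word_counts[genre].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokenize(plot):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            count = word_counts[genre][token]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[genre] = score
    return max(scores, key=scores.get)


# Toy training data, invented for this example (not taken from the dataset).
train_plots = [
    "A grieving widow struggles with loss and a family secret.",
    "A dying father seeks forgiveness from his estranged son.",
    "Two clumsy friends cause hilarious chaos at a wedding.",
    "A bumbling waiter is mistaken for a spy in a silly mix-up.",
]
train_genres = ["drama", "drama", "comedy", "comedy"]

model = train_naive_bayes(train_plots, train_genres)
comedy_guess = predict(model, "A clumsy spy causes chaos at a wedding")
drama_guess = predict(model, "A widow seeks forgiveness from her estranged family")
print(comedy_guess, drama_guess)
```

With real data, the training lists would be replaced by the prepared plots and genre labels, and a held-out split would be needed to measure accuracy.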

Data (included in the ./data folder)

  • wiki_movie_plots_deduped.csv: the initial data as downloaded from Kaggle.
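
As a rough illustration of how this file might be read, the sketch below pulls the plot and genre of each movie out of a CSV using only the standard library. The column names "Plot" and "Genre" are assumptions based on the dataset description above, the helper read_plots_and_genres is hypothetical, and lowercasing the genre is just one possible normalization. The in-memory sample stands in for the real file so the example runs on its own.

```python
import csv
import io
from collections import Counter


def read_plots_and_genres(csv_file, plot_column="Plot", genre_column="Genre"):
    """Read plots and genres from an open CSV file.

    The default column names are assumptions; adjust them to the real header.
    """
    plots, genres = [], []
    for row in csv.DictReader(csv_file):
        plots.append(row[plot_column])
        # Normalize genre labels so "Drama" and "drama" count as one genre.
        genres.append(row[genre_column].strip().lower())
    return plots, genres


# Tiny made-up sample standing in for the real dataset.
sample_csv = io.StringIO(
    "Title,Genre,Plot\n"
    "Film A,drama,A family falls apart.\n"
    'Film B,comedy,"Two friends, one wedding."\n'
    "Film C,Drama,A soldier returns home.\n"
)
plots, genres = read_plots_and_genres(sample_csv)
genre_counts = Counter(genres)
print(genre_counts.most_common())

# With the real file (path taken from this README):
# with open("./data/wiki_movie_plots_deduped.csv", newline="", encoding="utf-8") as f:
#     plots, genres = read_plots_and_genres(f)
```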

Data for classification (included in the ./data/classification folder)

This folder includes the data prepared for the classification task.

  • genre_encoding.pickle: the encoding of genre ids (int) to genre names (string).
  • genres_encoded.npy: a numpy array including the encoding of the genres (in the same order as the corresponding plots).
  • plots.npy: a numpy array including the movie plots (in the same order as the encoded genres).
  • cleaned_plots.npy: the plots from plots.npy after cleaning was applied.
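
The files above could be loaded along the following lines. This is a sketch under stated assumptions: genre_encoding.pickle is assumed to hold a dict mapping int ids to genre names, the arrays are assumed to be aligned row by row, and allow_pickle=True is passed in case the plots were saved as an object array. The helper load_classification_data is hypothetical, and the demo writes small made-up files to a temporary folder so it runs without the real data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_classification_data(data_dir):
    """Load the genre encoding, encoded genres, and cleaned plots."""
    data_dir = Path(data_dir)
    with open(data_dir / "genre_encoding.pickle", "rb") as f:
        genre_encoding = pickle.load(f)  # assumed: {int id: str name}
    # allow_pickle=True is only needed if the arrays have dtype=object.
    genres_encoded = np.load(data_dir / "genres_encoded.npy", allow_pickle=True)
    cleaned_plots = np.load(data_dir / "cleaned_plots.npy", allow_pickle=True)
    return genre_encoding, genres_encoded, cleaned_plots


# Demo with made-up stand-in files; the real ones live in ./data/classification.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    with open(tmp_dir / "genre_encoding.pickle", "wb") as f:
        pickle.dump({0: "comedy", 1: "drama"}, f)
    np.save(tmp_dir / "genres_encoded.npy", np.array([1, 0, 1]))
    np.save(tmp_dir / "cleaned_plots.npy",
            np.array(["plot one", "plot two", "plot three"]))
    genre_encoding, genres_encoded, cleaned_plots = load_classification_data(tmp_dir)

# Map each encoded genre id back to its name.
genre_names = [genre_encoding[int(i)] for i in genres_encoded]
print(list(zip(genre_names, cleaned_plots)))
```

Pickle files and pickled arrays can execute code when loaded, so only load files from a source you trust.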