- This is an NLP (Natural Language Processing) project focused on movie plots. Several kinds of models are built, each with a different objective.
The code was run on Jupyter (Jupyter Notebook 7.0.3 and Python 3.11.5 are known to work).
Additional packages required for the project to run are:
- pandas
- scikit-learn
- NumPy
- matplotlib
- SciPy
- seaborn
- nltk
- spaCy (the `en_core_web_md` model of spaCy must also be installed by running `python -m spacy download en_core_web_sm`)
- imblearn
- xgboost
- TensorFlow
All the packages above can be installed using the `pip install` command.
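
Before opening the notebooks, it can help to confirm that the packages listed above are visible to the Python environment. The following is a minimal sketch, not part of the project itself: the helper `missing_modules` is a hypothetical name, and the import names are assumptions about how each package is imported (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the packages listed above.
# Assumption: these are the top-level module names each package exposes
# (scikit-learn -> "sklearn", Tensorflow -> "tensorflow").
REQUIRED_MODULES = [
    "pandas", "sklearn", "numpy", "matplotlib", "scipy", "seaborn",
    "nltk", "spacy", "imblearn", "xgboost", "tensorflow",
]


def missing_modules(modules):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Any name reported as missing can then be installed with `pip install`. The check only looks for the module and does not verify versions or the spaCy model.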
- The data were obtained from Kaggle, specifically the Wikipedia Movie Plots dataset.
- In more detail, this dataset contains the plots and genres of about 35,000 movies from around the world, scraped from Wikipedia.
- `data_preparation`: prepares the data for the various Machine Learning tasks.
- `genre_classification`: classifies movies into genres based on their plots, with a specific emphasis on the drama and comedy categories.
- `wiki_movie_plots_deduped.csv`: the initial data as taken from Kaggle.
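
As an illustration of how the raw file could be narrowed to the drama and comedy categories, here is a rough sketch. It is not the notebooks' actual code: `select_genres` is a hypothetical helper, and the column names `Genre` and `Plot` are assumptions about the CSV header.

```python
import pandas as pd


def select_genres(df, genres=("drama", "comedy"), genre_col="Genre"):
    """Keep rows whose genre label exactly matches one of `genres`.

    Labels are compared after trimming whitespace and lower-casing.
    Assumption: the genre column is named "Genre".
    """
    labels = df[genre_col].astype(str).str.strip().str.lower()
    return df[labels.isin(genres)].reset_index(drop=True)


# Usage, assuming the CSV sits in the working directory:
# movies = pd.read_csv("wiki_movie_plots_deduped.csv")
# subset = select_genres(movies)
```

Exact matching drops multi-label entries such as "comedy, drama"; whether those should be kept is a design choice this sketch does not settle.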
This folder includes the data prepared for the classification task.
- `genre_encoding.pickle`: the mapping from genre ids (int) to genre names (string).
- `genres_encoded.npy`: a NumPy array containing the encoded genres (in the same order as the corresponding plots).
- `plots.npy`: a NumPy array containing the plots.
- `cleaned_plots.npy`: the plots of the `plots.npy` file after cleaning was applied.
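
The files above could be loaded together along these lines. This is a sketch under stated assumptions: `load_prepared_data` is a hypothetical helper, the folder path is supplied by the caller, the pickle is assumed to hold a dict from genre id to genre name, and the arrays are assumed to be readable with `allow_pickle=True` (which should only be used on files you trust).

```python
import pickle
from pathlib import Path

import numpy as np


def load_prepared_data(folder, allow_pickle=True):
    """Load the genre encoding, the encoded genres, and the cleaned plots.

    Assumption: all three files live directly inside `folder`.
    """
    folder = Path(folder)
    with open(folder / "genre_encoding.pickle", "rb") as fh:
        genre_encoding = pickle.load(fh)  # assumed: dict of int id -> genre name
    genres = np.load(folder / "genres_encoded.npy", allow_pickle=allow_pickle)
    plots = np.load(folder / "cleaned_plots.npy", allow_pickle=allow_pickle)
    # The genres are stored in the same order as the plots, so lengths must match.
    if len(genres) != len(plots):
        raise ValueError("genres and plots have different lengths")
    return genre_encoding, genres, plots
```

With the data loaded, `genre_encoding[genres[i]]` would give the genre name for the plot at index `i`, provided the pickle really is such a mapping.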