Machine learning project to classify IMDb movie reviews, with binary and multi-class classification
This practical work solves a set of tasks on a dataset of IMDb movie review texts. The sklearn library was used.
- The data contains 40,000 text documents, each rated on a scale from one to ten; neutral reviews, with scores of 5 and 6, were excluded.
- To apply machine learning techniques such as classification and clustering to text data, each document must be represented by a numerical vector. For this, the Bag of Words model and the tf-idf method were used.
- Based on the text documents, three main tasks were carried out:
  - Determine whether the review is positive or negative (binary classification).
  - Predict the score of the review, a value in 1-4 or 7-10 (multi-class classification).
  - Find, through words shared between texts, groups of documents that address similar areas or themes (clustering).
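
The vectorization step and the three tasks above can be sketched with sklearn as follows. This is a minimal illustration, not the project's actual code: the four toy reviews and their scores are made up, and the choice of `LogisticRegression` and `KMeans` is an assumption, since the text only states that sklearn, Bag of Words and tf-idf were used.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the 40,000 reviews.
reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "a terrible film with an awful script",
    "awful pacing and terrible acting",
]
scores = [9, 8, 2, 1]  # neutral scores (5 and 6) do not occur

# Binary target: 1 = positive (score >= 7), 0 = negative (score <= 4).
labels = [1 if s >= 7 else 0 for s in scores]

# Bag of Words counts reweighted by tf-idf: one numerical row per document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# Task 1: binary classification (positive vs. negative).
binary_clf = LogisticRegression().fit(X, labels)

# Task 2: multi-class classification (predict the score itself).
score_clf = LogisticRegression().fit(X, scores)

# Task 3: clustering documents by shared vocabulary (assumed: 2 clusters).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Classify an unseen review with the same fitted vectorizer.
new_review = vectorizer.transform(["wonderful cast and great acting"])
predicted_label = binary_clf.predict(new_review)[0]
print("predicted label:", predicted_label)
print("cluster assignments:", kmeans.labels_)
```

New documents must go through `vectorizer.transform`, not `fit_transform`, so that they are mapped onto the vocabulary learned from the training texts. On the real data, a train/test split would be needed to measure how well each model generalizes.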