Final year project on NLP

Primary language: Jupyter Notebook

License: GNU General Public License v3.0 (GPL-3.0)

Spam Movie Reviews Detection through Supervised Learning

Chun Sang Au Yong and Wing Yan Wong

In recent years, growing demand for movies, particularly during a pandemic that promoted indoor entertainment, has led to a surge in the movie industry. However, this growth has also given rise to spam reviews. Paid posters, employed by movie distributors to promote their own movies or attack competitors', post false opinions to manipulate audiences. Positive opinions result in significant financial gain, while negative opinions often lead to box office losses [1, 2]. Consequently, review platforms struggle to provide trustworthy guidance to moviegoers. In China, this phenomenon is especially critical due to the prevalence of the Internet Water Army, a term for the paid posters common across websites [3].

Previous research on spam review detection for commercial products has been fruitful, but few studies have focused primarily on movie reviews, which have distinctive word distributions and features. Among studies of movie reviews, even fewer have examined Chinese reviews. Moreover, the outputs of previous research cannot effectively combat the prevalence of spam movie reviews in real-life settings [4].

Our project centers on spam reviews on a Chinese movie review site, Douban Movie (www.movie.douban.com), which is notorious for pervasive spam reviews that have drawn widespread media coverage [5, 6]. By scraping and labelling reviews on the platform, we obtained a dataset for feature engineering, exploratory data analysis, and model training. In Phase 1, we achieved a respectable testing accuracy of over 80%, but there was still room for improvement in both features and models.

In Phase 2, we improved data pre-processing by expanding the tokenisation dictionary to include trending slang, movie names, and cast names. We also enhanced feature extraction by adding linguistics-based features, such as readability, expressiveness, and lexical diversity, which proved useful in spam detection. We improved the sentiment score feature by defining a custom dictionary specific to review text, allowing for more effective identification of emotional polarity in movie reviews.
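The Phase 2 ideas above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the tokeniser here is a simple forward maximum matching routine used as a stand-in, lexical diversity is computed as a plain type-token ratio, and the dictionary entries, the polarity values, and the function names `tokenise`, `lexical_diversity`, and `sentiment_score` are all hypothetical choices made for illustration.

```python
def tokenise(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def lexical_diversity(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sentiment_score(tokens, polarity):
    """Mean polarity of the tokens found in the sentiment dictionary (0.0 if none)."""
    hits = [polarity[t] for t in tokens if t in polarity]
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical dictionaries: a small base vocabulary, then the same vocabulary
# extended with a movie title and a slang term, as described for Phase 2.
base_dict = {"电影", "好看"}
extended_dict = base_dict | {"流浪地球", "神作"}

# Hypothetical review-specific polarity values (assumed, not from the project).
polarity = {"神作": 1.0, "好看": 0.5}

review = "流浪地球是神作电影很好看"
base_tokens = tokenise(review, base_dict)
extended_tokens = tokenise(review, extended_dict)

print(base_tokens)
print(extended_tokens)
print(lexical_diversity(extended_tokens))
print(sentiment_score(base_tokens, polarity), sentiment_score(extended_tokens, polarity))
```

With the base dictionary, the movie title and the slang term are split into single characters, so the slang term never reaches the sentiment lookup. With the extended dictionary both survive as whole tokens, which is the effect the dictionary expansion is meant to achieve.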

Furthermore, we explored the potential of neural networks in detecting spam reviews, comparing their performance to the top-performing models from our previous study. Although our results with traditional machine learning methods were fruitful, we sought to examine the ability of neural networks to learn features automatically, which could yield better performance, although the small dataset makes them prone to overfitting.

By introducing AI into spam movie review detection, we aim to minimize the presence of spam opinions and restore confidence in review platforms. This is of paramount importance to both consumers and producers, as genuine reviews serve as honest guides for moviegoers and allow film production companies to better assess movie reception through analytics without the noise created by spam reviews.

This project makes several contributions to the field of spam review detection. First, we created a labelled dataset containing 1,600 realistic reviews, enabling further research using supervised models. Second, our enhanced tokenisation and features not only improve existing classifiers for practical use but also offer insights into psychology and linguistics. Third, by comparing the performance of various models, particularly classical machine learning models and neural networks, our project can help create more sophisticated spam filters for movie review websites and identify the most effective model in real-life settings. Additionally, researchers from other disciplines can carry out follow-up research on the underlying phenomena behind spam reviews, such as linguistic features unique to spam text, the psychology of deception, and online conversation analysis.