Data Science Final Project.

Task Overview.

Assignment.

Build a model for binary (positive / negative) sentiment analysis of movies text reviews. Two datasets train.csv and test.csv were given for model's training and evaluation respectively.

Code overview.

Initial data analysis and experiments were done in the notebooks (./notebooks/ folder):

  • EDA.ipynb
  • data_preprocessing.ipynb
  • models_training_and_evaluation.ipynb

./src folder contains two main folders for building and executing Docker containers for models training (./src/train/train.py) and evaluation (./src/inference/run_interference.py). Supporting utils package with listed scripts was created:

  • config_loading.py: loads necessary sources for initial datasets and defines directories for future models, predictions, and metrics storage from sources.cfg.
  • data_loading.py: loads and unpacks raw data.
  • data_preprocessing.py: set of util functions used for text preprocessing (including features creation, tokenization, lemmatization & stemming, vectorization, etc.)

In ./outputs/predictions/ folder for each of three considered models metrics and predicted datasets are stored. Pickled models are stored in ./outputs/models/ directory. Both ./data and ./outputs directories will be used as volumes for train and test Docker containers.

Pipeline Execution.

For convenient directories definition sources.cfg file must be created with the following structure (links must be filled, default directories names could be left as is):

[urls]
TRAIN_DATA_URL = # Link for downloading final_project_train_dataset.zip 
TEST_DATA_URL = # Link for downloading final_project_test_dataset.zip

[dirs]
# Default directories in which corresponding data will be stored
PROCESSED_DATA_DIR = /usr/dsapp/data/processed/  
RAW_DATA_DIR = /usr/dsapp/data/raw/

MODELS_DIR = /usr/dsapp/outputs/models/
PREDICTIONS_DIR = /usr/dsapp/outputs/predictions/

Execution of container responsible for loading data, models training and storing the following command should be executed:

docker build -t models-train-image -f ./src/train/Dockerfile . 
docker run --volume=$pwd\data\raw:/usr/dsapp/data/raw  \
           --volume=$pwd\data\processed:/usr/dsapp/data/processed \
           --volume=$pwd\outputs\models:/usr/dsapp/outputs/models \
           --network=bridge \
           --runtime=runc -d models-train-image

For models evaluation the following commands will create and run required container:

docker build -t models-test-image -f ./src/inference/Dockerfile . 
docker run --volume=$pwd\outputs\predictions:/usr/dsapp/outputs/predictions  \
           --volume=$pwd\outputs\models:/usr/dsapp/outputs/models \
           --volume=$pwd\data\processed:/usr/dsapp/data/processed \
           --network=bridge \
           --runtime=runc -d models-test-image

Exploratory Data Analysis.

General dataset characteristics.

  • 40'000 rows in training dataset in total. There are 272 duplicated rows and 266 duplicated reviews in total.
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count unique top freq
review 40000 39728 Loved today's show!!! It was a variety and not... 5
sentiment 40000 2 positive 20000
  • There are no null or empty string values in both review and sentiment columns.

  • There are no same reviews with different sentimental, therefore no logical inconsistency persists in train dataset.

  • There are equal amount of positive and negative sentiments (20'000 rows for both). png

Feature Engineering and Text Analysis.

The following numerical features of reviews were investigated:

'number_of_words'
'number_of_chars'
'percentage_of_signs'
'number_of_excl_marks'
'number_of_question_marks'
'number_of_ellipses'
'number_of_uppercase_words'
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count mean std min 25% 50% 75% max
number_of_words 40000.0 231.362750 171.083908 4.000000 126.000000 173.00000 282.000000 2470.000000
number_of_chars 40000.0 1310.549450 987.955229 41.000000 699.000000 971.00000 1595.000000 13704.000000
percentage_of_signs 40000.0 21.977625 1.825969 11.764706 20.805369 21.83136 22.940277 87.311178
number_of_excl_marks 40000.0 0.971950 2.957310 0.000000 0.000000 0.00000 1.000000 282.000000
number_of_question_marks 40000.0 0.645175 1.495052 0.000000 0.000000 0.00000 1.000000 35.000000
number_of_ellipses 40000.0 0.499400 1.580463 0.000000 0.000000 0.00000 0.000000 48.000000
number_of_uppercase_words 40000.0 4.878900 5.585357 0.000000 1.000000 3.00000 6.000000 151.000000
  • Length of characters / words in review are skewed left. Mean number of words in review is 231 and mean number of chars -- 1310. While maximum number of chars is more than 13'000. Maximum number of words is 2'500.

png png

  • There is no high correlation between them (considering Pearson, Kendall and Spearman correlation coefficients), except obvious dependency between number of words and characters.

  • Average ratio of non-alphabetical chars in review is 21% (which is pretty high). Since in texts appears fragments <br />, blablablabla+, >>>>>>>, *[word]*, ........, ?[word]?, [word]-[word]-[word] and other noise, therefore number of particular characters / marks were kept as numerical features, and all other non-alphabetical signs were removed during text preprocessing. png

  • Were obtained, that number_of_ellipses and number_of_question_marks are higher for negative sentiments. For other numerical_review_features box-plots showed no significant difference in distributions between positive and negative sentiments.

<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
sentiment negative positive
number_of_words count 20000.000000 20000.000000
mean 230.186850 232.538650
std 165.642483 176.353828
min 4.000000 10.000000
25% 128.000000 125.000000
50% 174.000000 172.000000
75% 280.000000 284.000000
max 1522.000000 2470.000000
number_of_chars count 20000.000000 20000.000000
mean 1298.143300 1322.955600
std 950.224379 1024.170719
min 41.000000 65.000000
25% 705.000000 692.000000
50% 974.000000 968.000000
75% 1576.000000 1614.000000
max 8969.000000 13704.000000
percentage_of_signs count 20000.000000 20000.000000
mean 22.163721 21.791530
std 1.776076 1.856013
min 11.764706 14.925373
25% 20.985011 20.629241
50% 22.025873 21.639938
75% 23.125000 22.758413
max 38.847858 87.311178
number_of_excl_marks count 20000.000000 20000.000000
mean 1.009400 0.934500
std 2.540263 3.322057
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 1.000000 1.000000
max 70.000000 282.000000
number_of_question_marks count 20000.000000 20000.000000
mean 0.905000 0.385350
std 1.825881 1.000802
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 1.000000 0.000000
max 35.000000 16.000000
number_of_ellipses count 20000.000000 20000.000000
mean 0.599500 0.399300
std 1.664343 1.485183
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 1.000000 0.000000
max 48.000000 48.000000
number_of_uppercase_words count 20000.000000 20000.000000
mean 5.171500 4.586300
std 5.608267 5.547079
min 0.000000 0.000000
25% 2.000000 1.000000
50% 4.000000 3.000000
75% 7.000000 6.000000
max 151.000000 126.000000

png

Text Preprocessing.

Actions Preformed.

Removal of noise and non-alphabetical characters.

  1. Removal of reviews-outliers, which length is not inside IRQ. Total number of outliers is 2958, and number of reviews left for models training 36770.
  2. Removal of HTML-tags (<br />).
  3. Removal of punctuation string.punctuation.
  4. Removal of nltk.corpus.stopwords.words('english').
  5. Removal of all digits.
  6. Removal of emojis and non-printable characters (all characters left are in string.printable).

Text tokenization.

Text tokenization was preformed with nltk.tokenize.word_tokenize.

Lemmatization.

In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighbouring sentences or even an entire document.

def lemmatize_words(
    text: list
):
    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')
    nltk.download('averaged_perceptron_tagger_eng')
    
    lemmatizer = WordNetLemmatizer()
    wordnet_map = {
        "N": wordnet.NOUN, 
        "V": wordnet.VERB, 
        "J": wordnet.ADJ, 
        "R": wordnet.ADV
    }
    pos_tagged_text = nltk.pos_tag(text)
    return [lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text]


reviews_df['lemmatized_review'] = reviews_df['tokenized_review'].parallel_apply(lemmatize_words)

Stemming.

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

def stem_words(
    text
):
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    
    return [stemmer.stem(word) for word in text]


reviews_df['stemmed_review'] = reviews_df['tokenized_review'].parallel_apply(stem_words)

Lemmatization and Stemming comparison.

  • Number of unique words were produced after both approaches.
Number of unique stemmed words: 93786
[('movi', 70630),
 ('film', 59314),
 ('one', 33908),
 ('like', 28273),
 ('good', 19876),
 ('time', 19668),
 ('watch', 19329),
 ('see', 18310),
 ('make', 18160),
 ('get', 17379)]
 
Number of unique lemmatized words: 138953
[('movie', 69502),
 ('film', 58016),
 ('one', 30981),
 ('make', 27532),
 ('see', 26538),
 ('like', 26363),
 ('get', 21789),
 ('good', 21553),
 ('time', 19026),
 ('watch', 17412)]
  • Number of short words with length less than 3.
Number of stemmed words with length less than two: 534
[(11162, 'go'),
 (6299, 'im'),
 (4487, 'us'),
 (3739, 'tv'),
 (3235, 'he'),
 (2411, 'aw'),
 (1979, 'mr'),
 (1804, 'ye'),
 (1794, 'id'),
 (1746, 'oh'),
 (1372, 'ok'),
 (983, 'ad'),
 (915, 'th'),
 (857, 'dr'),
 (813, 'b'),
 (740, 'la'),
 (737, 'de'),
 (537, 'na'),
 (513, 'of'),
 (505, 'ed')]
 
Number of lemmatized words with length less than two: 931
[(16095, 'go'),
 (5953, 'Im'),
 (3878, 'do'),
 (3816, 'u'),
 (3494, 'TV'),
 (2174, 'he'),
 (1671, 'Id'),
 (1631, 'Mr'),
 (1200, 'Oh'),
 (1166, 'OK'),
 (895, 'th'),
 (851, 'US'),
 (824, 'Dr'),
 (678, 'B'),
 (522, 'na'),
 (492, 'oh'),
 (488, 'OF'),
 (464, 'Ed'),
 (427, 'Ms'),
 (424, 'II')]
  • Due to higher number of distinct words, including shorted words, for models training stemmed reviews will be used.

Vectorization.

For future models training and evaluation we have prepared two vectorized versions using simple Count Vectorizer and TF-IDF.

def vectorize_review(
        df: pd.DataFrame,
        processed_text_col_name: str,
        train_df_len: int,
        vectorizer
):
    vectorized_data = vectorizer.fit_transform(df[processed_text_col_name])
    return train_test_split(
        vectorized_data,
        df['sentiment'],
        test_size=train_df_len,
        shuffle=False
    )

Count Vectorizer.

The count vectorizer is a customizable SciKit Learn preprocessor method. It works with any text out of the box, and applies preprocessing, tokenization and stop words removal on its own. These tasks can be customized, for example by providing a different tokenization method or stop word list. (This applies to all other preprocessors as well.) Applying the count vectorizer to raw text creates a matrix in the form of (document_id, tokens) in which the values are the token count.

count_vectorizer = CountVectorizer()
count_X_train, count_X_test, count_y_train, count_y_test = vectorize_review(
    df=general_df,
    processed_text_col_name='stemmed_review',
    train_df_len=len(train_df),
    vectorizer=count_vectorizer
)

TF-IDF Vectorizer.

The Term Frequency/Inverse Document Frequency is a well-known metric in information retrieval. It encodes word frequencies in such a way as to put equal weight to common terms that occur in many documents, as well as uncommon terms only present in a few documents. This metric generalizes well over large corpora and improves finding relevant topics.

tfidf_vectorizer = TfidfVectorizer()
tfidf_X_train, tfidf_X_test, tfidf_y_train, tfidf_y_test = vectorize_review(
    df=general_df,
    processed_text_col_name='stemmed_review',
    train_df_len=len(train_df),
    vectorizer=tfidf_vectorizer
)

Models Training.

There were three models chosen appropriate for binary sentiment analysis: SVM with linear kernel, Logistic Regression and Bernoulli Naive Bayes. All of them were trained and evaluated for best approach choosing. Main function for models training:

def train_model(
        X_train,
        Y_train,
        classifier
):
    classifier.fit(X_train, Y_train)
    joblib.dump(classifier, f'{MODELS_DIR}{classifier.__class__.__name__}.pkl')
    return classifier

Bernoulli Naive Bayes.

Bernoulli Naive Bayes is a variant of Naive Bayes that is particularly suited for binary/boolean features. The model is based on the Bayes Theorem and assumes that all features are independent given the class. In text classification, this model is typically used with binary feature vectors (rather than counts or TF-IDF features) which indicate the presence or absence of a word. For sentiment analysis, each word in the vocabulary is treated as a feature and it contributes independently to the probability that the sentiment is positive or negative.

Advantages:

  • Handling of Binary Data: It works well with binary feature models which are common in text processing where presence or absence of words is a useful feature.
  • Scalability and Speed: The independence assumption simplifies computation, making this model very efficient and scalable to large datasets.
  • Performance: Despite its simplicity and the strong independence assumption, Bernoulli Naive Bayes can perform surprisingly well on sentiment analysis tasks, especially when the dataset is large.
BNB = BernoulliNB()
bernoulli_nb = train_model(
    count_X_train,
    count_y_train,
    BNB
)

SVM.

Support Vector Machine (SVM) is a powerful and versatile supervised machine learning algorithm used for classification and regression tasks. In the context of text sentiment analysis, SVM with a linear kernel is particularly useful. The linear kernel is a dot product between two instances, and it offers a straightforward linear decision boundary. The main idea behind SVM is to find the optimal hyperplane that maximally separates the classes in the feature space. For binary classification, such as positive/negative sentiment analysis, SVM focuses on constructing the hyperplane that has the largest distance to the nearest training data points of any class, which are called support vectors. This margin maximization offers robustness, especially in high-dimensional spaces.

Advantages:

  • Effectiveness in High-Dimensional Spaces: SVMs are particularly effective in high-dimensional spaces, which is typical in text data due to the large vocabulary size.
  • Robustness: The margin maximization principle helps SVMs to be robust against overfitting, especially in linearly separable cases.
  • Scalability: With linear kernels, SVMs can scale relatively well to large text datasets.
scaler = StandardScaler(with_mean=False)
scaler.fit(count_X_train)

norm_count_X_train = scaler.transform(count_X_train)
norm_count_X_test = scaler.transform(count_X_test)
norm_count_y_train = count_y_train.apply(lambda x: 1 if x == 'positive' else 0)
norm_count_y_test = count_y_test.apply(lambda x: 1 if x == 'positive' else 0)

SVM = svm.SVC(kernel='linear')
SVM = train_model(
    norm_count_X_train,
    norm_count_y_train,
    SVM
)

Logistic Regression.

Logistic Regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. In the context of sentiment analysis, the probabilities describing the possible outcomes of a single trial are modeled as a function of the predictor variables (text features). Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

Advantages:

  • Interpretability: Unlike more complex models, logistic regression models have the advantage of being interpretable. Each feature’s weights indicate the importance and influence on the sentiment classification.
  • Efficiency: Logistic regression is less computationally intensive than more complex algorithms, making it a relatively fast model to train.
  • Probabilistic Interpretation: The model outputs a probability for the sentiment class, which can be a useful measure of confidence in the classification.
LG = LogisticRegression()
logistic_regression = train_model(
    tfidf_X_train,
    tfidf_y_train,
    LG
)

Models Evaluation.

Metrics were considered for models evaluation:

  • sklearn.metrics.accuracy_score.
  • sklear.metrics.confusion_matrix.
  • Weighted average for precision, recall, f1-score and support (using sklearn.metrics.classification_report).
  • Time required for model's training and evaluation.

Bernoulli Naive Bayes.

  • Accuracy score is: 85.13%.
  • CPU times: 1.53 s.
  • Wall time: 2.6 s.
  • Confusion matrix:

png

  • Classification report:
              precision    recall  f1-score   support

    negative       0.84      0.87      0.85     18442
    positive       0.86      0.83      0.85     18328

    accuracy                           0.85     36770
   macro avg       0.85      0.85      0.85     36770
weighted avg       0.85      0.85      0.85     36770

SVM.

  • Accuracy score is: 81.95%.
  • CPU times: 3min 17s.
  • Wall time: 4min 1s.
  • Confusion matrix:

png

  • Classification report:
              precision    recall  f1-score   support

           0       0.83      0.81      0.82     18442
           1       0.81      0.83      0.82     18328

    accuracy                           0.82     36770
   macro avg       0.82      0.82      0.82     36770
weighted avg       0.82      0.82      0.82     36770

Logistic Regression.

  • Accuracy score is: 85.69%.
  • CPU times: 2.12 s.
  • Wall time: 2.65 s.
  • Confusion matrix:

png

  • Classification report:
              precision    recall  f1-score   support

    negative       0.88      0.86      0.87     18442
    positive       0.86      0.88      0.87     18328

    accuracy                           0.87     36770
   macro avg       0.87      0.87      0.87     36770
weighted avg       0.87      0.87      0.87     36770

Conclusion.

Performance Evaluation.

Among all evaluated models, the worst overall performance was for SVM with linear kernel. Both training and testing time was ~90 times more than for other models (additional time for standardization was not included). Given accuracy is 81.95%, while Naive Bayes and Logistic Regression performance were above 85%. From two other models as a working one Logistic Regression was selected due to the following considerations:

  • Clarity in Decision-Making: Logistic Regression provides coefficients for each feature (word or phrase in this context), indicating the strength and direction of their impact on the sentiment. This interpretability is crucial for understanding which aspects of the reviews most influence the sentiment, allowing for more informed decision-making and adjustments in strategy.
  • Threshold Adjustment: The output of Logistic Regression is a probability, providing a nuanced view of sentiment beyond simple binary classifications. This allows for threshold tuning based on business needs, such as prioritizing precision over recall (or vice versa).
  • Quick Deployment: Logistic Regression is generally less computationally intensive than models like SVM with non-linear kernels or deep learning models. This efficiency facilitates quicker retraining cycles and easier deployment, which is beneficial in dynamic environments where models need frequent updates.

Business Applications.

Chosen Logistic Regression model for binary sentiment analysis of movie reviews could be used and adjusted for the following business purposes:

  • Audience Sentiment Tracking: Understand public sentiment toward movie releases, promotional campaigns, or other media content. This can guide marketing strategies and content adjustments.
  • Recommendation Systems: Enhance user experience on streaming platforms by recommending movies based on the sentiment of reviews they find aligning with their preferences.