Machine Learning and Real World data

https://www.cl.cam.ac.uk/teaching/2122/MLRD/

Sentiment Classification (NLP)
Hidden States Prediction (Hidden Markov Model)

Dependency

NLTK-Natural Language Toolkit IDE: PyCharm

Sentiment Lexicon Database

Given a list of tokens from a tokenized review and a lexicon containing both sentiment and magnitude of a word, determine whether the sentiment of each review in the test set is positive or negative based on whether there are more positive or negative words.
Classification: label Lexicons into postive and negative.
Evaluation: Based on reviews on IMDb.
Improve the classifier using thresholds for decision bounds.

Your accuracy: 0.6355555555555555
Your improved accuracy: 0.6888888888888889

Naive Bayes Classifier

Parameter estimation
How to deal with a word in a review was not present in the training dataset?\ Ignore its contribution or using add-one (Laplace) Smoothing

Your accuracy using simple classifier: 0.63
Your accuracy using unsmoothed probabilities: 0.49
Your accuracy using smoothed probabilities: 0.795

Zipf’s Law and Heaps’ Law

Zipf’s law says that there is a reverse exponential relationship between the frequency of a word (fw) in a large natural language text, and its relative frequency rank (rw; the ranking of its frequency in comparison with other words’ frequencies)
Heaps’ law relates the number of distinct words in a text to the overall number of words in the text.

Statistical Significance Testing

Modify the simple classifier to include the information about the magnitude of a sentiment.
A word with a strong intensity should be weighted four times as high for the evaluator.
Implement the two-sided sign test algorithm to determine if one classifier is significantly better or worse than another. The sign for a result should be determined by which classifier is more correct and the ceiling of the least common sign total should be used to calculate the probability.

The p-value of the two-sided sign test for classifier_a "classifier simple" and classifier_b "classifier magnitude": 0.6722499772048186
The p-value of the two-sided sign test for classifier_a "classifier magnitude" and classifier_b "naive bayes classifier": 0.07683763213126037

Cross-Validation and Evaluation Sets

Training and Evaluation Sets

Random: Assign the datapoints to folds randomly.
Stratified random: Assign the datapoints to folds randomly but make sure that there is the same number of positive and negative reviews in each fold.

Random cross validation accuracies: [0.8, 0.8111111111111111, 0.8444444444444444, 0.7888888888888889, 0.8222222222222222, 0.8277777777777777, 0.8222222222222222, 0.8055555555555556, 0.8333333333333334, 0.7888888888888889]
Random cross validation mean accuracy: 0.8144444444444444
Random cross validation variance: 0.00031604938271604933

Stratified cross validation accuracies: [0.8055555555555556, 0.8388888888888889, 0.8555555555555555, 0.8666666666666667, 0.8222222222222222, 0.7611111111111111, 0.8388888888888889, 0.7944444444444444, 0.8388888888888889, 0.7944444444444444]
Stratified cross validation mean accuracy: 0.8216666666666667
Stratified cross validation variance: 0.0009472222222222234

Smoothed Naive Bayes accuracy on held-out data: 0.84
Confusion matrix:
             ACTUAL
          | pos | neg |
     -----+-----+-----+
      pos |  92 |  24 |
PRED -----+-----+-----+
      neg |   8 |  76 |
     -----+-----+-----+

Smoothed Naive Bayes accuracy on 2016 data: 0.8333333333333334
Confusion matrix:
             ACTUAL
          | pos | neg |
     -----+-----+-----+
      pos |   8 |   0 |
PRED -----+-----+-----+
      neg |   4 |  12 |
     -----+-----+-----+

Simple Sentiment Classifier performance accuracy on held-out data: 0.615
Simple Sentiment Classifier performance accuracy on 2016 data: 0.7916666666666666
P-value of significance test between NB and Simple Classifier on 2016 data: 1.0

Nuanced Classifier

Agreement table; Calculate Fleiss’ kappa

Your accuracy on the nuanced dataset: 0.5926851851851852

Agreement table for this year.
                   REVIEW
          |  1  |  2  |  3  |  4  |
     -----+-----+-----+-----+-----+
      pos |  63 |   6 |   2 | 102 |
SENT -----+-----+-----+-----+-----+
      neg |  42 |  99 | 103 |   3 |
     -----+-----+-----+-----+-----+
    
The cohen kappa score for the review predictions is 0.6454024550722497.
The cohen kappa score for the review predictions of review 1 and 2 is 0.3275457197751368.
The cohen kappa score for the review predictions of review 3 and 4 is 0.906218599977671.

Agreement table for the years 2019 to 2022.
                   REVIEW
          |  1  |  2  |  3  |  4  |
     -----+-----+-----+-----+-----+
      pos | 309 |  30 |   8 | 454 |
SENT -----+-----+-----+-----+-----+
      neg | 163 | 442 | 464 |  18 |
     -----+-----+-----+-----+-----+
    
The cohen kappa score for the review predictions from 2019 to 2022 is 0.6523603977342954.

Hidden Markov Models (HMM)

Two matrices for the HMM: transition matrix (A) and emission matrix (B).
Viterbi Algorithm: DP
Evaluation: precision, recall and F-measure
Semi-Supervised Learning with Self-Training
0. Train an HMM model with the available labelled training data as done in the first part of the task.

Use the trained model to predict the hidden sequences of all available unlabelled data.
Merge the labelled data with the new pseudo-labelled data.
Train a new HMM model on the merged data.
Repeat Step 1 (for the now pseudo-labelled data) to 3 for a total of t iterations.

Social Networks

Undirected and unweighted graph, Node, Link.
e.g, Facebook-style networks, Twitter-style networks
Visualize the network with Gephi
Find the degree of each node.
Determine the diameter of the network using a breadth-first all-pairs shortest path (APSP) algorithm.

Calculate betweenness centrality using Brandes algorithm: networkx.centrality

Clustering

Girvan-Newman algorithm on facebook_circle.edges

PeterHUistyping/Machine_Learning-Real_World_Data