Uni-CourseWork-AI-GroupAssignment-3

Using probability for breaking encrypted text, spam classifier, and POS tagging


a3

Part 1: Part-of-Speech tagging

Formulation and code description:
In this part our goal is to assign a part-of-speech tag to every word in the sentence using three types of Bayes nets: Simple, HMM, and MCMC. As given in the instructions we are only required to change the pos_solver.py file, hence it is the only file that has been modified. A new function called calculate_probabilities computes all the probabilities required for the three models; it is called from the train function. The following probabilities are calculated:
emission: p(W_i/S_i) = c(S_i, W_i) / c(S_i)
transition: p(S_i+1/S_i) = c(S_i, S_i+1) / c(S_i)
initial: p(S_i) = (occurrences of S_i as the tag of the first word of a sentence) / (number of sentences in the training data)
Here S_i is the POS tag and W_i is the word.
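
A minimal sketch of this counting (not the exact code in pos_solver.py; the data layout and names below are assumptions) could look like:

```python
from collections import defaultdict

def calculate_probabilities(data):
    # data is assumed to be a list of (words, tags) pairs, one per sentence
    emission_c = defaultdict(lambda: defaultdict(int))    # c(S_i, W_i)
    transition_c = defaultdict(lambda: defaultdict(int))  # c(S_i, S_i+1)
    initial_c = defaultdict(int)                          # tag counts for first words
    tag_c = defaultdict(int)                              # c(S_i)

    for words, tags in data:
        initial_c[tags[0]] += 1
        for i, (w, t) in enumerate(zip(words, tags)):
            emission_c[t][w] += 1
            tag_c[t] += 1
            if i + 1 < len(tags):
                transition_c[t][tags[i + 1]] += 1

    # turn the counts into the three probability tables described above
    emission = {t: {w: c / tag_c[t] for w, c in ws.items()} for t, ws in emission_c.items()}
    transition = {t: {t2: c / tag_c[t] for t2, c in ts.items()} for t, ts in transition_c.items()}
    initial = {t: c / len(data) for t, c in initial_c.items()}
    return emission, transition, initial
```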

Starting with the simple model, where each observed variable depends only on its own hidden variable, implementing this Bayes net was straightforward. To get the most probable tag for each word of the sentence we pick the tag that has the maximum probability among all tags associated with that word:
-for each word w
   -if w has tags s_1 ... s_k:
      -p(s_i/w) = c(w, s_i) / (c(w, s_1) + ... + c(w, s_k))
Here c(w, s_i) is the number of times w appears with tag s_i in the corpus.
If a word is present in the test set but not in the training set, that word is assigned the most frequently occurring tag in the corpus.
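
A minimal sketch of this rule, assuming `word_tag_count[w][s]` holds c(w, s) and `most_common_tag` is the fallback for unseen words (both names are illustrative):

```python
def simplified(sentence, word_tag_count, most_common_tag):
    """Pick, for each word, the tag s maximizing p(s/w) = c(w,s) / sum_k c(w,s_k)."""
    tags = []
    for w in sentence:
        counts = word_tag_count.get(w)
        if not counts:                      # word never seen in training
            tags.append(most_common_tag)
        else:
            # the denominator is the same for every tag, so an argmax over counts suffices
            tags.append(max(counts, key=counts.get))
    return tags
```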

Next up is the HMM, which is solved using the Viterbi algorithm. In this model each observed variable depends on its own hidden variable, and each hidden variable also depends on the hidden variable of the previous observed variable. Since Viterbi uses dynamic programming, we maintain a list of dictionaries, where each dictionary holds the probabilities of all POS tags for a particular word (this forms that word's column). Once this matrix is calculated we backtrack to get the list of most probable tags for the sentence. The Viterbi algorithm has three parts:
~ calculate probabilities for the first column (the first word of the sentence) using the initial probabilities and the emission probabilities.
~ calculate probabilities for the remaining columns (the rest of the words in the sentence) using the transition probabilities, emission probabilities and previous state probabilities.
~ find the maximum probability in the last column and backtrack to get the most probable tags for the sentence, append them to a list and return the list.
Reference: https://web.stanford.edu/~jurafsky/slp3/8.pdf
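
A condensed sketch of these three parts, assuming the probability dictionaries computed earlier and a small floor value for unseen words (names here are illustrative, not the exact code):

```python
import math

SMALL = 1e-10  # floor for unseen word/tag combinations

def viterbi(sentence, tags, initial_p, transition_p, emission_p):
    """Return the most probable tag sequence; one dictionary per word (column)."""
    V = [{}]  # V[i][t] = (best log-probability of ending in tag t at word i, backpointer)

    # part 1: first column from initial and emission probabilities
    for t in tags:
        e = emission_p.get(t, {}).get(sentence[0], SMALL)
        V[0][t] = (math.log(initial_p.get(t, SMALL)) + math.log(e), None)

    # part 2: remaining columns from transition, emission and previous state probabilities
    for i in range(1, len(sentence)):
        col = {}
        for t in tags:
            e = emission_p.get(t, {}).get(sentence[i], SMALL)
            best_prev = max(tags, key=lambda p: V[i - 1][p][0]
                            + math.log(transition_p.get(p, {}).get(t, SMALL)))
            col[t] = (V[i - 1][best_prev][0]
                      + math.log(transition_p.get(best_prev, {}).get(t, SMALL))
                      + math.log(e), best_prev)
        V.append(col)

    # part 3: take the maximum in the last column and backtrack
    best = max(V[-1], key=lambda t: V[-1][t][0])
    path = [best]
    for i in range(len(sentence) - 1, 0, -1):
        best = V[i][best][1]
        path.append(best)
    return list(reversed(path))
```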

MCMC (Markov chain Monte Carlo) is used to build a sampling space through randomization. We start by assuming random POS tags for the sentence. We then resample the POS tags: for each word we use the distribution to calculate the probability of each tag, given the tags currently assigned to all the other words, and assign the tag with the maximum probability using these formulas:
For the first word of the sentence: P(S_0/W_0) ∝ P(W_0/S_0) * P(S_0)
For the last word of the sentence: P(S_i/W_i) ∝ P(W_i/S_i) * P(S_i/S_i-1) * P(S_i-1) * P(S_i/S_0) * P(S_0)
For the other words of the sentence: P(S_i/W_i) ∝ P(W_i/S_i) * P(S_i/S_i-1) * P(S_i-1)
1000 samples are generated, where the initial sample assigns the noun tag to every word in the sentence.
For each sample, every word's probability is calculated for all 12 part-of-speech tags and the tag with the highest probability is chosen for that word. The tag of this word is then updated in the current sample, which is used in the next iteration. After the 1000 samples, tag counts are accumulated per word across the samples, ignoring the first 500 samples. The tag with the maximum count for a word is its final tag, so the final POS tag list contains the highest-probability tag for every word. A sketch of this loop is shown below.
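
A simplified sketch of this sampling loop, assuming a helper `tag_score(sentence, sample, i, tag)` that evaluates the three formulas above (all names are illustrative). The write-up takes the highest-scoring tag at each step rather than drawing it from the distribution, and the sketch mirrors that choice:

```python
from collections import Counter

# the 12 tags used in the assignment's corpus (listed here as an assumption)
TAGS = ["noun", "verb", "adj", "adv", "adp", "conj", "det", "num", "pron", "prt", "x", "."]

def mcmc(sentence, tag_score, n_samples=1000, burn_in=500):
    """Gibbs-style sampling: start with all nouns, update one word at a time."""
    sample = ["noun"] * len(sentence)
    counts = [Counter() for _ in sentence]

    for s in range(n_samples):
        for i in range(len(sentence)):
            # score every tag for word i given the tags currently assigned to the other words
            scores = [tag_score(sentence, sample, i, t) for t in TAGS]
            sample[i] = TAGS[scores.index(max(scores))]   # keep the highest-scoring tag
        if s >= burn_in:                                  # ignore the first 500 samples
            for i, t in enumerate(sample):
                counts[i][t] += 1

    # final tag for each word = the tag seen most often across the kept samples
    return [c.most_common(1)[0][0] for c in counts]
```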

The log posterior is calculated using the formula for the posterior probability:
posterior: p(t/w) = p(t) * p(w/t) / p(w)
and then taking its logarithm to base 10.
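
For the simple model this amounts to summing base-10 logs of the factors, dropping p(w) since it does not depend on the tags (a sketch under the assumption that a dictionary `tag_prior` holds p(t)):

```python
import math

SMALL = 1e-10  # floor for probabilities of unseen word/tag pairs

def log_posterior_simple(sentence, tags, tag_prior, emission_p):
    # log10 p(t/w) up to the constant log10 p(w):
    # sum over words of log10 p(t_i) + log10 p(w_i/t_i)
    total = 0.0
    for w, t in zip(sentence, tags):
        total += math.log10(tag_prior.get(t, SMALL))
        total += math.log10(emission_p.get(t, {}).get(w, SMALL))
    return total
```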

Other Discussion:
There were many design decisions to be made initially, such as: how to train on the data, which data structures to use, and what the flow of the code should be. One of the major design decisions was to use global dictionaries for the probabilities so that the whole code can access them. Once these questions were answered we started working on the code. For the simple model the implementation was straightforward. One assumption made for this model is that if a word is present in the test file but not in the training file, we assign it the most frequently occurring tag in the corpus.

For the HMM model we maintain a list called viterbi which works like a matrix (because Viterbi uses dynamic programming). For this model the assumption we made is that if a word is present in the test set but not in the training set, its emission probability is set to a very small value. One of the major issues we had with this model was key errors; we had to add several conditions to handle them.

For MCMC we make two assumptions: the first is the same as above, that if a word or a tag is not present in the training set, its emission probability is set to a very small value. The second is that for every word we pick the tag with the maximum probability; however, we work with raw occurrence counts and skip the division by the total occurrences of the tag for that word, since it does not affect which tag is the maximum.

Results:
==> So far scored 2000 sentences with 29442 words.

                    Words correct   Sentences correct
0. Ground Truth         100.00%             100.00%
1. Simple                93.92%              47.45%
2. HMM                   95.06%              54.25%
3. Complex               92.12%              40.80%


Part 2: Code breaking


# (1) a description of how you formulated the problem


Since the encryption is entirely random, the decoding I formulated is based on randomness as well, but runs in the opposite direction of the encoding function. I visualized this problem as a search where we have to maximize the likelihood of the text being as English-like as possible. Since the probabilities were going to be very small, I used negative logs instead and converted the maximization into a minimization-of-cost problem, where the costs are negative log probabilities. The graph was something like this.

Here, the cost at A would be a local minimum and the cost at B the global minimum. We need to return the lowest-cost text found so far as the answer at the ten-minute mark. It may be that we never find B and only find A; ideally, B is the answer returned at the ten-minute mark. So we need to search along the graph for minima and store them as they come along. We use the Metropolis-Hastings algorithm for that. Our search does not always minimize the cost, since that would cause it to get stuck at a local minimum if we encounter one before the global minimum (the answer).

How the search works:

  • At the start, we calculate the probability of the input string by adding the probability of the first character and, for each subsequent letter of a word, the probability of the current letter given the previous letter.
  • The probabilities of all the words are then added to get the probability of the entire string. This is the best (lowest) cost found so far, so we store it.
  • Next we calculate two tables, a rearrangement table and a replacement table, and decrypt the text with them. This is our starting point.
  • Calculating the probability of this decrypted string gives us the next point on the graph. We compare it with the current minimum and store the new text if its cost is lower.
  • Now we change the two tables slightly and try decoding and calculating the probability again. If the cost increases it means we made the tables worse than they were.
  • If the cost is lower, we are moving in the right direction and we keep these tables in place of the previous ones.
  • We cannot always reject the worse tables, otherwise the search will get stuck at a local minimum. So with a small probability we allow the search to move in the wrong direction. We keep updating the minimum whenever the search reaches a lower cost.
  • The entire process is just changing our two tables one swap at a time to get as close to the original encryption tables as possible; a sketch of the loop follows this list.
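
A compressed sketch of this loop (names are illustrative; `decode`, `modify` and `cost` stand in for the functions described in the next section, and costs are negative logs so lower is better):

```python
import copy
import math
import random
import time

def search(encrypted, rearrange, replace, decode, modify, cost, time_limit=600):
    # rearrange/replace are the starting decryption tables
    best_text = current_text = decode(encrypted, rearrange, replace)
    best_cost = current_cost = cost(current_text)

    start = time.time()
    while time.time() - start < time_limit:
        new_rearrange, new_replace = modify(copy.deepcopy(rearrange), copy.deepcopy(replace))
        new_text = decode(encrypted, new_rearrange, new_replace)
        new_cost = cost(new_text)

        # always accept an improvement; accept a worse move only with a small,
        # further damped probability (the division by 50 described later)
        if new_cost < current_cost or random.random() < math.exp(current_cost - new_cost) / 50:
            rearrange, replace = new_rearrange, new_replace
            current_text, current_cost = new_text, new_cost
            if current_cost < best_cost:
                best_text, best_cost = current_text, current_cost

    return best_text
```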


# (2) a brief description of how your program works


find_probab(text,frequency_dictionary): Returns a probability value for the text string using the frequency dictionary
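
A sketch of what this scoring could look like in negative-log form, assuming `frequency_dictionary[current][previous]["probab"]` follows the nested layout described under break_code below; the separate first-character term mentioned above is omitted for brevity:

```python
import math

SMALL = 1e-6  # fallback probability for letter pairs never seen in the corpus

def find_probab(text, frequency_dictionary):
    # negative-log "cost" of the text under the letter-bigram model: lower = more English-like
    cost = 0.0
    prev = None
    for ch in text:
        if ch == " " or prev is None or prev == " ":
            # spaces (and, in this sketch, word-initial letters) contribute log(1) = 0
            prev = ch
            continue
        p = frequency_dictionary.get(ch, {}).get(prev, {}).get("probab", SMALL)
        cost += -math.log(p)
        prev = ch
    return cost
```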

decode(str2, decrypt_rearrange_table, decrypt_replace_table): Decodes the str2 string using decrypt tables and returns a string. This works in the opposite flow of the encode function by doing rearrangement first and then replacement.

modify_decryption_tables(decrypt_rearrange_table, decrypt_replace_table): Modifies either of the tables by swapping two elements. The odds of picking the rearrangement table versus the replacement table are 6:325. Since there are 4 keys in the rearrangement table, 4C2 = 6 is the number of ways two keys can be selected at a time; similarly 26C2 = 325 for the replacement table.
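
A sketch of this swap, assuming the rearrangement table is a dict over 4 positions and the replacement table a dict over 26 letters:

```python
import random

def modify_decryption_tables(decrypt_rearrange_table, decrypt_replace_table):
    """Swap two entries in one of the tables, weighted 6 : 325 (4C2 : 26C2)."""
    if random.randint(1, 6 + 325) <= 6:
        table = decrypt_rearrange_table
    else:
        table = decrypt_replace_table
    k1, k2 = random.sample(list(table.keys()), 2)
    table[k1], table[k2] = table[k2], table[k1]
    return decrypt_rearrange_table, decrypt_replace_table
```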

break_code(string, corpus): Takes the string and the corpus and returns a decrypted answer within ten minutes. It calculates frequencies in a nested dictionary such that the outer keys are the current letter and the inner keys are the previous letter; a{b{count:1, probab:1}} means a occurring after b has occurred once, and the probability of that happening is one out of all the combinations. After the probabilities are calculated, we build initial decryption tables. We decrypt the string and then, in a loop, we modify the tables, decrypt, check whether they are better by comparing the probability values, and store the result as the minimum if they are. This runs for ten minutes, and we return the minimum-cost text at the ten-minute mark as the final decrypted text.
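
A sketch of building that nested frequency dictionary from the corpus (names are assumptions; the normalisation here divides by the total number of letter pairs, whereas the write-up describes scaling by the length of the string):

```python
from collections import defaultdict

def build_frequency_dictionary(corpus_text):
    # nested counts: freq[current][previous] = {"count": ..., "probab": ...}
    freq = defaultdict(lambda: defaultdict(lambda: {"count": 0, "probab": 0.0}))
    prev = None
    for ch in corpus_text:
        if prev is not None:
            freq[ch][prev]["count"] += 1
        prev = ch

    total_pairs = max(len(corpus_text) - 1, 1)
    for cur in freq:
        for p in freq[cur]:
            freq[cur][p]["probab"] = freq[cur][p]["count"] / total_pairs
    return freq
```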


# (3) discussion of any problems you faced, any assumptions, simplifications, and/or design decisions you made.


  • Initially, I thought of calculating the probabilities as the frequency of the current letter succeeding the previous letter divided by the frequency of the previous letter. Even though this made sense logically, the answers never came close; in fact, inputs in perfect English got scores worse than random garbage text snippets. I changed this several times and finally settled on dividing by the length of the incoming string. These scaled-down probability values are large enough that they still make a difference at each step, but not so large that a jump in probability becomes too big for the program to accept, thinking it is wrong.

  • Rather than calculating the probability of each word separately, I just treat the probability of the current letter being a space as 1, so its log is 0 and the result ends up being a normal summation of all the word probabilities, computed in a single loop that runs over the characters.

  • Also, while deciding whether to go in the worse direction, where the new probability p(D') is worse than p(D), letting the tables be accepted even for bad swaps made the code too random. I noticed that the code would come close to the solution but then run away from it very easily. So I reduced the chance of this happening and made it rarer for the algorithm to choose worse tables after a swap, instead forcing it to look for better solutions nearby. This was done by simply dividing the acceptance probability (p(D')/p(D)) by a number like 50. This worked better in experimentation and the answers came closer, so I chose to keep it.

  • My probability values did drop below zero once the text got more and more English-like. Logically this does not make sense, but since I had only scaled down my probabilities rather than dividing by a proper total, this was expected. I had to make my code work for both cases, where the probabilities are positive and where they are negative.

    Test outputs: the output for encrypted-text-1.txt is output1.txt, for encrypted-text-2.txt it is output2.txt, and so on. Each was run with the 10 minute limit. I also added a prop.txt file that was not encrypted, to test how the code deals with such files; it does not change the input file at all, since it is already English-like. Probab.txt is the dictionary of frequencies and probabilities I use to calculate the probability of the file.

Example > encrypted-text-3 Input: i enpepn o echwxeh epiz lktpw fhnhfq cfe hzkc ndtewc onl cknlqlo kctfcnoe g egzcz atgcnk mnceatnpekouf nnzhcf utz pwgnf pmewhzp qn fwfnfkh pige pny ekcxexcd guwec eocclkwp qnhfcqc hklpfkeup lkcmy ch cc eooce w lkz cxcnnkkphce pnokcl cu ooocftwc pkoefckhwhf nqfh wccentdof cgae lkckw fmc fthip accpx co kfckft klegiy nwtitpohkocp tflock cwc pfp eytc pweo kmk fc mlnhotocxckk ucn lc ftzw ehnhfq cfchnlxcg onfkfl lkkmt c chexuco knkhoc ckuxn k et ocgxc lkmkk fc mlcvcok e ntpznok zgi nggzckpwzhe kcl ogcwc oachnllxe oncpke k e cu dpcteptcwk e ncagn p ecpgncpgoc ouigsc tthwezecp e z nnaunfrgnpek geockotnop nikknpek clo ofm ezv jchf ginpqtk mnpk ncc lc omzk ecg lyl f peckknokeni hqppokfkcl lkwkh fqklnl c iqtbckpwe z l ckikhee z xfcxoz ghk ecn lk lshiunkn n chocohatk cn o ee othcht c

Output : o union is more profound than marriage for it embodies the highest ideals of love fidelitw devotion sacrifice and familw n forming a marital union tyo people become something greater than once thew yere s some of the petitioners in these cases demonstrate marriage embodies a love that maw endure even past death t yould misunderstand these men and yomen to saw thew disrespect the idea of marriage heir plea is that thew do respect it respect it so deeplw that thew seex to find its fulfillment for themselves heir hope is not to be condemned to live in loneliness ekcluded from one of civilizations oldest institutions hew asx for equal dignitw in the ewes of the lay he onstitution grants them that right he judgment of the ourt of ppeals for the ikth ircuit is reversed t is so ordered

The above output was generated within ten minutes during our testing and was quite close to the answer, so we did not find a need to perform beam search or to optimize the algorithm further.

Part 3: Spam Classification

Naive Bayes classifier is used to implement the spam classifier.

This program involves a training part, where the likelihood probabilities are calculated from the provided training data, and a testing part, where the program predicts whether a file is spam or not.

Methods used in the program:

Test() - To test the data and predict whether the file is spam or not

Train() - To train on the provided spam and non-spam files

likelihood() - To calculate likelihood probabilities using spam and non-spam files.

trainSpam() - To train on spam files.

trainNotSpam() - To train on non-spam files.

Approach:

  • We need to calculate P(S/w1,w2,w3,…,wn) / P(S_/w1,w2,w3,…,wn) for each test file, where P(S/w1,w2,w3,…,wn) is the probability of an email being spam given the words w1,w2,w3,…,wn and P(S_/w1,w2,w3,…,wn) is the probability of an email not being spam given the same words.
  • For calculating these probabilities we need P(w1,w2,w3,…,wn/S)*P(S)/P(w1,w2,w3,…,wn) and P(w1,w2,w3,…,wn/S_)*P(S_)/P(w1,w2,w3,…,wn).
  • But since we calculate an odds ratio, we only need P(w1,w2,w3,…,wn/S), P(w1,w2,w3,…,wn/S_), P(S) and P(S_).
  • Now, the naive assumption is made that no word depends on any other word, so we need P(w1/S), P(w2/S), …, P(wn/S) and P(w1/S_), P(w2/S_), …, P(wn/S_), where P(wi/S) is the probability of word wi occurring in a spam file and P(wi/S_) is the probability of word wi occurring in a non-spam file.

Implementation:

  • P(S) and P(S_) are assumed to be 0.5, as every mail has an equal chance of being spam or not spam.

  • Frequencies of words are calculated in the trainSpam() and trainNotSpam() methods by going through every file in the spam and non-spam data, breaking the file text into word tokens, and saving the frequencies of those words separately for the spam and non-spam files.

  • P(wi/S) and P(wi/S_) are calculated in the likelihood() method during training for every word tokenized in the training process.

    P(wi/S) = frequency of the word in spam files / total number of words in spam files
    P(wi/S_) = frequency of the word in non-spam files / total number of words in non-spam files

  • After the probabilities are calculated from the training data, they are used during testing to predict whether a file is 'spam' or 'notspam'.

  • Every test file is converted to a bag of words, and the log of the ratio P(S/w1,w2,…,wn) / P(S_/w1,w2,…,wn) is calculated using the probabilities estimated from the training data, where

    P(S/w1,w2,…,wn) / P(S_/w1,w2,…,wn) = (P(w1/S)*P(w2/S)*…*P(wn/S)*P(S)) / (P(w1/S_)*P(w2/S_)*…*P(wn/S_)*P(S_))

  • If the ratio is greater than 1 (equivalently, the log is greater than 0, meaning the probability of the email being spam given the words is greater than 0.5), the email is labeled as spam, otherwise it is labeled as notspam. The file names and the resulting labels are stored in a list and written to an output file, as in the sketch below.
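
A sketch of this decision rule in log space, assuming `p_spam[w]` and `p_notspam[w]` hold P(w/S) and P(w/S_) estimated during training (names are illustrative):

```python
import math

SMALL = 1e-9  # fallback for words not seen during training

def classify(words, p_spam, p_notspam, prior_spam=0.5, prior_notspam=0.5):
    """Label a tokenized email using the log of the odds ratio P(S/words) / P(S_/words)."""
    log_odds = math.log(prior_spam) - math.log(prior_notspam)
    for w in words:
        log_odds += math.log(p_spam.get(w, SMALL)) - math.log(p_notspam.get(w, SMALL))
    # log-odds > 0 corresponds to an odds ratio > 1, i.e. P(spam | words) > 0.5
    return "spam" if log_odds > 0 else "notspam"
```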

Difficulties faced:

  • Difficulty was faced while opening some of the files due to a UnicodeError.
  • The files are therefore opened with utf-8 encoding and errors='ignore'.
  • Commas, semicolons and equals signs are replaced in the file text with whitespace so the words tokenize properly.
  • The word is converted to lower case before storing to avoid multiple copies of the same word in different cases.
  • A dictionary 'vocab' is maintained that contains all the unique words from the training data, with their frequencies kept separately for spam and non-spam files.
  • The probabilities of test data are calculated using log to avoid working with extremely small values.

Approaches tried:

  • The likelihood probabilities are calculated as ->

    frequency of the word in spam or non-spam files/total number of words in spam or non-spam files

  • A different approach was tried originally where the likelihood probabilities were calculated as->

    number of spam or non-spam files in which the word appears / total number of spam or non-spam files

Currently, the program predicts whether a file is spam correctly 95% of the time.