From Asad Sayeed's statistical NLP course at the University of Gothenburg.
My name: Elin Hagman
GitHub: eliyetres
GU: gusihaliel
I added an error message for attempts to truncate to more dimensions than the vocabulary count. The terminal prints the error message: "Error: Singular value decomposition dimensions must be lower than vocabulary limit." and the program exits without errors.
Running anything with the full vocabulary resulted in a MemoryError on my computer, so I ran the files on the server. All the output files are in the LT2212-vt-19-a2 folder on the server.
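A minimal sketch of how such a check could look. The function names `svd_dims_error` and `check_svd_dims` are made up for illustration and are not taken from gendoc.py. The sketch assumes that "lower than" is meant strictly and that "exits without errors" means a clean exit with status 0.

```python
import sys

ERROR_MESSAGE = (
    "Error: Singular value decomposition dimensions must be "
    "lower than vocabulary limit."
)


def svd_dims_error(n_dims, vocab_size):
    """Return the error message if n_dims is not below vocab_size, else None."""
    # Assumption: "lower than" is read strictly, so n_dims == vocab_size
    # is rejected as well.
    if n_dims >= vocab_size:
        return ERROR_MESSAGE
    return None


def check_svd_dims(n_dims, vocab_size):
    """Print the message and stop the program when the request is invalid."""
    message = svd_dims_error(n_dims, vocab_size)
    if message is not None:
        print(message)
        sys.exit(0)  # assumed: clean exit, no traceback
```

Keeping the comparison in its own function makes it possible to test the rule without terminating the interpreter.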
I added a .zip file containing all the outputs.
gendoc.py outputs a CSV file with article names as columns and words as rows. simdoc.py takes that output file and calculates the cosine similarity between the articles, both within and across the crude and grain classes.
I restricted the vocabulary to 20 words. These are the most frequent words across all the articles and should therefore be more significant.
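The class-level comparison can be sketched as follows. This is an illustration, not the code in simdoc.py: the function name `mean_cosine_similarity` is hypothetical, each row is assumed to be one article (so the gendoc.py output would need to be transposed first), every pair is averaged including an article paired with itself, and no article vector is assumed to be all zeros.

```python
import numpy as np


def mean_cosine_similarity(docs_a, docs_b):
    """Average cosine similarity over every pair (a, b).

    docs_a and docs_b are 2-D arrays with one document per row.
    """
    a = np.asarray(docs_a, dtype=float)
    b = np.asarray(docs_b, dtype=float)
    # Scale each row to unit length so a dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Entry (i, j) is the cosine between docs_a[i] and docs_b[j].
    return float((a_unit @ b_unit.T).mean())


# Toy word counts (made up): 2 crude articles, 2 grain articles, 3 words.
crude = np.array([[2, 0, 1], [1, 1, 0]])
grain = np.array([[0, 3, 1], [1, 2, 2]])
print(mean_cosine_similarity(crude, crude))
print(mean_cosine_similarity(crude, grain))
```

Because the pairwise matrix for (grain, crude) is the transpose of the one for (crude, grain), the two averages are the same, which is why a Crude-Grain and a Grain-Crude column carry identical values.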
File names | Crude-Crude | Grain-Grain | Crude-Grain | Grain-Crude |
---|---|---|---|---|
allfiles | 0.37 | 0.33 | 0.31 | 0.31 |
limit20 | 0.69 | 0.63 | 0.62 | 0.62 |
allfilestdif | 0.11 | 0.10 | 0.07 | 0.07 |
limit20tdif | 0.65 | 0.58 | 0.54 | 0.54 |
allfilessvd100 | 0.50 | 0.46 | 0.42 | 0.42 |
allfilessvd1000 | 0.37 | 0.33 | 0.31 | 0.31 |
allfilestdifsvd100 | 0.26 | 0.22 | 0.18 | 0.18 |
allfilestdifsvd1000 | 0.11 | 0.10 | 0.07 | 0.07 |
We convert the word frequencies of the articles in the two classes into numeric vectors. We use cosine similarity to compare documents within and across the classes. The fewer words two documents have in common, the lower their cosine similarity will be. Singular-value decomposition (SVD) truncates the size of the matrices, making them faster to compute with.
Words that appear in similar contexts should be close in meaning. SVD reduces the number of rows while preserving the similarity structure. The cosine similarity is the cosine of the angle between two vectors.
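The row reduction can be illustrated with plain NumPy. This is a sketch under assumptions, not the method used in gendoc.py: the helper name `truncate_svd` is hypothetical, the matrix is laid out with words as rows and articles as columns, and the toy counts are random.

```python
import numpy as np


def truncate_svd(matrix, k):
    """Reduce a (words x articles) matrix to (k x articles) using SVD."""
    x = np.asarray(matrix, dtype=float)
    # Thin SVD: x = u @ diag(s) @ vt, singular values sorted largest first.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    # Keep the k strongest components; each column is still one article.
    return s[:k, None] * vt[:k, :]


rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(50, 6))  # toy data: 50 words, 6 articles

reduced = truncate_svd(counts, 3)
print(reduced.shape)  # (3, 6)
```

When all components are kept, the dot products between article columns are unchanged, so the cosine similarities are too. Dropping the weaker components only approximates them.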