In this project, I have made a toy GUI that calculates the TF-IDF vector of a given sentence. The project is meant to give better insight into the TF-IDF method of sentence embedding in Natural Language Processing.
The idea is to provide the application with a preprocessed text file, which it opens and uses to train a TF-IDF model internally. The application uses sklearn's implementation of the TF-IDF vectorizer for training and then asks for a sentence (again preprocessed), on which it runs the trained TF-IDF model. It then outputs the TF-IDF vector with the corresponding weights.
NOTE: The input file and the sentence need to be preprocessed in advance.
python 3.6.6
scikit-learn 0.20.0
scipy 1.1.0
pyqt 5.9.6
The TF-IDF vectorizer works by assigning each word in a sentence a numeric weight. This weight combines two properties of the word: how often it occurs in the sentence (frequency) and how rare it is across the training documents (uniqueness).
TF - term frequency
IDF - inverse document frequency
We multiply these two numbers together to get the final weight for each word. The vector representing the sentence is the collection of these weights. This is how we find the TF-IDF vector.
Once we have TF-IDF vectors, we can apply similarity measures such as cosine similarity to find how similar two vectors are.
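As a rough illustration of the multiplication described above, here is a minimal sketch of the textbook formula, with TF as the word count divided by sentence length and IDF as log(N / document frequency). The function name `tfidf_vectors` and the tiny corpus are made up for this example. Note that sklearn's vectorizer uses a smoothed IDF and normalizes each vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {word: weight} dict per document using textbook TF-IDF."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # TF (relative count) multiplied by IDF (log of inverse document frequency).
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

# Hypothetical, already preprocessed corpus.
corpus = ["the cat sat", "the dog barked", "the cat purred"]
weights = tfidf_vectors(corpus)
print(weights[0])
```

A word such as "the" that appears in every document gets a weight of zero, while a word unique to one document gets the largest weight in that document.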
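To make the similarity step concrete, here is a small sketch of cosine similarity written with plain Python lists. The vectors `a`, `b`, and `c` are invented values, not output from this application. In practice sklearn also provides `sklearn.metrics.pairwise.cosine_similarity`, which works directly on the matrices the vectorizer produces.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up TF-IDF vectors over a three-word vocabulary.
a = [0.5, 0.0, 0.8]
b = [1.0, 0.0, 1.6]   # same direction as a, different length
c = [0.0, 1.0, 0.0]   # shares no words with a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and vectors with no words in common score 0.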
The GUI first takes a text file that has been normalized and has had its punctuation removed. The Python program underneath the GUI then trains the TF-IDF vectorizer imported from sklearn.
To run the GUI, run app.py, which internally imports tfidf.py.
Selecting a file to train the TF-IDF model on.
The application then asks for an input sentence (again normalized).
The program takes this sentence, transforms it into TF-IDF space, and opens a popup that shows the TF-IDF vector of the sentence provided.
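The train-then-transform flow can be sketched without the GUI as follows. This is not the code in tfidf.py. The helper `sentence_weights` and the sample lines are hypothetical, and the sketch assumes the training file holds one document per line.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the lines of the preprocessed training file
# (assumption: one document per line).
training_lines = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(training_lines)  # learns the vocabulary and the IDF weights

def sentence_weights(sentence):
    """Map each known word in the sentence to its TF-IDF weight."""
    row = vectorizer.transform([sentence])  # sparse matrix of shape (1, vocabulary size)
    index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}
    # Keep only the non-zero entries, i.e. words that occur in the sentence.
    return {index_to_word[i]: float(row[0, i]) for i in row.nonzero()[1]}

weights = sentence_weights("the cat chased the bird")
for word, weight in sorted(weights.items()):
    print(word, round(weight, 4))
```

Words that never appeared in the training file ("bird" here) are not in the vocabulary, so they are dropped from the resulting vector.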
I have cited the code that I have reused from a source. 1, 2