Text Learning | Example code and own notes while taking the course "Intro to Machine Learning" on Udacity.
In this model. text (such as a sentence or a document) is represented as the bag (multiset) of it's words, disregarding grammar and even word order but keeping multiplicity. (Wikipedia)
Give the array of sentences or documents and fit/transform vectorizer.vocabulary_.get("greet")
Not all words are equal! Some words contain more information than others!
You should remove the words called "stopwords" like; the, in, for, you, will, have, be.
from nltk.corpus import stopwords
sw = stopwords.words("english")
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
For exapmple, the words below:
- unresponsive
- response
- responsivity
- respond
After using stemmer, they all transformed into "respon".
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("responsive")
Tf | Term frequency: like bag of words.
The weight of a term that occurs in a document is simply proportional to the term frequency.
Idf | Inverse document frequency: weighting by how often word occurs in corpus.
The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.