This notebook can totally be run on Google Colab, try it!
- The Definitive Guide to Natural Language Processing (NLP)
- Text Classification
- Short Text Classification
- Text Classification Using Naive Bayes
- Text Classification Using Support Vector Machines (SVM)
- Word embeddings: how to transform text into numbers
- The Beginner’s Guide to Text Vectorization
- Sentiment Analysis
- Deep Learning for NLP
- Google's text classification guide
- Genre classification based on Wikipedia movie plots
- Create the vocabulary list with all word stems found in the training set
Can be done with:
- Lower casing
- Standardizing numbers (ex. '12' -> 'number')
- Transforming question marks ('?' -> 'questionmark')
- Word stemming (ex. 'discount', 'discounts', 'discounted', 'discounting' -> 'discount')
- Removing non-useful characters/words (ex. stop words, punctuation)
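The preprocessing steps above can be sketched as follows. This is a minimal, self-contained illustration: the stop-word set and the `crude_stem` suffix stripper are made up for the example, and a real pipeline would use a proper stemmer such as NLTK's PorterStemmer.

```python
import re

# Illustrative stop-word set (a real one would be much larger)
STOP_WORDS = {"the", "a", "an", "is", "for", "and", "to", "of"}

def crude_stem(word):
    # Very rough suffix stripping, for illustration only
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                          # lower casing
    text = re.sub(r"\d+", "number", text)        # standardize numbers
    text = text.replace("?", " questionmark ")   # transform question marks
    text = re.sub(r"[^\w\s]", " ", text)         # drop remaining punctuation
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return [crude_stem(w) for w in tokens]

print(preprocess("Is there a discount for 12 items?"))
# → ['there', 'discount', 'number', 'item', 'questionmark']
```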
Word to vectors:
- For the input text, fill a list the size of the vocabulary with the score of each word.
The following scoring methods can be used for n-grams:
- Binary, i.e. whether the word is present in the text or not
- Count, i.e. the number of times the word appears in the text
- Frequency, i.e. count / total number of words in the text
- TF-IDF (Term Frequency – Inverse Document Frequency), i.e. the score increases with the word's frequency in the text, but a penalty is applied if the word is widely used across the training set (like 'for', 'a', 'the'). These scores highlight words that are distinctive (carry useful information) in a given text.
- If the dataset is large and the sentences are short, word embeddings can be used.
Runtime: 25 min on Colab, 22 min locally
- SVM is a classical machine learning algorithm; MLP is a deep learning algorithm
- Every text must be mapped to the same number of features
- The structure and order of the words are lost
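The loss of word order is easy to demonstrate: two sentences with opposite meanings produce identical bag-of-words representations (the example sentences are made up for illustration).

```python
from collections import Counter

# Same words, different order, opposite meaning
a = "the dog bit the man".split()
b = "the man bit the dog".split()

# Their bags of words are indistinguishable
print(Counter(a) == Counter(b))  # → True
```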