This notebook can totally be run on Google Colab, try it!
- The Definitive Guide to Natural Language Processing (NLP)
- Text Classification
- Short Text Classification
- Text Classification Using Naive Bayes
- Text Classification Using Support Vector Machines (SVM)
- Word embeddings: how to transform text into numbers
- The Beginner’s Guide to Text Vectorization
- Sentiment Analysis
- Deep Learning for NLP
- Google's text classification guide
- Genre classification based on Wikipedia movie plots
- Create the vocabulary list with all word stems found in the training set
Can be done with:
- Lower casing
- Standardizing numbers (ex. '12' -> 'number')
- Transforming question marks ('?' -> 'questionmark')
- Word stemming (ex. 'discount', 'discounts', 'discounted', 'discounting' -> 'discount')
- Removing non-useful characters/words (ex. stop words, punctuation)
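The preprocessing steps above can be sketched as follows. This is a minimal, self-contained illustration: the stop-word set and the `crude_stem` suffix stripper are made up for the example, and a real pipeline would use a proper stemmer such as NLTK's PorterStemmer.

```python
import re

# Illustrative stop-word set (a real one would be much larger)
STOP_WORDS = {"the", "a", "an", "is", "for", "and", "to", "of"}

def crude_stem(word):
    # Very rough suffix stripping, for illustration only
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                          # lower casing
    text = re.sub(r"\d+", "number", text)        # standardize numbers
    text = text.replace("?", " questionmark ")   # transform question marks
    text = re.sub(r"[^\w\s]", " ", text)         # drop remaining punctuation
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return [crude_stem(w) for w in tokens]

print(preprocess("Is there a discount for 12 items?"))
# → ['there', 'discount', 'number', 'item', 'questionmark']
```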
Word to vectors:
- For the input text, fill a list the size of the vocabulary with the score of each word.
The following scoring methods can be used for n-grams:
- Binary, i.e. whether the word is present in the text or not
- Count, i.e. the number of times the word appears in the text
- Frequency, i.e. count / total number of words in the text
- TF-IDF (Term Frequency – Inverse Document Frequency), i.e. the score increases with the word's frequency in the text, but a penalty is applied if the word is widely used across the training set (like 'for', 'a', 'the'). These scores highlight words that are distinctive (carry useful information) in a given text.
- If the dataset is large and the sentences are short, word embeddings can be used.
Runtime: 25 min on Colab, 22 min locally
- SVM is a classical machine learning algorithm; MLP is a deep learning algorithm
- Every text must be mapped to the same number of features
- The structure and order of the words are lost
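The loss of word order is easy to demonstrate: two sentences with opposite meanings produce identical bag-of-words representations (the example sentences are made up for illustration).

```python
from collections import Counter

# Same words, different order, opposite meaning
a = "the dog bit the man".split()
b = "the man bit the dog".split()

# Their bags of words are indistinguishable
print(Counter(a) == Counter(b))  # → True
```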