spacy, especially fr_core_news_sm
pip install spacy
python3 -m spacy download fr_core_news_sm
NB: only the lemmatization part uses spacy.
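As a rough illustration of the lemmatization step, here is a minimal sketch assuming the fr_core_news_sm model has been downloaded as described above. The names `load_french_pipeline` and `lemmatize` are hypothetical and are not taken from Text.py; the actual implementation may differ.

```python
from typing import Callable, Iterable, List


def load_french_pipeline():
    """Load the small French spaCy pipeline (needs the download step above)."""
    import spacy  # imported lazily: only lemmatization needs spaCy

    return spacy.load("fr_core_news_sm")


def lemmatize(text: str, nlp: Callable[[str], Iterable]) -> List[str]:
    """Return the lowercase lemma of each token, skipping punctuation and whitespace."""
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if not (token.is_punct or token.is_space)
    ]


# Usage (hypothetical, requires the model to be installed):
#   nlp = load_french_pipeline()
#   lemmas = lemmatize("Les enfants mangeaient des pommes.", nlp)
```

Passing the pipeline in as an argument keeps the model load (the slow part) out of the function, so it can be done once and reused for every text.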
- Text.py: load, clean, and lemmatize a text.
- Tokens.py: tokenize, count words, remove stop words, and get n-grams from a text.
- stop_words.txt: list of the stop words to remove.
- texts: some sample texts to try the tools on (NB: the Rousseau text is not working yet, but will be soon).
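The operations listed for Tokens.py can be sketched with the standard library alone. This is only an illustration: the function names are hypothetical, and the one-word-per-line layout of stop_words.txt is an assumption, not something confirmed by the repository.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def load_stop_words(path: str) -> Set[str]:
    """Read stop words from a file, assuming one word per line."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


def count_words(tokens: Iterable[str]) -> Counter:
    """Map each token to its number of occurrences."""
    return Counter(tokens)


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens (empty if the text is too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A typical chain would be `tokenize`, then `remove_stop_words`, then either `count_words` or `ngrams` on the filtered list.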
- Create a Token class so that the tokens of a text are a list of Token objects (attributes: pos, lemma, etc.)
- Bag of words
- Frequency-ranked dict
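One possible shape for these planned items, given purely as a sketch: the `Token` fields and the function names below are assumptions made for illustration, not the project's design.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    """A single token with the attributes mentioned above (hypothetical fields)."""
    text: str
    lemma: str
    pos: str


def bag_of_words(tokens: List[Token]) -> Counter:
    """Count how often each lemma occurs, ignoring word order."""
    return Counter(token.lemma for token in tokens)


def freq_ranked(bag: Counter) -> Dict[str, int]:
    """Return the counts as a dict ordered from most to least frequent."""
    # most_common() sorts by descending count; dicts keep insertion order.
    return dict(bag.most_common())
```

Counting lemmas rather than surface forms groups inflected variants (for example "chats" and "chat") under one entry, which is usually what a bag of words over lemmatized text is meant to capture.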