Discussion - stopwords
leomaurodesenv opened this issue · 4 comments
I like texthero, and I want to contribute somehow.
First, I want to discuss something that bothers me: stopwords.
Problem - I want to deploy a solution without the spacy stopwords requirement and, if possible, add my own stopwords.
My solution is based on Docker containers. Downloading files every time a new container is instantiated is bad practice: it causes a cold-start problem and uses unnecessary space (because I don't use them).
With that in mind:
- Is it possible to remove the spacy stopwords requirement?
- How can we add general stopwords, according to our own language needs?
- Do we have a stopwords dictionary for many languages outside spacy?
- How can we turn off the stopwords download?
Hi Leonardo, thank you for opening this issue. I agree with you, it's quite annoying that stopwords are downloaded even when they are not needed. This should have been fixed in #194. I will soon release a new version that includes the patch.
Regarding your other questions:
- Removal of spacy stopwords requirements. I believe we can get rid of the spacy requirement completely by saving all stopwords in a txt file (or another file format) and loading that file directly. Do you want to work on that?
- Multilingual support is something we have wanted to introduce for quite a long time... If you are interested in helping to develop a general solution that works for many languages, I would be more than happy to talk!
- Currently, Texthero fully supports only English. Adding stopwords for other languages (with Spacy, for instance) should be trivial though; this is closely related to point 1.
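To make point 1 concrete, here is a minimal sketch of the file-based approach using only the standard library. It is not Texthero's actual implementation: the file name `stopwords_en.txt`, the one-word-per-line format with `#` comments, and the helper names `load_stopwords` and `remove_stopwords` are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line, ignoring blank lines and '#' comments."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def remove_stopwords(text, stopwords):
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(t for t in text.split() if t.lower() not in stopwords)


# Demo: a temporary file stands in for a stopwords file bundled with the package.
# The file name and its contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "stopwords_en.txt"
    demo_file.write_text("# demo list\nthe\nis\na\n", encoding="utf-8")
    english = load_stopwords(demo_file)

# Users can extend the loaded set with their own words.
custom = english | {"texthero"}
cleaned = remove_stopwords("The texthero library is a nice tool", custom)
```

Since the result is a plain set, adding words for another language or domain would just be a set union, and nothing has to be downloaded when a container starts.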
Hope it helps!
Best,
Hi Leonardo,
I just released a new version (Texthero 1.1.0); stopwords should now be downloaded lazily. Would you mind trying it and letting me know? Later on, we can discuss your other great points further!
Hello @jbesomi, sorry for my late answer.
Sure, I'm going to try it out next week.
Yes, I would like to help, but I'm not sure how to support multilingual stopwords. Adding multilingual embeddings could improve things, but it would also slow down the code. This is tough.. heheh
Removal of spacy stopwords requirements. I'm going to take a look and send a message here.
Thanks for the update, Leo. As you suggested, we can start by improving the stopwords (for English) and see how it goes. Multilingual support requires some thinking and refactoring; we can discuss that later on, once the simpler version is implemented.
Best,