sinaahmadi/klpt

Stop words?

DBCerigo opened this issue · 5 comments

Hi Sina,

Thanks for the great package. We're using it to analyse scraped social media data from Iraq and other Kurdish-speaking areas, so that local peace builders can better understand online popularization and how conflicts are being played out in digital public spheres, and use that to design their peace building initiatives. https://gitlab.com/howtobuildup/phoenix/

I wanted to ask about the Sorani and Kurmanji stop words found at https://github.com/sinaahmadi/klpt/blob/master/klpt/data/stopwords.json
To confirm: they're not currently used within the package's functionality, right?
Re:

# def remove_stopwords(self, text):

Would you see any issue with us using the stopwords.json directly from the package ourselves, after the preprocessing stage and before the tokenisation stage?
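For illustration, here is a minimal sketch of what reading the file directly could look like. It assumes the JSON maps a language name to a script name to a list of words; that layout is an assumption and has not been checked against the actual stopwords.json. The `load_stopwords` helper and the sample entries are hypothetical placeholders, not real stop words.

```python
import json

# Hypothetical sample mirroring the ASSUMED layout of stopwords.json:
# {language: {script: [word, ...]}}. The real file may be organised differently.
SAMPLE_JSON = '{"Sorani": {"Arabic": ["w1", "w2"]}, "Kurmanji": {"Latin": ["w3"]}}'


def load_stopwords(raw_json, language, script):
    """Return the stop words for one language/script pair as a set."""
    data = json.loads(raw_json)
    # A KeyError here means the assumed layout does not match the file.
    return set(data[language][script])


stopwords = load_stopwords(SAMPLE_JSON, "Sorani", "Arabic")
```

In a real pipeline the string would come from reading the file on disk, and the resulting set would be applied between the preprocessing and tokenisation steps.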

Thanks a lot for any of your time spent on considering this 🙂

Hi @DBCerigo,
Thanks for creating this issue. This has been on my to-do list for quite a long time. This seems to be the occasion to add it then :-)

I'll get back to you soon.

Hi again @DBCerigo,

The stopwords are now available in the latest version of the package, through the stopwords variable that you get when importing the Preprocess module.
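Once a stop word list is at hand, filtering tokens with it could look roughly like the sketch below. The standalone `remove_stopwords` function here is a hypothetical helper, not the package's own method, and the tokens and stop words are placeholders rather than real Kurdish data.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop word collection, in order."""
    stopword_set = set(stopwords)  # a set makes each membership check cheap
    return [token for token in tokens if token not in stopword_set]


# Placeholder data: in practice the stop words would come from the package
# and the tokens from whatever tokenizer the pipeline uses.
example_stopwords = ["sw1", "sw2"]
example_tokens = ["sw1", "content", "sw2", "word"]

filtered = remove_stopwords(example_tokens, example_stopwords)
```

Converting the list to a set first matters mostly for long texts, where repeated lookups in a list would be noticeably slower.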

I hope this helps. Please let me know if you would need any further features to be added to the package.

Hey @sinaahmadi, thanks very much for the super prompt reply and implementation! Really appreciated.

We've already started using the stopwords in our open-source project: https://gitlab.com/howtobuildup/phoenix/-/merge_requests/165/diffs

I wondered if you might be open to a discussion about sponsorship for extending klpt further, specifically the stemming functionality? We may have some funds available to sponsor such work, if you were interested in such an arrangement :)

Hi Daniel (@DBCerigo),

Thanks for your interest 😊
Absolutely. I'll be happy to collaborate on any related projects. I have been actively looking for sponsors as well. If you are interested, that'll be great. I prefer GitHub Sponsors for this purpose to keep everything as transparent as possible: https://github.com/sponsors/sinaahmadi/ (you can choose a custom amount too).

I would also like to know more concretely about your tasks of interest so that I can prioritize them in future versions. The stemming functionality is limited to verbs for now. Completing that and also enriching the dictionary would be the next step, I guess. One missing functionality of the stem module is lemmatization. That should also be added at some point.

Depending on your sponsorship and your priorities, I can also include a module for syntactic analysis (POS tagging and parsing). Likewise, there is an ongoing project on sentiment analysis that I hope will also be integrated in the coming months.

Looking forward to hearing from you.

Awesome :)

GitHub Sponsors should be all fine.

Do you want a short meeting together? I can run you through the overall aims of our current project, how we are currently using klpt, and which extensions to it would be most valuable to our project. My email is dan@datavaluepeople.com - feel free to send me a cal invite for a time that suits you best (I'm generally flexible during GMT working hours). Great!