optimaize/language-detector

Using this library to detect Arabic dialects

odaymard opened this issue · 23 comments

I am very happy to have found this tool.
I have a question:
can this library help me detect Arabic dialects (Syrian, Iraqi, Gulf)?
I will try to build a corpus for each dialect and generate a language profile from each one.
Is that the right approach?

This is a very interesting idea. Is the spelling of words different from one Arabic dialect to another?

Yes, of course.

I think the idea is sound and that it's worth trying. I don't think you would need to make any changes to this library in order to try it.

If you run into any trouble generating profiles from your corpora, feel free to post here and I'll be glad to help.
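To make the idea concrete, here is a minimal, self-contained sketch of how character n-gram profiles can separate texts whose spelling differs. It is an illustration only: the class name, the cosine-similarity scoring, and the toy Latin-script strings standing in for dialect corpora are all assumptions, and it does not reproduce this library's profile format or detection algorithm.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: not the library's profile format or scoring method.
class NgramProfileSketch {

    // Relative frequencies of all character n-grams of length 1 to 3.
    static Map<String, Double> buildProfile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> profile = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            profile.put(e.getKey(), e.getValue() / (double) total);
        }
        return profile;
    }

    // Cosine similarity between two n-gram frequency profiles (0 if either is empty).
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Label of the stored profile most similar to the given text.
    static String closest(String text, Map<String, Map<String, Double>> profiles) {
        Map<String, Double> query = buildProfile(text);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = similarity(query, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy strings standing in for real dialect corpora (hypothetical data).
        Map<String, Map<String, Double>> profiles = new LinkedHashMap<>();
        profiles.put("dialectA", buildProfile("shu baddak shu akhbarak"));
        profiles.put("dialectB", buildProfile("eh da ezzayak ya basha"));
        System.out.println(closest("shu baddak", profiles));
    }
}
```

With real corpora, the more the dialects differ in spelling, the further apart their n-gram profiles should be, which is why the spelling question above matters.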

Sorry, but I am new to Java.
How do I run the code to detect the language of some text?
Is a Maven project mandatory,
or can I add the jar file as a library?

Yes, you can use it with a jar file--I find that to be the most convenient. There's sample code here, and here starting at line 16.

Hi, how are you?
Thank you for helping me.
How do I generate a profile from a text file, please?

The easiest way is to use the jar from shuyo's repository. Here's an example of generating an Egyptian Arabic (arz) profile from the Egyptian Arabic Wikipedia abstract dump, on Linux:

git clone git@github.com:shuyo/language-detection.git
cd language-detection
mkdir abstracts
cd abstracts
wget https://dumps.wikimedia.org/arzwiki/20160407/arzwiki-20160407-abstract.xml
mkdir profiles
java -jar ../lib/langdetect.jar --genprofile -d . arz

Run this process to make sure it works. Then replace the abstract text file with your dialect corpus and run the last step again.

@odaymard Any progress on this?

I am using the Facebook API and the Twitter API to collect data, but facebook4j is slow, so I am trying to make it faster.

Hi,
I have finished the Syrian profile.
How do I upload it?

Nice! You need to (1) fork this project, (2) add your new profile to your fork, and (3) create a pull request to this project.

I did that. What's next?

It's up to you. Some suggestions:

  1. Wait for your pull request to be merged. In the meantime, it may be useful to others if you publish the training data used for the various dialects in your own project on GitHub.
  2. Publish profiles for other dialects. Personally, I'd be interested in some test cases and/or test results showing how effective the profiles are for identifying/distinguishing actual regional texts or test cases. These tests would be most meaningful when more dialects are supported.
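As a rough sketch of what such test results could look like, the snippet below tallies a confusion table and overall accuracy for labeled samples. Everything here is hypothetical: the class name, the placeholder sample strings, and the stub detector, which stands in for a real call into this library and is not part of its API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustrative evaluation harness; the detector is passed in as a plain function.
class DialectEvalSketch {

    // Builds a confusion table: expected label -> (predicted label -> count).
    // Keys of labeledTexts are sample texts, values are their expected labels.
    static Map<String, Map<String, Integer>> confusion(
            Map<String, String> labeledTexts, Function<String, String> detector) {
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        for (Map.Entry<String, String> sample : labeledTexts.entrySet()) {
            String expected = sample.getValue();
            String predicted = detector.apply(sample.getKey());
            table.computeIfAbsent(expected, k -> new TreeMap<>())
                 .merge(predicted, 1, Integer::sum);
        }
        return table;
    }

    // Fraction of samples whose predicted label equals the expected label.
    static double accuracy(Map<String, Map<String, Integer>> table) {
        int correct = 0;
        int total = 0;
        for (Map.Entry<String, Map<String, Integer>> row : table.entrySet()) {
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                total += cell.getValue();
                if (row.getKey().equals(cell.getKey())) {
                    correct += cell.getValue();
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Placeholder samples; real tests would use actual regional texts.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("s-sample-1", "syrian");
        samples.put("s-sample-2", "syrian");
        samples.put("i-sample-1", "iraqi");
        samples.put("s-sample-3", "iraqi");

        // Hypothetical stub; replace with a call to the real detector.
        Function<String, String> stub = t -> t.startsWith("s") ? "syrian" : "iraqi";

        Map<String, Map<String, Integer>> table = confusion(samples, stub);
        System.out.println(table);
        System.out.println("accuracy = " + accuracy(table));
    }
}
```

The confusion table is the more useful output once several dialects are supported, since it shows which pairs of dialects get mistaken for each other.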

Nice work, thanks Oday and Robert.

I believe that we should include Arabic dialect profiles in the library, and start some separation in profile loading. I suspect that most users who want "all" languages just want one Arabic profile, one Norwegian, one English, one German, not dialects. Dialects are a special-purpose case.
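One way to picture that separation, as a sketch under assumptions rather than a proposal for the actual loader: keep a list of codes that count as dialect profiles and leave them out unless the caller opts in. The class name, method name, and the contents of the dialect set are hypothetical; "arz" is used only because it is the Egyptian Arabic code that appeared earlier in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of opt-in dialect profiles; not the library's loading code.
class ProfileSelectionSketch {

    // Hypothetical set of profile codes treated as dialects.
    static final Set<String> DIALECT_CODES = new HashSet<>(Arrays.asList("arz"));

    // Returns the codes to load, skipping dialects unless explicitly requested.
    static List<String> select(List<String> available, boolean includeDialects) {
        List<String> selected = new ArrayList<>();
        for (String code : available) {
            if (includeDialects || !DIALECT_CODES.contains(code)) {
                selected.add(code);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList("ar", "arz", "de", "en", "no");
        System.out.println(select(available, false));
        System.out.println(select(available, true));
    }
}
```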

Hi @odaymard , @rmtheis , @fabiankessler ,
I have started using the library and it's really helpful, but I may need to add a new language profile. Could you please tell me where I can get language corpora, and what steps are involved in generating a language profile from a corpus? Once I have the new language profile, how do I add it to the profile folder?

Hi @odaymard ,
Thanks for the suggestion. I am now able to add a new language profile as I needed.

Regards,
Supriti

Hi @odaymard , @rmtheis,
I have an Arabic dataset and want to identify the dialects in it (Egypt, the Levant, Iraq and the Gulf). How can I use this library to do that?
Thanks in advance
safaa

@odaymard Are you willing to publish the other Arabic dialect profiles that you've generated? Apart from the interests of others here, I would like to make a basic free Android app with the profiles you've made. It would be a simple, free app that allows a user to paste in Arabic text and get a dialect estimate based on this library.

@safaahenno I think only the Syrian profile is available as of right now.

@rmtheis Can I use this library in a NetBeans project? Could you give me the steps for using it in NetBeans without running into errors like "package org.jetbrains.annotations does not exist"? This is not clear from the README file.
Thanks in advance.

@safaahenno You should open a separate issue for that. This issue pertains to Arabic dialects only.

@rmtheis Ok, will do