optimaize/language-detector

Adding recognition of Walloon (wa) language

Opened this issue · 4 comments

srtxg commented

Hello,
I'm working on adding the Walloon language to LanguageTool, which in turn requires proper language detection from language-detector.
I don't see any clear instructions on how to generate a profile, so, as suggested, I'm attaching some text files: http://chanae.walon.org/walon/wa.zip
It's a small zip file with some random pages from Wikipedia and rifondou.walon.org (for the latter, I only took texts more than 70 years old); it's about 2 MB of text.
The zip includes plain-text dumps as well as the HTML pages (which most often include lang=... tags, in case that is useful to you).
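
For readers wondering what "generating a profile" amounts to: the following is only a conceptual sketch in Python, not language-detector's actual API or file format. It assumes a profile is essentially a table of character n-gram counts taken from training text; the function name `ngram_profile` and its parameters `max_n` and `min_freq` are hypothetical names chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, min_freq=1):
    """Count character n-grams (n = 1..max_n) inside each word.

    Each word is padded with one space on both sides so that
    word-initial and word-final n-grams are counted separately.
    Hypothetical helper; not part of any real library.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip():  # skip n-grams made only of padding
                    counts[gram] += 1
    # Drop rare n-grams, which are mostly noise in a small corpus.
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```

With a corpus of about 2 MB, a `min_freq` above 1 would probably be sensible, but the right threshold is a guess to be tuned against real text.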

Another thing to know about Walloon is that there are actually two ways of writing it:
a "unified orthography", called "rifondou" (which is the one used in those texts),
and a traditional "feller" one, which puts a lot of emphasis on local accent and phonetics, with the consequence that it is actually not one orthography but a group of orthographies (at the very least there are four main groups: western, central, eastern and southern).

What would be the best thing to do:

  • only focus on "rifondou"
  • dump together all ways of writing the language
  • create several profiles (wa@rif, wa@ch, wa@na, wa@lg, wa@ba) ?

Thanks
wa.zip

srtxg commented

OK, I managed to create it with help from rmtheis.
I opened a pull request (#50) with it.

Thank you! Walloon is in now.
Can you tell us which way you went? Is the language profile only rifondou, or more?

srtxg commented

Thanks,
The pull request I made covers only the normalized orthography ("rifondou").

Currently all the Walloon language tools (such as the spell checker and the early work on the LanguageTool grammar checker) use the normalized orthography.
However, a tool that can easily and automatically tell which variant/dialect a text is written in could be handy.
I have a meeting this month and will bring up the topic to see what other people think about it.
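
One way such a variant detector could work, sketched in Python under stated assumptions: keep one character-trigram profile per variant (using labels like `wa@rif` and `wa@lg` from the list earlier in the thread) and pick the profile most similar to the input by cosine similarity. This is not how language-detector itself scores text, and the helper names `trigrams`, `cosine` and `closest_variant` are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Count character trigrams over whitespace-normalized, lowercased text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count tables (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def closest_variant(text, profiles):
    """Return the label of the variant profile most similar to the text."""
    sample = trigrams(text)
    return max(profiles, key=lambda label: cosine(sample, profiles[label]))
```

In use, `profiles` would map each variant label to `trigrams(...)` of a training corpus for that orthography. Closely related orthographies share most of their trigrams, so whether this simple scoring separates them reliably would need to be checked on real texts.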