sinaahmadi/klpt

morphology for kurmanji

ftyers opened this issue · 7 comments

The apertium project has a morphological analyser for Kurmanji:
https://github.com/apertium/apertium-kmr

You could include it to get morphological analysis for Kurmanji. :)

It recognises 342870 forms and you can get a full form list using the lt-expand tool:

$ lt-expand apertium-kmr.kmr.dix  | wc -l
342870

Thanks, Francis. I was aware of it and will definitely include it.
Thanks for your kind attention 😊

If you would like help using the Apertium tools, please feel free to get in contact with us, either here or on IRC or on our mailing list. We know that our documentation isn't the best in the world!

I'm also happy to do some data conversion if it's more convenient for you.

So, my initial idea was to update the current Hunspell dictionary for Kurmanji, which is outdated. A while ago, I was asked by the OpenOffice community to update it.

I think the best idea would be to convert the Apertium data into a Hunspell affix file: two birds with one stone! Otherwise, a wrapper that integrates Apertium directly into KLPT should do the job.

Hmm, I think either of those methods could work. If I remember correctly, @flammie has some code for this. There is an apertium-python package, but it's very alpha, and you might be better off just writing a parser for the output of lt-expand.
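As a rough illustration of the "write a parser" option, here is a minimal sketch. It assumes each lt-expand line has the shape `surface:lemma<tag1><tag2>...`, optionally with a direction marker (`:>:` or `:<:`) in place of the plain colon. The names `Entry` and `parse_line` are hypothetical, and the sample line is constructed from the analysis quoted later in this thread rather than copied from real lt-expand output.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    surface: str
    lemma: str
    tags: Tuple[str, ...]


def parse_line(line: str) -> Optional[Entry]:
    """Parse one line of (assumed) lt-expand output into an Entry."""
    line = line.rstrip("\n")
    # Try the direction-restricted separators before the plain colon.
    for sep in (":>:", ":<:", ":"):
        if sep in line:
            surface, analysis = line.split(sep, 1)
            break
    else:
        return None  # no separator: not an entry line

    first_tag = analysis.find("<")
    if first_tag == -1:
        return Entry(surface, analysis, ())
    lemma = analysis[:first_tag]
    tags = tuple(re.findall(r"<([^>]+)>", analysis[first_tag:]))
    return Entry(surface, lemma, tags)


# Illustrative line only (an assumption, not real tool output).
sample = "bibêje:gotin<vblex><tv><imp><p2><sg>"
print(parse_line(sample))
```

Surface forms that themselves contain a colon, and multiword analyses, would need extra handling that this sketch leaves out.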

Another thing that could be done is just add support for ATT format files, which are transducers in the following format:

$ lt-print kmr.automorf.bin  | head
0	1	y	y	0.000000	
0	2	ê	ê	0.000000	
0	2	e	ê	0.000000	
0	3	b	b	0.000000	
0	4	d	d	0.000000	
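To make the format concrete, here is a minimal sketch of reading such a file and running a lookup. It is not the code from the PR mentioned in this thread. It assumes tab-separated columns (source state, target state, input symbol, output symbol, optional weight), final states given as lines with one or two fields, state 0 as the start state, and a small set of epsilon spellings. The names `read_att` and `lookup` are hypothetical.

```python
from collections import defaultdict

# Assumed spellings of the empty symbol; adjust for the tool that wrote the file.
EPSILON = {"ε", "@0@", "@_EPSILON_SYMBOL_@"}


def read_att(lines):
    """Return (transitions, finals) parsed from ATT-format lines."""
    transitions = defaultdict(list)  # state -> [(target, in_sym, out_sym, weight)]
    finals = {}                      # final state -> weight
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # The listing above ends each line with a tab, so drop empty trailing fields.
        while fields and fields[-1] == "":
            fields.pop()
        if len(fields) >= 4:
            src, dst, sym_in, sym_out = fields[:4]
            weight = float(fields[4]) if len(fields) > 4 else 0.0
            transitions[int(src)].append((int(dst), sym_in, sym_out, weight))
        elif fields:
            finals[int(fields[0])] = float(fields[1]) if len(fields) > 1 else 0.0
    return transitions, finals


def lookup(transitions, finals, word, start=0):
    """Collect every output string for `word` (assumes no epsilon-input cycles)."""
    results = []
    stack = [(start, 0, "")]  # (state, position in word, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word) and state in finals:
            results.append(out)
        for dst, sym_in, sym_out, _weight in transitions.get(state, ()):
            emitted = "" if sym_out in EPSILON else sym_out
            if sym_in in EPSILON:
                stack.append((dst, pos, out + emitted))
            elif word.startswith(sym_in, pos):
                stack.append((dst, pos + len(sym_in), out + emitted))
    return results


# Toy transducer: two transitions from the listing above plus an invented final state.
toy = ["0\t2\tê\tê\t0.000000\t", "0\t2\te\tê\t0.000000\t", "2\t0.000000"]
trans, fin = read_att(toy)
print(lookup(trans, fin, "e"))
```

The depth-first search ignores weights and keeps all paths, which is simple rather than fast.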

I made a basic PR in #5. The code is kind of pedagogical, so it isn't super optimised. Måns Huldén has some code for processing ATT files too; you can find it here.

Btw, I had the idea after reading your release notes. As you might be able to tell, I am a huge fan of automata as well! :D

Yeah 😁 Automata are really fun!

Thank you very much again, Francis. This was so quick and efficient. Your contribution should be available in the next release and can be used the exact same way as Sorani:

from klpt.stem import Stem
morph_analyzer = Stem("Kurmanji", "Latin")
print(morph_analyzer.analyze("bibêje"))
[{'base': 'gotin', 'description': 'vblex_tv_prs_p3_sg', 'pos': '', 'terminal_suffix': '', 'formation': ''}, {'base': 'gotin', 'description': 'vblex_tv_imp_p2_sg', 'pos': '', 'terminal_suffix': '', 'formation': ''}, {'base': 'gotin', 'description': 'vblex_tv_fut_p3_sg', 'pos': '', 'terminal_suffix': '', 'formation': ''}]

There are some delicate details that I'll take care of later, particularly structuring the output of the ATT analyzer.
Here you can see how it is integrated in the stem module.
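One possible way to structure that output, shown only as a sketch: split the underscore-joined `description` string into a part of speech and a feature list. This assumes the first tag is always the part of speech, which holds for the sample above but is not confirmed for all analyses. The function name `structure` is hypothetical and is not part of KLPT.

```python
def structure(analysis):
    """Split the flat 'description' string into POS and features (assumed layout)."""
    tags = analysis["description"].split("_")
    return {
        "base": analysis["base"],
        "pos": tags[0],        # assumption: first tag is the part of speech
        "features": tags[1:],  # remaining tags, order preserved
    }


# One analysis copied from the output quoted above.
raw = {"base": "gotin", "description": "vblex_tv_imp_p2_sg",
       "pos": "", "terminal_suffix": "", "formation": ""}
print(structure(raw))
```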

You should also appear in the contributors section in the README soon ;-)