Here is the the new home of the Zemberek project.
Zemberek-NLP is a Natural Language Processing library. Some modules are specifically developed for Turkish language.
Add this to pom.xml file
<repositories>
<repository>
<id>ahmetaa-repo</id>
<name>ahmetaa Maven Repo on Github</name>
<url>https://raw.github.com/ahmetaa/maven-repo/master</url>
</repository>
</repositories>
And dependecies (For example morphology):
<dependencies>
<dependency>
<groupId>zemberek-nlp</groupId>
<artifactId>morphology</artifactId>
<version>0.9.0</version>
</dependency>
</dependencies>
[Zemberek-NLP-Distributions] (https://github.com/ahmetaa/zemberek-nlp-distributions) page has versions and separate module and dependent jars.
Alternatively there is a public [Google docs page] (https://drive.google.com/#folders/0B9TrB39LQKZWSjNKdVcwWUxxUm8) for distributions downloads.
[Turkish-nlp-examples] (https://github.com/ahmetaa/turkish-nlp-examples) contains a maven java project with small usage examples.
- Project requires Java 7.
- Currently word and sentence parse module operations generates parse graph with each initialization. So each run in the system takes some seconds. We will fix it in the next version with fast serialization of the parse graph.
- Morphological parsing does not work for some obvious and frequent words.
- Morphological disambiguation is working less accurate then expected (Not very usable).
- Morphological generation may not work for some obvious Stem-Suffix combinations.
- Please see issues section for further issues and feel free to create new ones.
Core classes such as special Collection classes, Hash functions and helpers.
Turkish morphological parsing, disambiguation and generation. [Morphology Documentation] (https://github.com/ahmetaa/zemberek-nlp/tree/master/morphology)
Turkish Tokenization and sentence boundary detection. So far only rule based algorithms.
Turkish syllabification and hyphenation.
[Language model compression] (https://github.com/ahmetaa/zemberek-nlp/tree/master/lm)
Please refer to contributors.txt file.
Portions of this code has been developed in Tübitak BİLGEM's Speech and Language Technologies Laboratory.