Primary language: Python. License: MIT.

choppa

Partial Python port of the Java SRX segmenter, originally written by Jarek Lipski.

In a nutshell, it lets you split text into sentences. Because it is rule-based, you can use it to chop up any kind of text, not just sentences.

Ships with the segment.srx set of segmentation rules for different languages, crafted by the great LanguageTool team.
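To make "rule-based" more concrete, here is a minimal sketch of how SRX-style break and no-break rules can decide where to split. It does not use choppa; `ToyRule` and `toy_segment` are hypothetical names invented for this illustration, and the two rules are toy examples rather than anything from segment.srx. The loop is deliberately naive (it re-scans the text at every position), so treat it as an explanation of the idea, not as how the port works internally.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyRule:
    """Hypothetical stand-in for an SRX rule (not choppa's Rule class)."""
    is_break: bool  # True: split here; False: exception that forbids a split
    before: str     # regex that must match the text ending at the position
    after: str      # regex that must match the text starting at the position


def toy_segment(text, rules):
    """Split text wherever the first matching rule is a break rule."""
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for rule in rules:
            before_ok = re.search(f"(?:{rule.before})\\Z", text[:pos]) is not None
            after_ok = re.match(rule.after, text[pos:]) is not None
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule wins, so order matters
    segments.append(text[start:])
    return segments


# The no-break exception is listed first so that it shadows the generic rule.
RULES = [
    ToyRule(False, r"\bMr\.", r"\s"),
    ToyRule(True, r"[.?!]", r"\s"),
]

print(toy_segment("Mr. Smith left. He was late.", RULES))
# ['Mr. Smith left.', ' He was late.']
```

Note that the whitespace after the full stop ends up at the start of the second segment, because the toy break rule only consumes the punctuation on the "before" side.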

Quick Start

pip3 install git+https://github.com/lang-uk/choppa.git

cat << EOF | python3 -m choppa
Жоден сучасний електронний прилад не обходиться без мікрочипів. Мікрочіп, інакше кажучи, мікросхема - це набір електронних схем на невеликому плоскому шматку кремнію.
EOF

See choppa/main.py for a Python usage example.

Current status and plans

The port currently covers:

  • All structures (structures.py) necessary for the parser to operate (Rule, LanguageRule, LanguageMap)
  • Abstract, Accurate (legacy), and SrxTextIterator iterators (iterators.py), which segment text into chunks according to the SRX rules
  • Extra classes required for the SrxTextIterator (TextManager, RuleManager)
  • Some utilities (utils.py) for regex mangling
  • SAX-based parser (srx_parser.py) to read SRX rules from XML files (SRX 2.0 only)
  • SrxDocument class (also in srx_parser.py), which lets you manage rules and cache regexes
  • A partial implementation of the Java Matcher class, which has no direct equivalent in Python
  • Tests for everything above (and beyond)
  • Additional tests from LanguageTool for the Ukrainian language
  • Type hints
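As an illustration of the SAX-based approach to reading SRX 2.0 rules, the sketch below parses a tiny hand-written SRX document with the standard library's `xml.sax`. It is not the code from srx_parser.py: `ToySrxHandler` and `SRX_SAMPLE` are hypothetical names, the sample file is a stripped-down assumption of what an SRX 2.0 file looks like, and no schema validation is performed.

```python
import xml.sax

# A minimal, hand-written SRX 2.0 sample (an assumption for this sketch).
SRX_SAMPLE = rb"""<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\bMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="en.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>
"""


class ToySrxHandler(xml.sax.ContentHandler):
    """Collects rules and language maps from SAX events (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.rules = {}  # language rule name -> [(is_break, before, after)]
        self.maps = []   # [(language pattern, language rule name)]
        self._language = None
        self._rule = None
        self._field = None

    def startElement(self, name, attrs):
        if name == "languagerule":
            self._language = attrs["languagerulename"]
            self.rules[self._language] = []
        elif name == "rule":
            # A missing break attribute is treated as "yes" in this sketch.
            is_break = attrs.get("break", "yes") == "yes"
            self._rule = {"break": is_break, "beforebreak": "", "afterbreak": ""}
        elif name in ("beforebreak", "afterbreak"):
            self._field = name
        elif name == "languagemap":
            self.maps.append((attrs["languagepattern"], attrs["languagerulename"]))

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate it.
        if self._rule is not None and self._field is not None:
            self._rule[self._field] += content

    def endElement(self, name):
        if name in ("beforebreak", "afterbreak"):
            self._field = None
        elif name == "rule":
            rule = self._rule
            self.rules[self._language].append(
                (rule["break"], rule["beforebreak"], rule["afterbreak"])
            )
            self._rule = None


handler = ToySrxHandler()
xml.sax.parseString(SRX_SAMPLE, handler)
print(handler.rules)
print(handler.maps)
```

The streaming style means the handler only keeps a small amount of state (the current language rule, rule, and field) while walking the file, which is the main appeal of SAX over building a full tree.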

I also pythonized the code to some extent (by removing some setters/getters, snake_casing methods and variables, and adapting data structures).
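For readers unfamiliar with the Java Matcher class mentioned in the list above: unlike Python's `re` match objects, a Java Matcher is stateful, so repeated `find()` calls walk through the text, and `region()` restricts where it looks. The sketch below shows one way to imitate that on top of `re`. `ToyMatcher` is a hypothetical class written for this illustration, not the implementation shipped in the port, and Java's anchoring and transparent bounds options are not modelled.

```python
import re


class ToyMatcher:
    """A tiny stand-in for the stateful parts of java.util.regex.Matcher."""

    def __init__(self, pattern, text):
        self.pattern = re.compile(pattern)
        self.text = text
        self._region_end = len(text)
        self._next = 0       # where the next find() starts searching
        self._match = None   # result of the last successful find()

    def region(self, start, end):
        """Restrict matching to text[start:end] and reset the search state."""
        self._region_end = end
        self._next = start
        self._match = None
        return self

    def find(self):
        """Advance to the next match; return True if one was found."""
        if self._next > self._region_end:
            self._match = None
            return False
        self._match = self.pattern.search(self.text, self._next, self._region_end)
        if self._match is None:
            return False
        if self._match.end() == self._match.start():
            # Step past empty matches so repeated calls always make progress.
            self._next = self._match.end() + 1
        else:
            self._next = self._match.end()
        return True

    def start(self):
        return self._match.start()

    def end(self):
        return self._match.end()

    def group(self, index=0):
        return self._match.group(index)


matcher = ToyMatcher(r"\d+", "a1 b22 c333")
while matcher.find():
    print(matcher.start(), matcher.group())
# 1 1
# 4 22
# 8 333
```

Calling `start()`, `end()`, or `group()` before a successful `find()` raises an `AttributeError` in this sketch, whereas Java raises `IllegalStateException`; a fuller port would want to handle that case explicitly.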

Important notes

First and foremost, I would like to thank Jarek for his work and code quality. My project is not original; it just brings the power of the SRX segmenter to the Python world, and it relies entirely on the work done by Jarek.

Please note that only the Accurate and Ultimate iterators are currently implemented (and I don't have immediate plans to implement the rest). The Accurate iterator should work well on relatively small documents (i.e. do not use it on multi-GB plaintext corpora!), but it is known to have some bugs. The Ultimate iterator from the original library is also ported; it parses large documents efficiently at some cost in accuracy (limited look-behind patterns, etc.). If you need other iterators or are keen to optimize that beast, I'm always open to pull requests. Similarly, I've only implemented the SAX reader for rules, and I'm using the xmlschema package for schema validation.

Also, I have no plans to port the UI. You can reuse one of the existing UIs.

Copyrights and kudos

  • Python port: Dmytro Chaplynskyi
  • Original Java implementation: Jarek Lipski
  • Segmentation rules: Daniel Naber, Jaume Ortolà et al. (153 contributors!)
  • Special thanks to Andriy Rysin, the driving force behind the Ukrainian language in LanguageTool