/srx-ruby

An SRX segmenting engine for Ruby

Primary LanguageRubyMIT LicenseMIT

SRX for Ruby

SRX is a specification for segmenting text, i.e. splitting text into sentences. More specifically it is

  • An XML-based format for specifying segmentation rules, and
  • An algorithm by which the rules are applied

See the SRX 2.0 Specification for full details.

This gem provides facilities for reading SRX files and an engine for performing segmentation.

Only a minimal rule set is supplied by default; for actual usage you are encouraged to supply your own SRX rules. One such set of rules is that from LanguageTool; this is conveniently packaged into a companion gem: srx-languagetool-ruby.

What's different about this gem?

There are lots of good segmentation gems out there such as

What makes SRX different is:

  • It allows easy customization and exchange of rules via SRX files
  • It preserves whitespace surrounding break points
  • It offers advanced XML/HTML tag handling: it won't be fooled by false breaks in e.g. attribute values

Some other advantages that are not unique to SRX:

  • It is offered under a very permissive license
  • It is relatively lightweight as a dependency
  • It is fast (though this depends somewhat on the ruleset you use)

Some disadvantages:

  • It is inherently rule-based, with all of the weaknesses that implies
  • It is not very accurate on the Golden Rules test, scoring 47% (English) and 48% (others) with the default rules. However you can improve on that with better rules such as LanguageTool's.

Caveats

The SRX spec calls for ICU regular expressions, but this library uses standard Ruby regexp. Please note:

  • Not all ICU syntax is supported
  • For supported syntax, in some cases the meaning of a regex may differ when interpreted as Ruby regexp
  • The following ICU syntax is supported through translation to Ruby syntax:
    • \x{hhhh}\u{hhhh}
    • \0ooo\u{hhhh}

Installation

Add this line to your application's Gemfile:

gem 'srx'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install srx

Usage

Use the default rules like so. Specify the language according the <maprules> of your SRX (usually two-letter ISO 639-1 codes).

require 'srx'

data = Srx::Data.default
engine = Srx::Engine.new(data)
engine.segment('Hi. How are you?', language: 'en') #=> ["Hi.", " How are you?"]

Or bring your own rules:

data = Srx::Data.from_file(path: 'path/to/my/rules.srx')
engine = Srx::Engine.new(data)

Specify the format as :xml or :html to benefit from special handling of tags:

# This should only be one segment, but handling as plain text incorrectly
# produces two segments.
input = 'foo <bar baz="a. b."> bazinga'

Srx::Engine.new(Data.default).segment(input, language: 'en')
#=> ["foo <bar baz=\"a.", " b.\"> bazinga"]

Srx::Engine.new(Data.default, format: :xml).segment(input, language: 'en')
#=> ["foo <bar baz=\"a. b.\"> bazinga"]

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/amake/srx.

License

The gem is available as open source under the terms of the MIT License.