rTika¶ ↑

A JRuby wrapper around the excellent Apache Tika content extraction library. Feed rTika your files and get extracted text and metadata in return.

Usage¶ ↑

Make sure you’re on JRuby first.

require 'rubygems'
require 'rtika'

result = RTika::FileParser.parse("mywordfile.doc")
puts result.content # prints out the document's contents
puts result.title   # fetches title from the doc's metadata
puts result.author  # fetches author from the doc's metadata

result = RTika::StringParser.parse("<html>

<head><title>MYTITLE</title></head> <body>this is my very … long … string</body></html>“)

puts result.content # returns <body> contents
puts result.title   # returns <title> contents

Options

:remove_boilerplate => true
  # uses the Boilerpipe library that ships with Tika to remove headers & footers

Note on Patches/Pull Requests¶ ↑

Fork the project.
Make your feature addition or bug fix.
Add tests for it. This is important so I don’t break it in a future version unintentionally.
Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
Send me a pull request. Bonus points for topic branches.

mr-dxdy/rtika

rTika¶ ↑

Usage¶ ↑

Note on Patches/Pull Requests¶ ↑

Copyright¶ ↑