Parse text contents from common file formats

DocRipper

Grab the text from common document formats with one command. DocRipper is an extremely lightweight Ruby wrapper that parses the text contents of common file formats (currently .doc, .docx, .pdf, .txt and .sketch) without heavy dependencies such as an OCR library or OpenOffice/LibreOffice.

For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion.

Need OCR support or in-image text parsing? Take a look at Docsplit.

Supported File Formats

File format   Supported?   Dependencies
.doc          x            Antiword
.docx         x
.pdf          x            Poppler-utils
.txt          x
.sketch       x
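
Before ripping .doc or .pdf files, it can help to confirm that the external tools are installed. The following is a minimal sketch, not part of DocRipper. It assumes the dependencies are reachable as `antiword` and `pdftotext` (the text extractor shipped with Poppler-utils) on the PATH. The helper names `executable_on_path?` and `missing_tools` are hypothetical.

```ruby
# Hypothetical helper, not part of DocRipper: true if `name` is an
# executable file in one of the directories listed in `path_env`.
def executable_on_path?(name, path_env = ENV.fetch('PATH', ''))
  path_env.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Returns the file extensions whose command-line tool could not be found.
# The extension-to-tool mapping is an assumption based on the table above.
def missing_tools(path_env = ENV.fetch('PATH', ''),
                  tools: { '.doc' => 'antiword', '.pdf' => 'pdftotext' })
  tools.reject { |_ext, tool| executable_on_path?(tool, path_env) }.keys
end

missing = missing_tools
puts "Missing tools for: #{missing.join(', ')}" unless missing.empty?
```

Formats without an entry in the Dependencies column (.docx, .txt, .sketch) need no such check.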

Quickstart

  gem install doc_ripper

Pass the path of the file you want to read:

  require 'doc_ripper'

  DocRipper::rip('/path/to/file')

If the file cannot be read, nil will be returned.

  DocRipper::rip('/path/to/missing/file')
  => nil
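
Because rip returns nil on failure, callers usually want a fallback value. A minimal sketch of one way to do that follows. The helper `rip_or_default` and its `ripper:` parameter are hypothetical and not part of DocRipper. The parameter exists only so the helper can be exercised without the gem installed.

```ruby
# Hypothetical helper, not part of DocRipper. `ripper` is any callable that
# takes a path; when omitted, it delegates to DocRipper.rip.
def rip_or_default(path, default: '', ripper: nil)
  ripper ||= lambda do |p|
    require 'doc_ripper'
    DocRipper.rip(p)
  end
  text = ripper.call(path)
  # rip signals an unreadable file with nil, so substitute the default.
  text.nil? ? default : text
end
```

For example, `rip_or_default('/path/to/missing/file', default: '(no text)')` would return the placeholder string rather than nil.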

Want to raise an exception? Use #rip!

#rip! raises an exception when rip would return nil or when the file type isn't supported.

  # invalid file type
  DocRipper::rip!('/path/to/invalid/file.type')
  => DocRipper::UnsupportedFileType

  # missing file
  DocRipper::rip!('/path/to/missing/file.doc')
  => DocRipper::FileNotFound
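
The two exception classes above can be rescued separately to tell the failure modes apart. The following is a sketch under assumptions rather than a documented pattern. The helper `rip_status`, its `ripper:` parameter, and the status symbols are hypothetical. Only `DocRipper::UnsupportedFileType` and `DocRipper::FileNotFound` come from the examples above.

```ruby
# Hypothetical helper, not part of DocRipper. Returns a [status, text] pair.
# `ripper` is any callable that takes a path; when omitted, it delegates to
# DocRipper.rip!.
def rip_status(path, ripper: nil)
  ripper ||= lambda do |p|
    require 'doc_ripper'
    DocRipper.rip!(p)
  end
  [:ok, ripper.call(path)]
rescue DocRipper::UnsupportedFileType
  # The file extension is not one DocRipper handles.
  [:unsupported_type, nil]
rescue DocRipper::FileNotFound
  # The path does not point at a readable file.
  [:not_found, nil]
end
```

A caller could then branch on the status, for instance skipping unsupported files in a batch job while reporting missing ones.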

Dependencies