DocRipper is an extremely lightweight Ruby wrapper that can be used to parse text contents from common file formats (currently .doc, .docx and .pdf) without the need for a large number of dependencies like an OCR library or OpenOffice/LibreOffice.
For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion. I found
Need OCR support or in-image text parsing? Take a look at Docsplit.
gem install doc_ripper
DocRipper::TextRipper.new('/path/to/file')
dr = DocRipper::TextRipper.new('/path/to/file')
dr.text
=> "Document's text"
If the file cannot be read, nil will be returned.
dr = DocRipper::TextRipper.new('/path/to/missing/file')
dr.text
=> nil
- Ruby version >= 1.9.2
- Poppler-utils/(pdftotext) (PDF)
- Antiword (docx) more info: http://linux.die.net/man/1/antiword