This Omeka plugin creates XML files from PDF files using pdftohtml. Each XML file is stored as a new file associated with the item.

Extract OCR (plugin for Omeka)

Summary

Omeka plugin that extracts OCR text from PDF files as XML, which allows full-text searching within the BookReader plugin for Omeka.

See a demo of the plugin in the Bibliothèque numérique de l'université Rennes 2 (France).

Installation

  • This plugin needs the pdftohtml command-line tool on your server. On Debian or Ubuntu it is provided by the poppler-utils package:
    sudo apt-get install poppler-utils
  • Upload the Extract OCR plugin folder into your plugins folder on the server.
  • Alternatively, install the plugin from GitHub:
    cd omeka/plugins
    git clone git@github.com:symac/Plugin-ExtractOcr.git "ExtractOcr"
  • Activate it from the admin → Settings → Plugins page.
  • Click the Configure link to choose whether existing PDF files should be processed.
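
Before activating the plugin, it can help to confirm that pdftohtml is reachable from the shell and can produce XML from one of your PDFs. The sketch below assumes a POSIX shell. The helper names have_tool and pdf_to_xml are made up for this example, and the options shown (-xml, -i) are only one plausible combination; the plugin itself may call pdftohtml with different options.

```shell
# Return success if the named command-line tool is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: convert one PDF to XML next to the source file.
#   -xml  write XML instead of HTML
#   -i    skip image extraction
# The second argument is the output base name; pdftohtml should add ".xml".
pdf_to_xml() {
    pdftohtml -xml -i "$1" "${1%.pdf}"
}

if have_tool pdftohtml; then
    echo "pdftohtml found: $(command -v pdftohtml)"
else
    echo "pdftohtml not found; install poppler-utils first" >&2
fi
```

To try a manual conversion, call pdf_to_xml with a PDF of your own, for example pdf_to_xml sample.pdf (sample.pdf is a placeholder), and check that an XML file appears beside it.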

Using the Extract OCR Plugin

  • Create an item.
  • Add PDF file(s) to this item.
  • Save the item.
  • To locate the extracted OCR XML file, select the item to which the PDF is attached. Normally, you should see an XML file attached to the record with the same filename as the PDF file.
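
When checking many items, it may be convenient to compute the XML name to look for. This is a minimal sketch, assuming the XML keeps the PDF's base name and only the extension changes; the function name xml_name_for is made up for this example and is not part of the plugin.

```shell
# Print the XML file name expected for a given PDF file name.
# Assumption: same base name, ".xml" extension.
xml_name_for() {
    base=$(basename "$1")
    printf '%s.xml\n' "${base%.*}"
}

xml_name_for "thesis_1998.pdf"   # prints thesis_1998.xml
```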

Optional plugins

  • BookReader: this plugin adds the Internet Archive BookReader to Omeka. If both plugins (BookReader and ExtractOcr) are installed, full-text search is available within the BookReader frame. To enable it, overwrite Bookreader/libraries/BookReaderCustom.php with Bookreader/libraries/BookReaderCustom_extractOCR.php.
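
The overwrite can be done by hand, but keeping a copy of the original file makes it easy to revert. The following is a sketch only: the function name enable_extractocr_search and the .bak suffix are choices made for this example, and it assumes the two files sit in the same libraries directory, as the paths above suggest.

```shell
# Replace BookReaderCustom.php with the ExtractOcr-aware variant,
# keeping a backup of the original.
# $1: path to the BookReader plugin's libraries directory.
enable_extractocr_search() {
    libdir=$1
    src="$libdir/BookReaderCustom_extractOCR.php"
    dst="$libdir/BookReaderCustom.php"

    if [ ! -f "$src" ]; then
        echo "missing $src" >&2
        return 1
    fi

    # Keep the original so the change can be reverted later.
    if [ -f "$dst" ]; then
        cp -p "$dst" "$dst.bak" || return 1
    fi

    cp "$src" "$dst"
}

# Example, run from the Omeka plugins folder:
# enable_extractocr_search Bookreader/libraries
```

To revert, copy BookReaderCustom.php.bak back over BookReaderCustom.php.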

Troubleshooting

See online Extract OCR issues.

License

This plugin is published under the GNU/GPL.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Contact

  • Sylvain Machefert, Université Bordeaux 3 (see symac)