content-extractor

Simple Apache Tika and Tesseract OCR powered file content extractor service. Service runs on port 5000 by default and has two endpoints:

post_file_list - consumes POST request with one or more files
post_json_list - consumes POST request with JSON payload which contains URL's to files needed to be downloaded and parsed

To run it on local machine Python 3+, Java 8+, Tesseract and Norwegian language support for Tesseract needed.

All configuration options passes via environmental variables.

TIKA_VERSION = 1.20 (required)
UPLOAD_URL - where to upload parsed data (only for post_json_list endpoint). All files will preserve their names with ".txt" added
FILE_URL - which data entity attribute contains url to file
FILE_NAME - which data entity attribute contains name of file
LOG_LEVEL - logging level ("INFO" by default)
FAIL_ON_ERROR - If need fail fast on parsing or other processing errors or not. True by default
TESSERACT_OCR_LANG - Language(s) to be used by tesseract OCR (need to be installed)
PDF_EXTRACT_INLINE_IMG instruct Tika to retrieve text from images in PDF file true/fals, true by default
PRESERVE_FILE_TYPE if resulted file name will carry information about original file type or not, False by default

timurgen/content-extractor