text-extraction
There are 357 repositories under the text-extraction topic.
adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
unidoc/unipdf
Golang PDF library for creating and processing PDF files (pure Go)
Goldziher/kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
whitelok/image-text-localization-recognition
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
miso-belica/jusText
Heuristic-based boilerplate removal tool
unidoc/unidoc
This repository has moved! https://github.com/unidoc/unipdf
ICIJ/datashare
A self‑hosted search engine for documents. Help us improve Datashare by answering a survey on structured content: https://forms.gle/PYgusFsoBaMyzUec9
ropensci/pdftools
Text Extraction, Rendering and Converting of PDF Documents
cdown/srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
flairNLP/fundus
A very simple news crawler with a funny name
shixzie/nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
py-pdf/benchmarks
Benchmarking PDF libraries
Goldziher/html-to-markdown
HTML to markdown converter
bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
weareprestatech/hotpdf
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
SapienzaNLP/extend
Entity Disambiguation as text extraction (ACL 2022)
skylander86/lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
vsymbol/CUTIE
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
sambitdash/PDFIO.jl
PDF Reader Library for Native Julia.
vaites/php-apache-tika
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
victorqribeiro/ocr
Simple app to extract text from pictures using Tesseract
lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
jmriebold/BoilerPy3
Python port of Boilerpipe library
docwire/docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
nainiayoub/pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
gamemaker1/office-text-extractor
Yet another library to extract text from MS Office and PDF files
ckorzen/pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
JonathanRaiman/wikipedia_ner
📖 Labeled examples from wiki dumps in Python
iscc/mobi
Python-based software to unpack KindleGen-generated ebooks
abhinaba-ghosh/any-text
Get text content from any file