text-extraction

There are 383 repositories under the text-extraction topic.

adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Language: Python 4.9k 32 410326
miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
Language: Python 3.6k 111 125540
unidoc/unipdf
Golang PDF library for creating and processing PDF files (pure Go)
Language: Go 2.9k 28 322277
Goldziher/kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
Language: HTML 2.5k 12 53111
chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Language: Python 1.6k 40 296245
whitelok/image-text-localization-recognition
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.
955 73 0233
miso-belica/jusText
Heuristic based boilerplate removal tool
Language: Python 803 19 3086
unidoc/unidoc
This repository has moved! https://github.com/unidoc/unipdf
Language: Go 707 15 086
ICIJ/datashare
A self‑hosted search engine for documents
Language: Java 667 27 1.6k64
ropensci/pdftools
Text Extraction, Rendering and Converting of PDF Documents
Language: C++ 541 27 11672
cdown/srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
Language: Python 526 14 8949
iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
Language: Python 441 5 2562
flairNLP/fundus
A very simple news crawler with a funny name
Language: Python 415 8 129108
shixzie/nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
Language: Go 389 19 932
Goldziher/html-to-markdown
HTML to markdown converter
Language: HTML 331 4 3535
pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Language: HTML 329 7 2239
py-pdf/benchmarks
Benchmarking PDF libraries
Language: Python 315 5 1020
bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
Language: HTML 205 20 2325
weareprestatech/hotpdf
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
Language: Python 196 2 239
SapienzaNLP/extend
Entity Disambiguation as text extraction (ACL 2022)
Language: Python 182 5 1113
skylander86/lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
Language: Python 176 9 544
vsymbol/CUTIE
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Language: Python 157 15 1577
archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Language: Scala 147 14 27334
sambitdash/PDFIO.jl
PDF Reader Library for Native Julia.
Language: Julia 134 4 8116
vaites/php-apache-tika
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Language: PHP 117 5 2724
victorqribeiro/ocr
Simple app to extract text from pictures using Tesseract
Language: HTML 106 2 09
lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
Language: Go 104 4 1017
iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Language: Python 101 3 28
docwire/docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
Language: C++ 94 8 8424
jmriebold/BoilerPy3
Python port of Boilerpipe library
Language: Python 93 3 417
nainiayoub/pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
Language: Python 91 4 450
gamemaker1/office-text-extractor
Yet another library to extract text from MS Office and PDF files
Language: TypeScript 84 2 157
ckorzen/pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
Language: TeX 69 6 211
iscc/mobi
Python-based software to unpack KindleGen-generated ebooks
Language: Python 68 1 138
JonathanRaiman/wikipedia_ner
Labeled examples from wiki dumps in Python
Language: Jupyter Notebook 67 3 27
abhinaba-ghosh/any-text
Get text content from any file
Language: JavaScript 64 1 1610
