text-extraction
There are 244 repositories under the text-extraction topic.
adbar/trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
unidoc/unipdf
Golang PDF library for creating and processing PDF files (pure go)
chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
whitelok/image-text-localization-recognition
A general list of resources on image text localization and recognition (a collection of papers and implementations for scene text localization and recognition)
miso-belica/jusText
Heuristic-based boilerplate removal tool
unidoc/unidoc
This repository has moved! https://github.com/unidoc/unipdf
ICIJ/datashare
A self-hosted search engine for documents.
ropensci/pdftools
Text Extraction, Rendering and Converting of PDF Documents
cdown/srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
shixzie/nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
flairNLP/fundus
A very simple news crawler with a funny name
pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
py-pdf/benchmarks
Benchmarking PDF libraries
bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
weareprestatech/hotpdf
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
SapienzaNLP/extend
Entity Disambiguation as text extraction (ACL 2022)
skylander86/lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
vsymbol/CUTIE
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
sambitdash/PDFIO.jl
PDF Reader Library for Native Julia.
vaites/php-apache-tika
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
victorqribeiro/ocr
Simple app to extract text from pictures using Tesseract
lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
jmriebold/BoilerPy3
Python port of Boilerpipe library
nainiayoub/pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
docwire/docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
gamemaker1/office-text-extractor
Yet another library to extract text from MS Office and PDF files
JonathanRaiman/wikipedia_ner
📖 Labeled examples from wiki dumps in Python
ckorzen/pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
abhinaba-ghosh/any-text
Get text content from any file
iscc/mobi
Python-based software to unpack KindleGen-generated ebooks
fourdigits/wagtail_textract
Text extraction for Wagtail document search
pd3f/pd3f-core
📑 Python Package to reconstruct the original continuous text from PDFs with language models
hscspring/pnlp
NLP pre-/post-processing tools.