data-extraction
There are 546 repositories under the data-extraction topic.
vi3k6i5/flashtext
Extract keywords from sentences or replace keywords in sentences.
JonathanLink/PDFLayoutTextStripper
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, for extracting the content of a table in a PDF file. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
polyrabbit/hacker-news-digest
📰 Let ChatGPT Summarize Hacker News for You
adrienjoly/npm-pdfreader
🚜 Parse text and tables from PDF files.
a-maliarov/amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
thinh-vu/vnstock
A powerful Python library for getting rich data from the Vietnam Stock Market using just a few lines of code
molybdenum-99/infoboxer
Wikipedia information extraction library
sypht-team/sypht-python-client
A Python client for the Sypht API
py-pdf/benchmarks
Benchmarking PDF libraries
serpapi/clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
johnbumgarner/newspaper3_usage_overview
This repository provides usage examples for the Python module Newspaper3k.
173TECH/sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
nfx/go-htmltable
Structured HTML table data extraction from URLs in Go, with almost no external dependencies
dilawar/PlotDigitizer
A Python utility to digitize plots.
villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
hermit-crab/ScrapeMate
Scraping assistant tool. Editing and maintaining CSS/XPath selectors across webpages.
nppoly/cyac
High-performance trie and Aho-Corasick automaton (AC automaton) keyword match and replace tool for Python
sshniro/line-segmentation-algorithm-to-gcp-vision
Line segmentation algorithm for Google Vision API.
reincubate/ricloud
Python client for Reincubate's ricloud API. Yes, it works with iOS 14 & iPhone 12 backups!
dav009/flash
Golang keyword extraction/replacement data structure using tries instead of regexes
sypht-team/sypht-java-client
A Java client for the Sypht API
danburzo/hred
Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.
WeTransfer/format_parser
file metadata parsing, done cheap
docwire/docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality.
rohanpillai20/Table-Extractor-From-Image
This repository contains code that extracts a table from an image and exports it to an Excel file.
scopashq/typestream
⚡️ Next-generation data transformation framework for TypeScript that puts developer experience first
uhh-lt/newsleak
Information extraction and interactive visualization of textual datasets for investigative data-driven journalism and eDiscovery
html-extract/hext
Domain-specific language for extracting structured data from HTML documents
VorTECHsa/refinery
Refinery is a tool to extract and transform semi-structured data from Excel spreadsheets of different layouts in a declarative way.
tech-engine/goscrapy
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
rekloud/tinvois-parser
Extract receipt info
Articdive/ArticData
Collection of data extracted from Minecraft.
linw1995/jsonpath
A query expression for extracting data from JSON.
Zubdata/Google-Maps-Scraper
Google Maps scraper with a GUI
shriprem/FWDataViz
Fixed Width Data Visualizer plugin for Notepad++. Turns Notepad++ into Excel for fixed-width data files. Displays cursor position data. Jumps to specific fields. Folds record blocks. Extracts data. Built-in dialogs to configure file type, record type & fields; themes & colors; and folding. Handles homogeneous, mixed & multi-line records.
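
Several entries above (flashtext, cyac, flash) describe keyword extraction and replacement built on tries rather than regexes. The following is a minimal, dependency-free sketch of that general idea, not the implementation or API of any of those libraries. The function names `build_trie` and `replace_keywords`, the `"_end"` marker key, and the case-insensitive, word-boundary matching rules are all assumptions made for illustration.

```python
def build_trie(keywords):
    """Build a character trie from a {keyword: replacement} mapping.

    The "_end" key (hypothetical marker) stores the replacement at the
    node where a keyword ends. It cannot collide with the single-character
    keys used for the trie edges.
    """
    root = {}
    for keyword, replacement in keywords.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node["_end"] = replacement
    return root


def replace_keywords(text, trie):
    """Replace the longest keyword found at each word start.

    A match counts only if it begins and ends on a word boundary, so the
    keyword "ny" does not match inside "Any" or "nyc".
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        at_word_start = i == 0 or not text[i - 1].isalnum()
        match_end, match_value = -1, None
        if at_word_start:
            node, j = trie, i
            # Walk the trie one character at a time, remembering the
            # longest keyword that ends on a word boundary.
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                if "_end" in node and (j == n or not text[j].isalnum()):
                    match_end, match_value = j, node["_end"]
        if match_end != -1:
            out.append(match_value)
            i = match_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


trie = build_trie({"big apple": "New York", "ny": "New York"})
print(replace_keywords("I love Big Apple and NY.", trie))
```

Because the text is scanned once and each step is a dictionary lookup, the cost of this approach depends mainly on the length of the text rather than on the number of keywords, which is the usual argument for a trie over a large alternation of regexes.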