data-extraction
There are 1072 repositories under the data-extraction topic.
firecrawl/firecrawl
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
getmaxun/maxun
⚡ Easiest no-code web data extraction platform • Instantly turn any website into an API or spreadsheet ⚡
D4Vinci/Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
vi3k6i5/flashtext
Extract keywords from a sentence or replace keywords in sentences.
shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
JonathanLink/PDFLayoutTextStripper
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful for extracting the content of a table in a PDF file, for instance. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
brightdata/brightdata-mcp
A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
raznem/parsera
Lightweight library for scraping websites with LLMs
saifyxpro/HeadlessX
A lightweight, self-hosted headless browser automation platform. Designed as an alternative to Browserless, built for speed, privacy, and scalability.
thinh-vu/vnstock
A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone
polyrabbit/hacker-news-digest
📰 Let ChatGPT Summarize Hacker News for You
adrienjoly/npm-pdfreader
🚜 Parse text and tables from PDF files.
ScrapeGraphAI/scrapecraft
🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.
eclaire-labs/eclaire
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
a-maliarov/amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
py-pdf/benchmarks
Benchmarking PDF libraries
jpjacobpadilla/Stealth-Requests
Undetected web-scraping & seamless HTML parsing in Python!
serpapi/clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
molybdenum-99/infoboxer
Wikipedia information extraction library
sypht-team/sypht-python-client
A Python client for the Sypht API
dilawar/PlotDigitizer
A Python utility to digitize plots.
johnbumgarner/newspaper3_usage_overview
This repository provides usage examples for the Python module Newspaper3k.
CambioML/any-parser
Accurate, private and configurable document retrieval LLM
nfx/go-htmltable
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
173TECH/sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
hermit-crab/ScrapeMate
Scraping assistant tool. Editing and maintaining CSS/XPath selectors across webpages.
tech-engine/goscrapy
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
Zubdata/Google-Maps-Scraper
Google Maps scraper with GUI
reincubate/ricloud
Python client for Reincubate's ricloud API. Yes, it works with iOS 14 & iPhone 12 backups!
sshniro/line-segmentation-algorithm-to-gcp-vision
Line segmentation algorithm for Google Vision API.
chenkovsky/cyac
High-performance Trie and Aho-Corasick automaton (AC automaton) keyword match & replace tool for Python. Correct case-insensitive implementation!
docwire/docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality.
dav009/flash
Golang keyword extraction/replacement data structure using tries instead of regexes
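Several entries above (flashtext, cyac, flash) describe keyword extraction and replacement built on a trie rather than regular expressions. The following is a minimal sketch of that general idea in Python, under stated assumptions: the `KeywordTrie` class and its `add`/`replace` methods are hypothetical names invented for illustration and are not the API of any listed library. It matches case-insensitively, prefers the longest keyword, and only matches on word boundaries.

```python
def _is_word_char(ch):
    """Characters that count as part of a word (assumption: alphanumerics and '_')."""
    return ch.isalnum() or ch == "_"


class KeywordTrie:
    """Hypothetical illustration of trie-based keyword replacement."""

    _END = object()  # sentinel key marking the end of a stored keyword

    def __init__(self):
        self._root = {}

    def add(self, keyword, replacement):
        # Walk or create one nested dict per lowercased character.
        node = self._root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[self._END] = replacement

    def _longest_match(self, text, start):
        # Follow the trie from `start`; remember the longest keyword
        # that ends on a word boundary.
        node = self._root
        best = None
        i = start
        while i < len(text):
            node = node.get(text[i].lower())
            if node is None:
                break
            i += 1
            at_end = i == len(text) or not _is_word_char(text[i])
            if self._END in node and at_end:
                best = (i, node[self._END])
        return best

    def replace(self, text):
        out = []
        i = 0
        while i < len(text):
            # Only try to match where a word can start.
            at_boundary = i == 0 or not _is_word_char(text[i - 1])
            match = self._longest_match(text, i) if at_boundary else None
            if match:
                end, replacement = match
                out.append(replacement)
                i = end
            else:
                out.append(text[i])
                i += 1
        return "".join(out)


trie = KeywordTrie()
trie.add("big apple", "New York")
trie.add("java", "Java")
result = trie.replace("I love the Big Apple and javascript, not java.")
print(result)
```

Because each character of the input is followed through the trie rather than tested against every keyword pattern, the cost of a lookup depends on the text and the length of the matched keywords, not on how many keywords are stored. This is the usual motivation given for the trie approach when the keyword list is large.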