data-extraction

There are 1,072 repositories under the data-extraction topic.

firecrawl/firecrawl
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
Language: TypeScript 67k 257 7535.2k
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
Language: Python 21.7k 135 4151.9k
getmaxun/maxun
⚡ Easiest no code web data extraction platform • Instantly turn any website into API or spreadsheet ⚡
Language: TypeScript 13.8k 78 2741.1k
D4Vinci/Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
Language: Python 8.1k 50 44463
vi3k6i5/flashtext
Extract Keywords from sentence or Replace keywords in sentences.
Language: Python 5.7k 139 114605
shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
Language: Python 1.7k 13 13135
JonathanLink/PDFLayoutTextStripper
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).
Language: Java 1.6k 52 34214
brightdata/brightdata-mcp
A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.
Language: JavaScript 1.6k 8 26204
hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Language: Python 1.5k 36 219233
raznem/parsera
Lightweight library for scraping web-sites with LLMs
Language: Python 1.2k 19 1769
saifyxpro/HeadlessX
A lightweight, self-hosted headless browser automation platform. Designed as an alternative to Browserless, built for speed, privacy, and scalability.
Language: JavaScript 1.1k 143
thinh-vu/vnstock
A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone
Language: Python 1k 50 105222
polyrabbit/hacker-news-digest
📰 Let ChatGPT Summarize Hacker News for You
Language: Python 733 17 2395
adrienjoly/npm-pdfreader
🚜 Parse text and tables from PDF files.
Language: HTML 692 8 7787
ScrapeGraphAI/scrapecraft
🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.
Language: Python 551 4 190
eclaire-labs/eclaire
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
Language: TypeScript 48849
a-maliarov/amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
Language: Python 486 15 4891
py-pdf/benchmarks
Benchmarking PDF libraries
Language: Python 315 5 1020
jpjacobpadilla/Stealth-Requests
Undetected web-scraping & seamless HTML parsing in Python!
Language: Python 311 4 417
serpapi/clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
Language: Ruby 187 6 113
molybdenum-99/infoboxer
Wikipedia information extraction library
Language: Ruby 176 9 7913
sypht-team/sypht-python-client
A python client for the Sypht API
Language: Python 162 4 05
dilawar/PlotDigitizer
A Python utility to digitize plots.
Language: Python 155 9 1525
johnbumgarner/newspaper3_usage_overview
This repository provides usage examples for the Python module Newspaper3k.
Language: Python 148 4 116
CambioML/any-parser
Accurate, private and configurable document retrieval LLM
Language: Python 130 3 014
nfx/go-htmltable
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
Language: Go 122 3 29
173TECH/sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Language: Python 120 5 4715
villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
Language: Python 108 1 23
hermit-crab/ScrapeMate
Scraping assistant tool. Editing and maintaining CSS/XPath selectors across webpages.
Language: JavaScript 105 6 314
tech-engine/goscrapy
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
Language: Go 101 6 12
Zubdata/Google-Maps-Scraper
Google maps scraper with gui
Language: Python 100 1 1036
reincubate/ricloud
Python client for Reincubate's ricloud API. Yes, it works with iOS 14 & iPhone 12 backups!
Language: Python 96 19 825
sshniro/line-segmentation-algorithm-to-gcp-vision
Line segmentation algorithm for Google Vision API.
Language: Kotlin 96 11 1636
chenkovsky/cyac
High performance Trie and Ahocorasick automata (AC automata) Keyword Match & Replace Tool for python. Correct case insensitive implementation!
Language: Cython 95 5 1315
docwire/docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
Language: C++ 94 8 8424
dav009/flash
Golang Keyword extraction/replacement Datastructure using Tries instead of regexes
Language: Go 89 3 06
