web-crawler

There are 1,117 repositories under the web-crawler topic.

firecrawl/firecrawl
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
Language: TypeScript 67k 257 7535.2k
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
Language: Python 21.7k 135 4151.9k
apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: TypeScript 20.5k 124 1k1.1k
crawlab-team/crawlab
Distributed web crawler admin platform for managing spiders regardless of language or framework.
Language: Go 12k 216 9951.9k
ssssssss-team/spider-flow
A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.
Language: Java 11.1k 98 442.1k
apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: Python 7.1k 38 474513
BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
7k 201 19733
adithya-s-k/omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
Language: Python 6.7k 42 90530
firecrawl/firecrawl-mcp-server
🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.
Language: JavaScript 4.9k 27 58525
apache/nutch
Apache Nutch is an extensible and scalable web crawler
Language: Java 3.1k 229 01.3k
jasonxtn/Argus
The Ultimate Information Gathering Toolkit
Language: Python 2.4k 38 24267
sjdirect/abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Language: C# 2.3k 196 184558
xianhu/PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Language: Python 1.8k 112 31501
MarginaliaSearch/MarginaliaSearch
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
Language: HTML 1.6k 6 15342
oxylabs/ai-crawler-py
Crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.
1.2k4
JustinBeckwith/linkinator
Broken link checker that crawls websites and validates links. Find broken links, dead links, and invalid URLs in websites, documentation, and local files. Perfect for SEO audits and CI/CD.
Language: TypeScript 1.1k 7 14990
gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
Language: JavaScript 1k 10 14299
Algebra-FUN/WeReadScan
A crawler that scans the books you have purchased on WeRead (微信读书) and downloads them as local PDFs
Language: Python 975 11 30169
apache/stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
Language: Java 946 62 860268
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Language: TypeScript 911 22 408120
postmodern/spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Language: Ruby 827 23 65109
scrapfly/scrapfly-scrapers
Scalable Python web scraping scripts for +40 popular domains
Language: Python 747 15 22161
cxcscmu/Craw4LLM
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
Language: Python 642 4 1058
firecrawl/firecrawl-app-examples
🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.
Language: Jupyter Notebook 595 4 0181
devflowinc/firecrawl-simple
➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.
Language: TypeScript 536 1 2247
VIDA-NYU/ache
ACHE is a web crawler for domain-specific search.
Language: Java 475 33 144135
hyunwoongko/kochat
Opensource Korean chatbot framework
Language: Python 457 19 26184
lefterisloukas/edgar-crawler
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files. Presented at WWW 2025 @ Sydney, Australia (https://dl.acm.org/doi/10.1145/3701716.3715289)
Language: Python 452 21 26115
USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Language: Java 418 43 153139
brendonboshell/supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
Language: JavaScript 380 10 2863
graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
Language: TypeScript 369 3 149
crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
Language: PHP 366 4 2213
commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
Language: Java 358 31 5639
lewisdonovan/google-news-scraper
Lightweight scraper for Google News
Language: TypeScript 348 9 4268
rivermont/spidy
The simple, easy to use command line web crawler.
Language: Python 348 21 3769
internetarchive/Zeno
State-of-the-art web crawler 🔱
Language: Go 345 8 12247