web-crawler
There are 1,113 repositories under the web-crawler topic.
firecrawl/firecrawl
🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data
ScrapeGraphAI/Scrapegraph-ai
Python scraper based on AI
apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
crawlab-team/crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
ssssssss-team/spider-flow
A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.
apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
adithya-s-k/omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
firecrawl/firecrawl-mcp-server
🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.
apache/nutch
Apache Nutch is an extensible and scalable web crawler
jasonxtn/Argus
The Ultimate Information Gathering Toolkit
sjdirect/abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
xianhu/PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
MarginaliaSearch/MarginaliaSearch
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
oxylabs/ai-crawler-py
Crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.
JustinBeckwith/linkinator
Broken link checker that crawls websites and validates links. Find broken links, dead links, and invalid URLs in websites, documentation, and local files. Perfect for SEO audits and CI/CD.
gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
Algebra-FUN/WeReadScan
A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs
apache/stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
postmodern/spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
scrapfly/scrapfly-scrapers
Scalable Python web scraping scripts for 40+ popular domains
cxcscmu/Craw4LLM
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
firecrawl/firecrawl-app-examples
🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.
devflowinc/firecrawl-simple
➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.
VIDA-NYU/ache
ACHE is a web crawler for domain-specific search.
hyunwoongko/kochat
Open-source Korean chatbot framework
lefterisloukas/edgar-crawler
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files. Presented at WWW 2025 @ Sydney, Australia (https://dl.acm.org/doi/10.1145/3701716.3715289)
USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
brendonboshell/supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
lewisdonovan/google-news-scraper
Lightweight scraper for Google News
rivermont/spidy
The simple, easy-to-use command-line web crawler.
internetarchive/Zeno
State-of-the-art web crawler 🔱