web-crawler

There are 1,113 repositories under the web-crawler topic.

  • firecrawl/firecrawl

    🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data

    Language: TypeScript · Stars: 66.9k
  • ScrapeGraphAI/Scrapegraph-ai

    Python scraper based on AI

    Language: Python · Stars: 21.7k
  • apify/crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Language: TypeScript · Stars: 20.5k
  • crawlab-team/crawlab

    Distributed web crawler admin platform for managing spiders, regardless of language or framework.

    Language: Go · Stars: 12k
  • ssssssss-team/spider-flow

    A next-generation crawler platform that defines crawler workflows graphically, so a crawler can be built without writing code.

    Language: Java · Stars: 11.1k
  • apify/crawlee-python

    Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Language: Python · Stars: 7.1k
  • BruceDone/awesome-crawler

    A collection of awesome web crawlers and spiders in different languages

  • adithya-s-k/omniparse

    Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

    Language: Python · Stars: 6.7k
  • firecrawl/firecrawl-mcp-server

    🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

    Language: JavaScript · Stars: 4.9k
  • apache/nutch

    Apache Nutch is an extensible and scalable web crawler

    Language: Java · Stars: 3.1k
  • jasonxtn/Argus

    The Ultimate Information Gathering Toolkit

    Language: Python · Stars: 2.4k
  • sjdirect/abot

    Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

    Language: C# · Stars: 2.3k
  • xianhu/PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

    Language: Python · Stars: 1.8k
  • MarginaliaSearch/MarginaliaSearch

    Internet search engine for text-oriented websites. Indexing the small, old and weird web.

    Language: HTML · Stars: 1.6k
  • oxylabs/ai-crawler-py

    Crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.

  • JustinBeckwith/linkinator

    Broken link checker that crawls websites and validates links. Find broken links, dead links, and invalid URLs in websites, documentation, and local files. Perfect for SEO audits and CI/CD.

    Language: TypeScript · Stars: 1.1k
  • gildas-lormeau/single-file-cli

    CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

    Language: JavaScript · Stars: 1k
  • Algebra-FUN/WeReadScan

    A crawler that scans the books you have purchased on WeRead (微信读书) and downloads them as local PDFs

    Language: Python · Stars: 976
  • apache/stormcrawler

    A scalable, mature and versatile web crawler based on Apache Storm

    Language: Java · Stars: 946
  • webrecorder/browsertrix-crawler

    Run a high-fidelity browser-based web archiving crawler in a single Docker container

    Language: TypeScript · Stars: 910
  • postmodern/spidr

    A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

    Language: Ruby · Stars: 827
  • scrapfly/scrapfly-scrapers

    Scalable Python web scraping scripts for 40+ popular domains

    Language: Python · Stars: 746
  • cxcscmu/Craw4LLM

    Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

    Language: Python · Stars: 642
  • firecrawl/firecrawl-app-examples

    🔥 This repository contains complete application examples, including websites and other projects, developed using Firecrawl.

    Language: Jupyter Notebook · Stars: 594
  • devflowinc/firecrawl-simple

    ➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.

    Language: TypeScript · Stars: 536
  • VIDA-NYU/ache

    ACHE is a web crawler for domain-specific search.

    Language: Java · Stars: 475
  • hyunwoongko/kochat

    Open-source Korean chatbot framework

    Language: Python · Stars: 457
  • lefterisloukas/edgar-crawler

    The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files. Presented at WWW 2025 @ Sydney, Australia (https://dl.acm.org/doi/10.1145/3701716.3715289)

    Language: Python · Stars: 452
  • USCDataScience/sparkler

    Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

    Language: Java · Stars: 418
  • brendonboshell/supercrawler

    A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

    Language: JavaScript · Stars: 380
  • graphlit/graphlit-mcp-server

    Model Context Protocol (MCP) Server for Graphlit Platform

    Language: TypeScript · Stars: 369
  • crwlrsoft/crawler

    Library for Rapid (Web) Crawler and Scraper Development

    Language: PHP · Stars: 366
  • commoncrawl/news-crawl

    News crawling with StormCrawler - stores content as WARC

    Language: Java · Stars: 358
  • lewisdonovan/google-news-scraper

    Lightweight scraper for Google News

    Language: TypeScript · Stars: 348
  • rivermont/spidy

    The simple, easy to use command line web crawler.

    Language: Python · Stars: 348
  • internetarchive/Zeno

    State-of-the-art web crawler 🔱

    Language: Go · Stars: 345