web-crawler

There are 965 repositories under the web-crawler topic.

  • mendableai/firecrawl

    🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

    Language:TypeScript33.5k1775422.9k
  • apify/crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Language:TypeScript17.3k108935781
  • crawlab-team/crawlab

    Distributed web crawler admin platform for managing spiders, regardless of language or framework.

    Language:Go11.7k2149581.8k
  • ssssssss-team/spider-flow

    A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.

    Language:Java9.9k93431.9k
  • BruceDone/awesome-crawler

    A collection of awesome web crawlers and spiders in different languages

  • adithya-s-k/omniparse

    Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

    Language:Python6.4k4286519
  • apify/crawlee-python

    Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Language:Python5.5k34369367
  • apache/nutch

    Apache Nutch is an extensible and scalable web crawler

    Language:Java3k23401.3k
  • sjdirect/abot

    Cross-platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

    Language:C#2.3k197184561
  • mendableai/firecrawl-mcp-server

    Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

    Language:JavaScript2k1416177
  • jasonxtn/Argus

    The Ultimate Information Gathering Toolkit

    Language:Python1.9k3121208
  • xianhu/PSpider

    A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

    Language:Python1.8k11331502
  • MarginaliaSearch/MarginaliaSearch

    Internet search engine for text-oriented websites. Indexing the small, old and weird web.

    Language:HTML1.3k99230
  • Algebra-FUN/WeReadScan

    A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs

    Language:Python9381130163
  • apache/incubator-stormcrawler

    A scalable, mature and versatile web crawler based on Apache Storm

    Language:Java90464839262
  • platonai/PulsarRPA

    Automate webpages at scale, scrape web data completely and accurately with high performance, distributed AI-RPA.

    Language:Kotlin8302070122
  • postmodern/spidr

    A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

    Language:Ruby8152764108
  • gildas-lormeau/single-file-cli

    CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

    Language:JavaScript7541112974
  • webrecorder/browsertrix-crawler

    Run a high-fidelity browser-based web archiving crawler in a single Docker container

    Language:TypeScript7292337098
  • cxcscmu/Craw4LLM

    Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"

    Language:Python6014953
  • VIDA-NYU/ache

    ACHE is a web crawler for domain-specific search.

    Language:Java46434144135
  • scrapfly/scrapfly-scrapers

    Scalable Python web scraping scripts for 40+ popular domains

    Language:Python4591016115
  • hyunwoongko/kochat

    Open-source Korean chatbot framework

    Language:Python4542026186
  • USCDataScience/sparkler

    Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

    Language:Java41343153140
  • devflowinc/firecrawl-simple

    ➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.

    Language:TypeScript40111630
  • brendonboshell/supercrawler

    A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

    Language:JavaScript380102861
  • crwlrsoft/crawler

    Library for Rapid (Web) Crawler and Scraper Development

    Language:PHP36141913
  • lefterisloukas/edgar-crawler

    The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files.

    Language:Python3572023101
  • rivermont/spidy

    The simple, easy-to-use command-line web crawler.

    Language:Python346223769
  • commoncrawl/news-crawl

    News crawling with StormCrawler - stores content as WARC

    Language:Java339325636
  • infinilabs/crawler

    🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)

    Language:Go308253282
  • lewisdonovan/google-news-scraper

    Lightweight scraper for Google News

    Language:TypeScript30594266
  • s0rg/crawley

    The unix-way web crawler

    Language:Go2902816
  • yields/ant

    A web crawler for Go

    Language:Go2785417
  • microfisher/Strong-Web-Crawler

    An advanced web crawler based on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page's DOM structure.

    Language:C#275391150
  • duyet/awesome-web-scraper

    A collection of awesome web scrapers and crawlers.