/awesome-web-scraper

A collection of awesome web scrapers and crawlers.


Awesome Web Scraper


Java

  • Apache Nutch - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
  • websphinx - Website-Specific Processors for HTML INformation eXtraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • crawler4j - Open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.

C/C++

  • HTTrack - Website copier and offline browser utility. Downloads a website to a local directory for offline viewing.

C#

  • ccrawler - Built with C# 3.5. It includes a simple web content categorizer extension, which can separate web pages by their content.

Erlang

  • ebot - Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), webmachine and mochiweb.

Python

  • scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
  • gdom - gdom, DOM Traversing and Scraping using GraphQL.

PHP

  • Goutte - Goutte, a simple PHP Web Scraper.
  • DiDOM - Simple and fast HTML parser.
  • simple_html_dom - Just a Simple HTML DOM library fork.
  • PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.

Node.js

  • Phantomjs - Scriptable Headless WebKit.
  • node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
  • node-simplecrawler - Flexible event driven crawler for node.
  • spider - Programmable spidering of web sites with node.js and jQuery.
  • slimerjs - A PhantomJS-like tool running Gecko.
  • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
  • zombie - Insanely fast, full-stack, headless browser testing using node.js.
  • nightmare - Nightmare is a high-level wrapper for PhantomJS that lets you automate browser tasks.
  • jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with node.js.
  • xray - The next web scraper. See through the <html> noise.

Ruby

  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

Go

  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
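Several crawlers in this list describe themselves as "polite", meaning they honor robots.txt rules and crawl delays. As a rough illustration of that idea (not taken from any project listed here), the sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are made-up assumptions, inlined so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" and the URLs are placeholders, not real endpoints.
allowed_public = parser.can_fetch("example-bot", "https://example.com/index.html")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/data.html")

# Seconds to wait between requests, or None if no Crawl-delay applies.
delay = parser.crawl_delay("example-bot")
```

A crawler would typically call `can_fetch` before each request and sleep for `delay` seconds between requests to the same host.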

License

MIT

Contributing

Please read the Contribution Guidelines before submitting your suggestion.

Feel free to open an issue or create a pull request with your additions.