Awesome Scrapy

A curated list of awesome packages, articles, and other cool resources from the Scrapy community. Scrapy is a fast, high-level web crawling and scraping framework for Python.


Apps

Visual Web Scraping

  • Portia Visual scraping for Scrapy

Distributed Spider

Scrapy Service

  • scrapyscript Run a Scrapy spider programmatically from a script or a Celery task - no project required.

  • scrapyd A service daemon to run Scrapy spiders

  • scrapyd-client Command line client for Scrapyd server

  • python-scrapyd-api A Python wrapper for working with Scrapyd's API.

  • SpiderKeeper A scalable admin UI for spider services.

  • scrapyrt An HTTP server that provides an API for scheduling Scrapy spiders and making requests with them.
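The tools above share a common deployment workflow: point a project's `scrapy.cfg` at a running Scrapyd instance, then run `scrapyd-deploy` (from scrapyd-client) to package and upload the project. A minimal sketch, assuming Scrapyd listens on its default port and `myproject` is a placeholder project name:

```
[deploy]
url = http://localhost:6800/
project = myproject
```

Once deployed, spiders can be scheduled over Scrapyd's HTTP API (for example its schedule.json endpoint), which is what wrappers like python-scrapyd-api call under the hood.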

Front-End Scrapy Managers

  • Gerapy Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js

  • SpiderKeeper Admin UI for Scrapy; an open-source take on Scrapinghub.

  • ScrapydWeb Scrapyd cluster management, Scrapy log analysis & visualization, Basic auth, Auto packaging, Timer Tasks, Email notice, and Mobile UI.

Monitor

  • scrapy-sentry Logs Scrapy exceptions into Sentry

  • scrapy-statsd-middleware Statsd integration middleware for scrapy

  • scrapy-jsonrpc An extension to control a running Scrapy web crawler via JSON-RPC

  • scrapy-fieldstats A Scrapy extension to log items coverage when the spider shuts down

  • spidermon Extension which provides useful tools for data validation, stats monitoring, and notification messages.
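To make the "items coverage" idea behind scrapy-fieldstats concrete, here is a hedged, dependency-free sketch of counting how often each item field is populated. The helper name and sample items are made up; the real extension gathers equivalent statistics from scraped items via Scrapy's signals and logs them when the spider closes.

```python
from collections import Counter

def field_coverage(items):
    """Hypothetical helper: fraction of items in which each field is non-empty.

    Illustrates the kind of report scrapy-fieldstats produces; it is not
    that package's actual code.
    """
    counts = Counter()
    for item in items:
        for field, value in item.items():
            if value not in (None, "", [], {}):
                counts[field] += 1
    total = len(items)
    return {field: n / total for field, n in counts.items()}

items = [
    {"title": "A", "price": 10},
    {"title": "B", "price": None},
    {"title": "", "price": 5},
]
print(field_coverage(items))  # each field is populated in 2 of 3 items
```

A report like this makes silent extraction regressions visible: a selector that breaks mid-crawl shows up as a field whose coverage drops toward zero.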

Avoiding Bans

  • HttpProxyMiddleware A Scrapy middleware that switches the HTTP proxy periodically.

  • scrapy-proxies Processes Scrapy requests using a random proxy from a list to avoid IP bans and improve crawling speed.

  • scrapy-rotating-proxies Use multiple proxies with Scrapy

  • scrapy-random-useragent Scrapy Middleware to set a random User-Agent for every Request.

  • scrapy-fake-useragent Random User-Agent middleware based on fake-useragent

  • scrapy-crawlera Crawlera routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains, or have other problems.
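The middlewares above all hook into the same Scrapy extension point: a downloader middleware's `process_request` can rewrite request headers or set `request.meta['proxy']` before the request goes out. A minimal, dependency-free sketch of the idea; the class and the stub request object are illustrative, not any listed package's actual code (real packages such as scrapy-rotating-proxies add ban detection and smarter rotation):

```python
import random

class RotatingRequestMiddleware:
    """Sketch of a downloader middleware that assigns a random
    User-Agent and proxy to every outgoing request."""

    def __init__(self, user_agents, proxies):
        self.user_agents = user_agents
        self.proxies = proxies

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to continue processing the request

# Stub standing in for scrapy.Request so the sketch runs on its own.
class FakeRequest:
    def __init__(self):
        self.headers = {}
        self.meta = {}

mw = RotatingRequestMiddleware(["ua-1", "ua-2"], ["http://proxy-a:8080"])
req = FakeRequest()
mw.process_request(req, spider=None)
print(req.headers["User-Agent"], req.meta["proxy"])
```

In a real project the middleware would be enabled via the `DOWNLOADER_MIDDLEWARES` setting and the pools loaded from configuration rather than hard-coded.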

Data Processing

Process Javascript

Other Useful Extensions

  • scrapy-djangoitem Scrapy extension to write scraped items using Django models

  • scrapy-deltafetch Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls

  • scrapy-crawl-once A Scrapy middleware that avoids re-crawling pages already downloaded in previous crawls.

  • scrapy-magicfields Scrapy middleware to add extra fields to items, like timestamp, response fields, spider attributes etc.

  • scrapy-pagestorage A scrapy extension to store requests and responses information in storage service.

  • itemloaders Library to populate items using XPath and CSS with a convenient API.

  • itemadapter Adapter which provides a common interface to handle objects of different types in a uniform manner.

  • scrapy-poet Page Object pattern implementation which enables writing reusable and portable extraction and crawling code.
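scrapy-deltafetch and scrapy-crawl-once rest on the same idea: fingerprint each request and drop it if the fingerprint was recorded by an earlier crawl. A toy, in-memory sketch of that idea; the class is hypothetical, and the real packages persist fingerprints to disk and plug in as Scrapy middlewares rather than as a standalone store:

```python
import hashlib

class CrawlOnceSketch:
    """In-memory stand-in for the persisted fingerprint store that
    scrapy-crawl-once / scrapy-deltafetch keep between crawls."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, url):
        # A stable digest of the URL serves as the request fingerprint here.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def should_crawl(self, url):
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False  # already downloaded in a "previous crawl"
        self.seen.add(fp)
        return True

store = CrawlOnceSketch()
print(store.should_crawl("https://example.com/page/1"))  # True: first visit
print(store.should_crawl("https://example.com/page/1"))  # False: duplicate
```

Persisting the `seen` set between runs is what turns this from ordinary in-run deduplication into incremental crawling.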

Resources

Articles

Exercises

Videos

Books