Scrapy Project Cookiecutter Template

A bare minimum cookiecutter template for a Scrapy project, ready for Scrapinghub's Scrapy Cloud service.

What is included:

  • Default stack is Scrapy 2.0.1 and Python 3.8.
  • Standard project files generated by scrapy startproject.
  • A default requirements.txt managed by pip-compile.
  • A default scrapinghub.yml, the Scrapy Cloud configuration file (see the sketch after this list).
  • A default Dockerfile to build a Docker image that replicates the Scrapy Cloud running container.
  • Useful scripts to archive job items and dump items from a Collection.
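
For reference, the generated scrapinghub.yml is the configuration file read by shub when deploying. The exact contents depend on the template and on your answers to cookiecutter; the project ID and stack below are only illustrative:

$ cat scrapinghub.yml
project: 12345
requirements:
  file: requirements.txt
stacks:
  default: scrapy:2.0

The scrapycloud_id entered during generation typically ends up in the project field of this file.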

Usage

  1. Install cookiecutter if you do not have it already: pip install cookiecutter
  2. Generate a new project: cookiecutter gh:krectra/cookiecutter-scrapycloud

Example

We will show how to create a Scrapy project and deploy it to Scrapy Cloud. You will need a Scrapy Cloud account and a Scrapy Cloud project already created. In this example, we use macOS as the development machine.

First we bootstrap our project and create a spider:

$ cookiecutter gh:rolando/cookiecutter-scrapycloud
project_name [Project Name]: myproject
project_slug [myproject]:
project_module [myproject]:
scrapycloud_id [Scrapy Cloud Project ID]: 12345
$ cd myproject
$ pip install -r requirements.txt -r dev-requirements.txt
$ cat > myproject/spiders/myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Follow links to the blog's category pages.
        for url in response.css('ul li a::attr("href")').re('.*/category/.*'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        # Extract the post titles listed on each category page.
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}

EOF

Now we can run the spider on our host:

$ scrapy crawl blogspider
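
If you want to inspect the scraped items, Scrapy's built-in feed exports can write them to a file, for example as JSON lines:

$ scrapy crawl blogspider -o items.jl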

We can also build a Docker image to replicate the Scrapy Cloud container:

$ docker build -t myproject .
$ docker run -it myproject scrapy crawl blogspider
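
The same image is also handy for debugging selectors interactively. Assuming the image accepts arbitrary scrapy commands, as the crawl above does, you can open a Scrapy shell inside it:

$ docker run -it myproject scrapy shell 'https://blog.scrapinghub.com'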

Finally, we can deploy to our Scrapy Cloud project and schedule the spider:

$ shub deploy
$ shub schedule blogspider
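
Once the job finishes, its output can be fetched from the command line as well. shub schedule prints the ID of the newly created job; the ID below is only illustrative:

$ shub items 12345/1/1
$ shub log 12345/1/1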

Scripts

  • bin/archive-items.py: A full-featured script to export the output of all jobs to a given collection. It can be set up as a Periodic Job on Scrapy Cloud.
  • bin/dump-collection.py: A simple script to dump a collection's items (see the sketch after this list).
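
Both scripts work with Scrapy Cloud collections. If you need your own variant, a minimal sketch using the python-scrapinghub client might look like this; the API key, project ID, collection name, and file name are placeholders:

$ cat > dump_example.py <<EOF
from scrapinghub import ScrapinghubClient

# Placeholder API key; see the python-scrapinghub docs for other auth options.
client = ScrapinghubClient('YOUR_API_KEY')
project = client.get_project(12345)
store = project.collections.get_store('my_collection')

# Iterate over every item stored in the collection and print it.
for item in store.iter():
    print(item)
EOF
$ python dump_example.py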

License

This project is licensed under the terms of the MIT License.