docker-scrapy-dev

A simple dev environment for running Scrapy spiders in a shell

Build

docker build -t theshowthing/scrapy-dev .

Run

Run as an interactive shell/environment that includes files from your present working directory:

docker run -it --rm -v $(pwd):/code -w /code -e "SUBMIT_SHOWBILLS=0" -e "PYTHONDONTWRITEBYTECODE=1" theshowthing/scrapy-dev bash

or, on Windows 10 in PowerShell:

docker run -it --rm -v ${PWD}:/code -w /code -e "SUBMIT_SHOWBILLS=0" -e "PYTHONDONTWRITEBYTECODE=1" theshowthing/scrapy-dev bash

  • -it gives you an interactive terminal (keeps STDIN open and allocates a TTY)
  • --rm automatically removes the container after you exit it
  • -w sets the working directory inside the container
  • -v bind-mounts your present working directory to /code inside the container
  • -e "SUBMIT_SHOWBILLS=0" sets an environment variable used by the Spiders repo; if set to 0, no showbills will be submitted to Engine (see the pipeline sketch after this list)
  • -e "PYTHONDONTWRITEBYTECODE=1" sets an environment variable used by Python that prevents it from generating .pyc files (which can get in the way of development)
  • theshowthing/scrapy-dev is the tag of the Docker image we are running
  • bash is the command run inside the container, giving you a shell for executing commands :)
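
For illustration, a pipeline could gate Engine submission on SUBMIT_SHOWBILLS roughly like this (a hypothetical sketch; the real pipeline and Engine client code live in the Spiders repo and will differ):

import os

class ShowbillPipeline(object):
    # Hypothetical sketch: how a pipeline might gate Engine submission
    # on the SUBMIT_SHOWBILLS environment variable.
    def process_item(self, item, spider):
        if os.environ.get("SUBMIT_SHOWBILLS", "0") == "1":
            # real code would submit the showbill to Engine here
            spider.logger.info("Would submit showbill: %r", item)
        return item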

Useful dev commands once you're in

scrapy parse http://first-avenue.com/event/2016/07/transmission-princetribute -c parse_show --pipelines

This runs a particular URL through Scrapy (it attempts to auto-detect which spider to use). The -c argument tells it which callback method on the spider to run; since this is a multi-level spider and we are feeding it a lower-level URL, we point it at parse_show. --pipelines tells it to run the resulting items through the item pipelines, so we can see the logged JSON content or, depending on the SUBMIT_SHOWBILLS environment variable described above, send the result to Engine as a test case.
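
To make the -c flag concrete, here is a rough sketch of what a multi-level spider looks like (hypothetical structure, URLs, and selectors; the real firstave.py in the Spiders repo will differ):

import scrapy

class FirstAveSpider(scrapy.Spider):
    name = "firstave"
    start_urls = ["http://first-avenue.com/calendar"]  # assumed listing URL

    def parse(self, response):
        # Top level: follow each event link down to its show page.
        for href in response.css("a.event-link::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_show)

    def parse_show(self, response):
        # Lower level: the callback you name with `-c parse_show`.
        yield {
            "title": response.css("h1::text").extract_first(),
            "url": response.url,
        }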

scrapy crawl firstave

This runs the specified spider (here firstave, as defined in firstave.py in the Spiders repo), including its pipelines.

scrapy shell http://first-avenue.com/event/2016/07/transmission-princetribute

Opens up the Scrapy shell against the given URL. You can run commands like response.css('div.whatever::text').extract() in the shell and immediately see what is returned. Very useful for figuring out the proper selectors and transformations during development.
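
For example (the selectors here are made up; substitute whatever matches the page you loaded):

response.css('h1.title::text').extract_first()   # first matching text node
response.xpath('//time/@datetime').extract()     # all datetime attributes
view(response)                                   # open the fetched page in a browser
fetch('http://first-avenue.com/')                # load a different URL in the same shell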

scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json

This is just a simple example spider included with the container, to verify that Scrapy itself is working.
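
A standalone spider of this shape (one file, runnable without a project) looks roughly like the following; this is a sketch, and the bundled stackoverflow_spider.py may differ in its details:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = "stackoverflow"
    start_urls = ["http://stackoverflow.com/questions?sort=votes"]

    def parse(self, response):
        # Scrape the question list; -o writes the yielded dicts to JSON.
        for q in response.css("div.question-summary"):
            yield {
                "title": q.css("h3 a::text").extract_first(),
                "votes": q.css(".vote-count-post::text").extract_first(),
            }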