intro-to-scrapy

Introduction to web scraping using the `scrapy` module.

This repo provides examples of how to use the `scrapy` module to scrape data from the web. Useful things covered are:

  • A concrete example (with code!) of how to scrape baby names from huggies.co.nz (a minimal spider sketch is shown after this list)
  • Lecture slides explaining how the code works
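
To give a flavour of what the spider code looks like, here is a minimal sketch. The spider name matches the `scrapy crawl huggies` command used later in this README, but the start URL and CSS selector are illustrative assumptions rather than the repo's actual code:

import scrapy


class HuggiesSpider(scrapy.Spider):
    # The name used with `scrapy crawl huggies`
    name = "huggies"
    # Assumed entry page; the real spider may start elsewhere
    start_urls = ["https://www.huggies.co.nz/baby-names"]

    def parse(self, response):
        # Yield one item per name found on the page (the CSS selector is a placeholder)
        for name in response.css("li.baby-name::text").getall():
            yield {"name": name.strip()}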

I wanna see the slides

The slides are available online here: https://gorbachev.io/site/dragonfly-science/intro-to-scrapy

Open code

You are welcome to fork this repository and use it as a basis for your own projects. Text content and code on this website are copyright Dragonfly Limited but are licensed for re-use under a Creative Commons Attribution 4.0 International licence (see LICENSE for terms and conditions).

Please note that this licence does not apply to any logos, emblems and trade marks on the website or to the website’s design elements or to any photography, imagery, or publications.

Copyright of those specific items may not be held by Dragonfly Limited. Unless indicated otherwise, those specific items may not be re-used without express permission.

How to run this repo on Linux or Mac

This repo is set up to run in a reproducible way using make and docker.

If you have make and docker installed, running:

make docker

make slides

will build the docker image you need locally, scrape huggies.co.nz to create babynames/babynames.csv if it does not already exist, and then build the slides as intro-to-scrapy.html.

Note, however, that installing make and docker is only straightforward on a Unix system (i.e. Linux or Mac).

How to run this repo on Windows

The automation in this repo does not work on Windows, but you can still build the contents of the repo by doing the following:

You can install scrapy with:

pip install scrapy

(As usual, running Python code in a virtualenv is recommended, but the only Python dependency of this repo is scrapy, so use your own judgement.)
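
If you do decide to use a virtualenv, a typical sequence on Windows looks something like this (assuming a standard Python 3 install on your PATH; these commands are a general sketch, not part of the repo's own instructions):

python -m venv .venv

.venv\Scripts\activate

pip install scrapy

(From PowerShell, activate with .venv\Scripts\Activate.ps1 instead.)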

Once you have scrapy installed, run the following from the babynames directory:

scrapy crawl huggies -o babynames.csv 

This will crawl huggies.co.nz to create the babynames.csv file.
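
If you want a quick sanity check of the output, a small Python sketch like the following will print the first few rows (it assumes babynames.csv is in the current directory; the column layout is whatever the spider yields, so none is assumed here):

import csv

# Print the header plus the first five rows of the scraped file
with open("babynames.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 5:
            break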

If you also want to render the slides, then you need R installed as well as the following packages:

  • tidyverse
  • rmarkdown
  • kableExtra
  • reticulate

You can install these with:

install.packages(c('tidyverse', 'rmarkdown', 'kableExtra', 'reticulate'))

Once you have these installed, you are ready to run:

rmarkdown::render("intro-to-scrapy.Rmd")

from the R console, or open intro-to-scrapy.Rmd in RStudio and knit the slides there.

Any issues?

You are welcome to report any issues with the repo through GitHub. If you really enjoy fixing other people's stuff, pull requests would also be appreciated!