PyGotham 2016 Talk Proposal
Have you ever wanted to grab data from websites and automatically organize it into a structured list? Or maybe you want to skip registering for API keys and pull data straight out of a web page? This is an introduction to web scraping: we will build bots with Scrapy that crawl a few sample web pages and extract the information we want. Prior knowledge is not required; we'll break down the steps of creating your own bot, and before you know it you'll be scraping the web.
This is a live-coding session and everyone is welcome to code along. We will install Scrapy and use Sublime Text to edit and write our code. Bring your laptops! After a little introduction, we will start building a bot that scrapes all the species of pine trees from a website into an organized dataset. You’ll have a whole collection of conifers by the end that can be accessed and analyzed in a structured JSON or CSV file.
Some prior programming experience is recommended; the concepts in this session will make more sense if you have written code before. You will get the most out of it if you review the code and build another web crawler afterwards. The material includes object-oriented programming, parsing HTML with XPath, HTML and CSS structure, and exporting CSV and JSON files.
- Web scraping is a technique used for extracting information from websites. Its main goal is to transform unstructured content from the web, usually in an HTML format, into a structured dataset that can be saved and examined in a spreadsheet or database.
- Examples: human copy-and-paste, UNIX grep paired with regex, HTTP requests, computer vision web analyzers, or web-scraping software
- Aggregating prices of video games: putting together a list of prices for products that you are interested in is a thrifty way to find the best deals
- Grabbing the daily weather: researchers can integrate weather data into their observations without measuring the weather with hardware tools
- Acquiring a list of conifers: this is the information we will be extracting, which is a list of known conifers in the world
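- The core idea can be shown in a few lines of plain Python with lxml (one of the requirements listed next). This is only a rough sketch: it uses the conifer page and the td class we will target with Scrapy later, and assumes that page structure has not changed.
# Rough non-Scrapy sketch of web scraping: fetch a page and pull text out with XPath
from lxml import html

# lxml can parse straight from a URL; this is the conifer list we scrape later
tree = html.parse("http://www.greatplantpicks.org/plantlists/by_plant_type/conifer")
common_names = tree.xpath('//td[@class="common-name"]/a/text()')
print(common_names[:5])  # first few common names, now a plain Python list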
- Requirements:
- Python 2.7 – Scrapy does not fully support Python 3 at the moment, and installation under 2.7 is the most stable
- pip – Python package management system
- lxml – Most Linux distributions already have lxml installed
- OpenSSL – Comes preinstalled on all operating systems except Windows
$ pip install Scrapy
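- A quick way to confirm the install worked (assuming pip put the scrapy command on your PATH):
$ scrapy version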
- “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival” (scrapy.org).
$ scrapy <command> -h
- Global commands:
- startproject
- settings
- runspider
- shell
- fetch
- view
- version
- Project-only commands:
- crawl
- check
- list
- edit
- parse
- genspider
- bench
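- Two of the global commands are handy for a first look at any page: fetch prints the HTML Scrapy downloads, and view opens that downloaded copy in your browser so you can see a page the way the bot sees it. For example:
$ scrapy fetch --nolog http://example.com > example.html
$ scrapy view http://example.com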
tutorial/
    scrapy.cfg        # project configuration file (marks the project root)
    tutorial/         # Python module where the project code lives
        __init__.py
        items.py      # defines item objects for structured data
        pipelines.py  # performs actions over item objects
        settings.py   # allows for further component customization
        spiders/      # directory that will hold your spiders
            __init__.py
- Go to a directory you prefer
- Create a new scrapy project
$ scrapy startproject conifers
- Check the website with conifers again: http://www.greatplantpicks.org/plantlists/by_plant_type/conifer
- Notice the names and scientific names? We'll extract those.
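- Before writing any code, scrapy shell lets you poke at the page and try XPath expressions interactively. A quick sketch (the td class here is the one our spider targets below):
$ scrapy shell "http://www.greatplantpicks.org/plantlists/by_plant_type/conifer"
>>> response.xpath('//td[@class="common-name"]/a/text()').extract()[:3]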
- Open up items.py
- We will add name, genus, and species as fields to our item
import scrapy

class ConifersItem(scrapy.Item):
    # Each field holds one piece of data scraped from a table row
    name = scrapy.Field()
    genus = scrapy.Field()
    species = scrapy.Field()
- Open up the spiders directory; it contains no spiders at the moment
- We will add a spider now
- Create a new file and name it conifers_spider.py inside the spiders directory
- Let's save a local copy of the web page first
import scrapy
from conifers.items import ConifersItem

class ConifersSpider(scrapy.Spider):
    name = "conifers"
    allowed_domains = ["greatplantpicks.org"]
    start_urls = [
        "http://www.greatplantpicks.org/plantlists/by_plant_type/conifer"]

    def parse(self, response):
        # Save a local copy of the page so we can inspect its HTML
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
- Save conifers_spider.py
- Now, go back to the project root directory
- Run your bot
$ scrapy crawl conifers
- You should now see by_plant_type.html in your directory
- Go back to conifers_spider.py and comment out the function parse
- We want to retrieve only the common names and scientific names
- To do this, we create a ConifersItem for each table row and yield each item as it is built
- Add this new parse function with the old still commented
def parse(self, response):
    # Each <tr> in the table is one conifer; build an item per row
    for sel in response.xpath('//tbody/tr'):
        item = ConifersItem()
        item['name'] = sel.xpath('td[@class="common-name"]/a/text()').extract()
        item['genus'] = sel.xpath('td[@class="plantname"]/a/span[@class="genus"]/text()').extract()
        item['species'] = sel.xpath('td[@class="plantname"]/a/span[@class="species"]/text()').extract()
        yield item
- Go back to the project root directory and run the bot
- Let's export this as a JSON file first
$ scrapy crawl conifers -o trees_json.json
- Great, now let's export it as a csv!
$ scrapy crawl conifers -o trees_csv.csv
- Now you have extracted all the conifers. Happy trails!
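- The exported file is ready for analysis. A minimal sketch of reading it back in Python (assumes the JSON file name used above):
import json

# Scrapy's JSON exporter writes a list of objects, one per scraped item
with open('trees_json.json') as f:
    conifers = json.load(f)

print(len(conifers))   # number of conifers scraped
print(conifers[0])     # first item: name, genus, species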
Example of a Scrapy bot: [dahlia](https://github.com/Zovfreullia/intro_to_scrapy/tree/master/dahlia)
- I made a bot that extracted seed names and product identification numbers from Johnny Seeds
- Download the bot by cloning the repository
$ git clone https://github.com/Zovfreullia/intro_to_scrapy.git
- Go into the dahlia directory
$ cd intro_to_scrapy/dahlia
- Run the spider
$ scrapy crawl dahlia
- settings.py is left at its defaults
BOT_NAME = 'dahlia'
SPIDER_MODULES = ['dahlia.spiders']
NEWSPIDER_MODULE = 'dahlia.spiders'
- items.py defines the fields for our items
import scrapy

class DahliaItem(scrapy.Item):
    name = scrapy.Field()
    extendedName = scrapy.Field()
    identification = scrapy.Field()
    description = scrapy.Field()
- dahlia_spider.py crawls the listing page and fills in the fields defined in items.py
import scrapy
from dahlia.items import DahliaItem

class DahliaSpider(scrapy.Spider):
    name = "dahlia"
    allowed_domains = ["johnnyseeds.com"]
    start_urls = [
        "http://www.johnnyseeds.com/v-9-greenhouse-performer.aspx?categoryid=1&pagesize=15&list=1&pagenum=9"
    ]

    def parse(self, response):
        # Each product block on the results page becomes one DahliaItem
        for sel in response.xpath('//div[@class="productResultInfo"]'):
            item = DahliaItem()
            item['name'] = sel.xpath('a/span[@class="nameCAT"]/text()').extract()
            item['extendedName'] = sel.xpath('a/span[@class="extendednameCAT"]/text()').extract()
            item['identification'] = sel.xpath('h1/text()').extract()
            item['description'] = sel.xpath('div[@class="productResultDesc"]/div/text()').extract()
            yield item
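- As with the conifers bot, the results can be written straight to a structured file by adding -o when crawling, for example:
$ scrapy crawl dahlia -o dahlia.json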