Introduction to Web Scraping using Scrapy

PyGotham 2016 Talk Proposal

Description:

Have you ever wanted to grab data from websites and automatically organize it into a formatted list? Or maybe you want to skip registering for API keys and pull data straight out of a web page? This is an introduction to web scraping: we will build bots with Scrapy that crawl a few sample web pages and extract the information we want. No prior knowledge is required; we'll break down the steps in creating your own bot, and before you know it you'll be scraping the web.

Abstract:

This is a live-coding session and everyone is welcome to code along. We will install Scrapy and use Sublime Text to edit and write our code. Bring your laptops! After a short introduction, we will start building a bot that scrapes all the species of pine trees from a website into an organized dataset. By the end you'll have a whole collection of conifers that can be accessed and analyzed as a structured JSON or CSV file.

The concepts in this session will make more sense if you have programmed before, so some knowledge of programming is recommended. You will get the most out of it if you review the code and build another web crawler afterwards. The material covers object-oriented programming, parsing HTML with XPath, HTML and CSS structure, and exporting CSV and JSON files.

Outline:

Length: 25 minutes

I. Introduction to web scraping and installing Scrapy (Total: 5 minutes)

A. What is web scraping? (2 minutes)
  • Web scraping is a technique used for extracting information from websites. Its main goal is to transform unstructured content from the web, usually in an HTML format, into a structured dataset that can be saved and examined in a spreadsheet or database.
  • Examples: human copy-and-paste, UNIX grep paired with regular expressions, raw HTTP requests, computer-vision web analyzers, or dedicated web-scraping software
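  • For contrast, a minimal hand-rolled scrape using only the Python 2.7 standard library is sketched below; the URL and regular expression are purely illustrative assumptions, and this approach is far more fragile than a real HTML parser
	# Fetch a page over HTTP and pull out matches with a regular expression.
	import re
	import urllib2
	
	html = urllib2.urlopen("http://www.example.com/").read()
	
	# Grab the text of every <h1> element (illustrative pattern, not robust HTML parsing)
	print re.findall(r"<h1>(.*?)</h1>", html)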
B. Real-world examples of web scraping (1 minute)
C. Installing Scrapy (2 minutes)
  • Requirements:
    • Python 2.7 – Scrapy does not yet fully support Python 3, so installing under Python 2.7 is the most stable option
    • pip – Python package management system
    • lxml – Most Linux distributions already have lxml installed
    • OpenSSL – Comes preinstalled on all operating systems except Windows
	$ pip install Scrapy
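  • To confirm the install worked, you can print the version that was installed (exact output will vary by environment):
	$ scrapy version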

II. Scrapy tools and basic structure (Total: 5 minutes)

A. What is Scrapy? (1 minute)
  • “Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival” (scrapy.org).
B. The Scrapy command-line tool and its commands (2 minutes; example invocations follow the list)
$ scrapy <command> -h 
  • Global commands:

    • startproject
    • settings
    • runspider
    • shell
    • fetch
    • view
    • version
  • Project-only commands:

    • crawl
    • check
    • list
    • edit
    • parse
    • genspider
    • bench
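  • A few example invocations (the URL is only illustrative):
	$ scrapy version                                   # print the installed Scrapy version
	$ scrapy fetch --nolog "http://www.example.com/" > example.html   # save a page's body to a file
	$ scrapy shell "http://www.example.com/"           # open an interactive shell with the response loaded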
C. Structure of Scrapy (2 minutes)
tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # Python module where the project code lives
        __init__.py
        items.py          # defines item objects for structured data
        pipelines.py      # performs an action over item objects
        settings.py       # allows for further component customization
        spiders/          # directory with your spiders
            __init__.py
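  • pipelines.py is not used in this talk, but a minimal do-nothing pipeline (a hypothetical sketch, which would also need to be enabled through the ITEM_PIPELINES setting) looks like this:
	class TutorialPipeline(object):
	    # called once for every item the spider yields; returning it keeps it
	    def process_item(self, item, spider):
	        return item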

III. Building a Scrapy bot to extract conifer plants (Total: 15 minutes)

A. Creating a new project (1 minute)
  • Navigate to a directory of your choice
  • Create a new Scrapy project
	$ scrapy startproject conifers
B. Defining field items in items.py (4 minutes)
	import scrapy
	
	class ConifersItem(scrapy.Item):
	    # one field for each piece of data we want to collect per plant
	    name = scrapy.Field()       # common name
	    genus = scrapy.Field()      # genus of the scientific name
	    species = scrapy.Field()    # species epithet of the scientific name
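  • Items behave much like Python dictionaries; a quick sketch of filling one by hand (the value is made up purely for illustration):
	item = ConifersItem()
	item['name'] = ['Eastern white pine']   # illustrative value only
	print item['name']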
C. Building the bot (4 minutes)
  • Open the spiders directory; it contains no spiders at the moment
  • We will add a spider now
  • Create a new file named conifers_spider.py inside the spiders directory
  • Let's start by saving a copy of the web page
	import scrapy
	from conifers.items import ConifersItem
	
	class ConifersSpider(scrapy.Spider):
		name = "conifers"
		allowed_domains = ["greatplantpicks.org"]
		start_urls = [
		"http://www.greatplantpicks.org/plantlists/by_plant_type/conifer"]
	
		def parse(self, response):
			# save the raw HTML so we can inspect the page's structure
			filename = response.url.split("/")[-2] + '.html'
			with open(filename, 'wb') as f:
				f.write(response.body)
  • Save conifers_spider.py
  • Now, go back to the project root directory
  • Run your bot
	$ scrapy crawl conifers
  • You should now see by_plant_type.html in your directory
  • Go back to conifers_spider.py and comment out the function parse
D. Extracting HTML elements using XPath and CSS selectors (4 minutes)
  • We want to retrieve only the common names and scientific names
  • To do this, we create a ConifersItem for each table row and yield each item as the spider parses the page
  • Add this new parse function, leaving the old one commented out (a way to try the selectors interactively is sketched at the end of this step)
	def parse(self, response):
		# each table row (<tr>) holds one plant; build an item per row
		for sel in response.xpath('//tbody/tr'):
			item = ConifersItem()
			item['name'] = sel.xpath('td[@class="common-name"]/a/text()').extract()
			item['genus'] = sel.xpath('td[@class="plantname"]/a/span[@class="genus"]/text()').extract()
			item['species'] = sel.xpath('td[@class="plantname"]/a/span[@class="species"]/text()').extract()
			yield item
  • Go back to the project root directory and run the bot
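  • If a selector does not match what you expect, the XPath expressions can be tested interactively in the Scrapy shell; a short sketch (output omitted):
	$ scrapy shell "http://www.greatplantpicks.org/plantlists/by_plant_type/conifer"
	>>> response.xpath('//tbody/tr')[0].xpath('td[@class="common-name"]/a/text()').extract()
	>>> exit()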
E. Running the bot and exporting the data as CSV and JSON files (2 minutes)
  • Let's export this as a JSON file first
	$ scrapy crawl conifers -o trees_json.json
  • Great, now let's export it as a CSV!
	$ scrapy crawl conifers -o trees_csv.csv
  • Now you have extracted all the conifers. Happy trails!
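  • Once exported, the file can be read back for analysis; a minimal sketch assuming the trees_json.json produced above:
	import json
	
	with open('trees_json.json') as f:
	    trees = json.load(f)
	
	# each record carries the ConifersItem fields; values are lists because
	# XPath .extract() returns every match it finds
	for tree in trees:
	    print tree['name'], tree['genus'], tree['species']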

Additional Notes:

Example of a Scrapy bot: [dahlia](https://github.com/Zovfreullia/intro_to_scrapy/tree/master/dahlia)

A. Running the bot
  • I made a bot that extracts seed names and product identification numbers from Johnny's Seeds (johnnyseeds.com)
  • Download the repository that contains the bot
	$ git clone https://github.com/Zovfreullia/intro_to_scrapy.git
  • Go into the bot's directory
	$ cd intro_to_scrapy/dahlia
  • Run the file
	$ scrapy crawl dahlia
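  • As with the conifers bot, the results can be exported to a file (the filenames here are just examples):
	$ scrapy crawl dahlia -o seeds.json
	$ scrapy crawl dahlia -o seeds.csv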
B. Looking through the code
  • settings.py is left at its defaults
	BOT_NAME = 'dahlia'
	
	SPIDER_MODULES = ['dahlia.spiders']
	NEWSPIDER_MODULE = 'dahlia.spiders'
  • items.py defines the fields for our items
	import scrapy
	
	class DahliaItem(scrapy.Item):
	    name = scrapy.Field()
	    extendedName = scrapy.Field()
	    identification = scrapy.Field()
	    description = scrapy.Field()
  • dahlia_spider.py crawls the site and fills in the fields defined in items.py
	import scrapy
	
	from dahlia.items import DahliaItem
	
	class DahliaSpider(scrapy.Spider):
		name = "dahlia"
		allowed_domains = ["johnnyseeds.com"]
		start_urls = [
		"http://www.johnnyseeds.com/v-9-greenhouse-performer.aspx?categoryid=1&pagesize=15&list=1&pagenum=9"
		]
	
		def parse(self, response):
			# each product listing is wrapped in a div with class "productResultInfo"
			for sel in response.xpath('//div[@class="productResultInfo"]'):
				item = DahliaItem()
				item['name'] = sel.xpath('a/span[@class="nameCAT"]/text()').extract()
				item['extendedName'] = sel.xpath('a/span[@class="extendednameCAT"]/text()').extract()
				item['identification'] = sel.xpath('h1/text()').extract()
				item['description'] = sel.xpath('div[@class="productResultDesc"]/div/text()').extract()
	
				yield item