A sample Scrapy project with pagination, item loader, pipelines...

Primary language: Python

bookscraper

This is a Scrapy project to scrape information about books at http://books.toscrape.com/

This project is only meant for educational purposes.

Extracted data

For each book, this project extracts the title, UPC, product type, price, tax, stock availability, number of reviews, and rating.
A sample item:


{
    'title': 'A Light in the Attic',
    'upc': 'a897fe39b1053632',
    'product_type': 'Books',
    'price': '£51.77',
    'tax': '£0.00',
    'stock': 'In stock (22 available)',
    'reviews': '0',
    'rating': '3'
}
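
Every value in the item is a string. If you post-process the exported data, a small conversion step can help. The following is only a sketch and is not part of the project: `BookRecord` and `to_record` are hypothetical names, and it assumes prices always carry a leading `£` sign as in the sample above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class BookRecord:
    """Typed view of one scraped item (hypothetical helper, not in the project)."""
    title: str
    price: Decimal
    tax: Decimal
    reviews: int
    rating: int


def to_record(item):
    """Convert the string fields of a scraped item into numeric types."""
    return BookRecord(
        title=item["title"],
        # Assumption: prices look like '£51.77', so strip the currency sign first.
        price=Decimal(item["price"].lstrip("£")),
        tax=Decimal(item["tax"].lstrip("£")),
        reviews=int(item["reviews"]),
        rating=int(item["rating"]),
    )


sample = {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "tax": "£0.00",
    "reviews": "0",
    "rating": "3",
}
record = to_record(sample)
print(record.price + record.tax)  # 51.77
```

`Decimal` is used instead of `float` so that prices keep their exact two-digit values when summed.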

Spiders

This project contains two spiders: bookscraper-css and bookscraper-xpath. Both work the same way; the first is implemented with CSS selectors and the second with XPath.
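
To illustrate how the two styles relate, here is a minimal sketch that pairs a CSS query with an equivalent XPath query for a few fields. The selectors are assumptions about the markup of a book detail page, not copied from the project's spiders, and `QUERIES` and `extract_fields` are hypothetical names. The only Scrapy API relied on is `response.css(...)`, `response.xpath(...)`, and `.get()` on the result.

```python
# Equivalent CSS and XPath queries for a few fields on a book detail page.
# Assumption: these selectors match the site's markup; adjust them if they do not.
QUERIES = {
    "title": {
        "css": "h1::text",
        "xpath": "//h1/text()",
    },
    "price": {
        "css": "p.price_color::text",
        "xpath": "//p[@class='price_color']/text()",
    },
    "rating": {
        "css": "p.star-rating::attr(class)",
        "xpath": "//p[contains(@class, 'star-rating')]/@class",
    },
}


def extract_fields(response, style):
    """Run every query in the given style ('css' or 'xpath') against a response."""
    if style not in ("css", "xpath"):
        raise ValueError("style must be 'css' or 'xpath'")
    select = getattr(response, style)  # response.css or response.xpath
    # .get() returns the first match as a string, or None if nothing matched.
    return {field: select(query[style]).get() for field, query in QUERIES.items()}
```

Inside a spider's `parse` callback, `extract_fields(response, "css")` and `extract_fields(response, "xpath")` should return the same values if the selector pairs are truly equivalent.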

You can learn more about web scraping with Scrapy by going through the original Scrapy Tutorial or Scrapy Tutorial Series on ScrapingAuthority.com.

Pipelines

This project contains four pipelines. The first processes the "rating" field. The second filters out books whose stock count is greater than five. The other two show how to create JSON and CSV files from the scraped data. You can disable any of the pipelines in settings.py.
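
As a rough sketch of what the first two pipelines could look like (not the project's actual code), the classes below implement Scrapy's `process_item(self, item, spider)` hook. The class names are hypothetical. The rating pipeline assumes the raw rating arrives as an English word such as `Three`, and the filter follows the description above literally by dropping items whose stock count exceeds five.

```python
import re

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass

# Assumption: the raw rating is a word like 'Three' (possibly 'star-rating Three').
WORD_TO_DIGIT = {"One": "1", "Two": "2", "Three": "3", "Four": "4", "Five": "5"}


class RatingPipeline:
    """Normalise the 'rating' field to a digit string (hypothetical name)."""

    def process_item(self, item, spider):
        raw = item.get("rating", "")
        words = raw.split()
        # Use the last word so a full class attribute value also works.
        word = words[-1] if words else ""
        item["rating"] = WORD_TO_DIGIT.get(word, raw)
        return item


class StockFilterPipeline:
    """Drop books whose stock count exceeds max_stock (hypothetical name)."""

    max_stock = 5

    def process_item(self, item, spider):
        # Stock text is expected to look like 'In stock (22 available)'.
        match = re.search(r"\((\d+) available\)", item.get("stock", ""))
        if match and int(match.group(1)) > self.max_stock:
            raise DropItem(f"stock above {self.max_stock}: {item.get('title')}")
        return item
```

Scrapy runs the enabled pipelines in the order given by the integer values in the `ITEM_PIPELINES` setting, so removing a pipeline's entry from that dict in settings.py disables it.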

Running the spiders

You can run a spider using the scrapy crawl command:


$ scrapy crawl bookscraper-css
$ scrapy crawl bookscraper-xpath