scrapper
scrapper is small, Python 3.3+ web scrapping library using lxml and requests.
Usage
First start with Field
, it the one that define the data you are
looking for, or more specific, it's select data from content using selector
defined by you. It takes three parameters:
selector
- it's a XPath selectorcallback
- (optional) a function that will be fired on data
Class Field
is used to define fields inside subclass of
Item
:
class AmazonEntry(scrapper.Item):
title = scrapper.Field(
"//*[@id='productTitle']/text()",
lambda value, content, response: value.strip() if value else None,
)
price = scrapper.Field(
"//span[@class='a-color-price']/text()",
lambda value, _, __: value.strip() if value else None,
)
img = scrapper.Field(
'//div[@id="imgTagWrapperId"]/img/@data-a-dynamic-image',
get_image,
)
After creating a subclass of Item
we can instantiate it and the
constructor takes following parameters:
url
- webpage on what we are going to get datacaller
- (optional) in what class we created this instancecontent
- (optional) if we already have contents of the site this is the place to pass it
Using given example we can use it like this:
product = AmazonEntry(link)
print 'title: %s\nprice: %s\nimg:%s' % (
product.title, product.price, product.img,
)
Repetitions on site
When there is more than one occurrence of data set you are looking for you, then
you should use ItemSet
. It's designed to look for repetitions in
content.
When creating a class, you have to setup two attributes:
content_selector
- it's selector, that we are going to iterate over content selected bt thisitem_class
- it's aItem
subclass that look for a data in content selected by above selector.
import scrapper
class ImgurEntry(scrapper.Item):
link = scrapper.Field('//a[@class="image-list-link"]/@href')
description = scrapper.Field(
'//a[@class="hover"]/p/text()',
lambda value, _, __: value.strip() if value else None,
)
class ImgurEntryItemSet(scrapper.ItemSet):
item_class = ImgurEntry
content_selector = '//div[@class="cards"]/div[@class="post"]'
for item in ImgurEntryItemSet('http://imgur.com/'):
print("url: %s; %s" % (item.link, item.description))
Crwaling over pages on site
Class Pagination
can be used when there is need to iterate over pages.
When
url
- starting url, on this page we will search for url addresses to process by scrapperitem_class
- class that is used for actual processing of site, need to be instance ofItem
orPagination
links_selector
- XPath selector for links to pages that should be iterated over, usually this should point toa
tags in paginatornext_selector
- XPath selector for selecting next site, will always select link from currently processed page
import scrapper
class WykopEntry(scrapper.Item):
title = scrapper.Field(
'//div[contains(@class, "lcontrast")]/h2/a/text()',
lambda value, content, response: value.strip() if value else None,
)
link = scrapper.Field(
'//div[contains(@class, "lcontrast")]/h2/a/@href',
)
class WykopEntries(scrapper.ItemSet):
item_class = WykopEntry
content_selector = '//*[@id="itemsStream"]//li[contains(@class, "link")]'
class WykopPagination(scrapper.Pagination):
url = 'http://www.wykop.pl/'
item_class = WykopEntries
links_selector = '//a[@class="button"]/@href'
for item_set in WykopPagination():
for item in item_set:
print "title: %s (%s)" % (item.title, item.link))
Controlling iteration
By default scrapper will go over pages selected by links_selector
or select
next link on currently processed page using
Iteration over pages is done in next_link
method. To manually control what
pages should be processed you need to override this method. This method must
yield single url, which is next page to crawl over.
Examples
See /examples/ for more simple usages.