AmSPy – Crawler and Scraper for Amazon eBook Listings
AmSPy is a Python crawler and scraper for Amazon eBook listings. It is built on top of the Scrapy framework and provides three simple spiders:
- BasicBookSpy: Basic book page scraper. Scrapes either a single page, with the ASIN given as a command line parameter (-a asin=...), or a list of pages, with the ASINs provided in a text file (-a infile=...).
- Top100Spy: Crawls the top 100 books for a given Amazon category and retrieves their overall Kindle eBook sales ranks (plus the other book data retrieved by BasicBookSpy). Either catid and category, or an infile, must be specified via the -a command line option when calling the spider. catid is the 9-10 digit number in the Amazon URL of a category's top 100 listing (used here to construct the URL). category is a descriptive string used to name output files. To crawl multiple categories, a whitespace-separated list of category descriptors and catids can be provided via the -a infile=... option. Uses a custom pipeline, Top100Pipeline in amspy/pipelines.py, to post-process and combine data from the top 100 listing and the individual book pages.
- AlsoSpy: Scrapes the "also bought" titles for each book page in start_urls. start_urls are determined either from an ASIN specified as a command line parameter (-a asin=...) or from a file with a list of ASINs (-a infile=...). The maximum depth to which also-boughts are followed should be set with -s DEPTH_LIMIT=<number> when calling the spider (otherwise the DEPTH_LIMIT value in settings.py is applied).
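The command line options described above can be assembled programmatically. The following is a minimal sketch, not part of AmSPy itself: the helper build_crawl_command is hypothetical, the spider names passed to scrapy crawl are assumed to match the names used in this README, and the catid, category and file name values are placeholders. Only the standard Scrapy options -a (spider argument) and -s (settings override) are used.

```python
def build_crawl_command(spider, spider_args=None, settings=None):
    """Return the argument list for a 'scrapy crawl' call.

    spider_args become '-a key=value' pairs, settings become
    '-s KEY=value' pairs. Keys are sorted so the output is stable.
    """
    cmd = ["scrapy", "crawl", spider]
    for key, value in sorted((spider_args or {}).items()):
        cmd += ["-a", "{}={}".format(key, value)]
    for key, value in sorted((settings or {}).items()):
        cmd += ["-s", "{}={}".format(key, value)]
    return cmd


# Single book page, ASIN taken from the sample record in this README.
single = build_crawl_command("BasicBookSpy", {"asin": "B00RAD0W30"})

# Top 100 listing; catid and category are placeholder values.
top100 = build_crawl_command(
    "Top100Spy", {"catid": "123456789", "category": "sailing"}
)

# Also-boughts from a (hypothetical) ASIN file, followed two levels deep.
also = build_crawl_command(
    "AlsoSpy", {"infile": "asins.txt"}, {"DEPTH_LIMIT": 2}
)

for cmd in (single, top100, also):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.call from inside the project directory. Building the arguments as a list rather than a single string avoids shell quoting problems with file names.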
Each scraped book listing yields a record like the following:
    {'also_boughts': [{'asin': 'B00XEWHNYM',
                       'title_str': 'Sailing-Impunity-Adventure-South-Pacific-ebook',
                       'url': 'https://www.amazon.com/Sailing-Impunity-Adventure-South-Pacific-ebook/dp/B00XEWHNYM'},
                      {'asin': 'B01BHW58LU',
                       'title_str': 'This-hemispheres-people-Jackie-Parry-ebook',
                       'url': 'https://www.amazon.com/This-hemispheres-people-Jackie-Parry-ebook/dp/B01BHW58LU'},
                      {'asin': 'B012BYBDD0',
                       'title_str': 'Get-Real-Gone-Become-Forever-ebook',
                       'url': 'https://www.amazon.com/Get-Real-Gone-Become-Forever-ebook/dp/B012BYBDD0'},
                      {'asin': 'B01G9Y2O2M',
                       'title_str': 'Storm-Proofing-your-Boat-Gear-ebook',
                       'url': 'https://www.amazon.com/Storm-Proofing-your-Boat-Gear-ebook/dp/B01G9Y2O2M'},
                      {'asin': 'B011PPNIRA',
                       'title_str': 'Around-World-Six-Years-circumnavigation-ebook',
                       'url': 'https://www.amazon.com/Around-World-Six-Years-circumnavigation-ebook/dp/B011PPNIRA'},
                      {'asin': 'B00U01QTIQ',
                       'title_str': 'Stress-free-Sailing-Single-Short-handed-Techniques-ebook',
                       'url': 'https://www.amazon.com/Stress-free-Sailing-Single-Short-handed-Techniques-ebook/dp/B00U01QTIQ'}],
     'asin': u'B00RAD0W30',
     'authors': [u'Nadine Slavinski', u'Markus Schweitzer'],
     'avrg_rating': 4.7,
     'file_size': 12719,
     'item_type': 'book_page',
     'num_reviews': 34,
     'price': 6.99,
     'print_length': 389,
     'pub_date': u'February 10, 2015',
     'rank': {u'Books > Travel > Australia & South Pacific > General': 341,
              u'Kindle Store > Kindle eBooks > Nonfiction > Sports > Water Sports > Sailing': 206,
              u'Kindle Store > Kindle eBooks > Nonfiction > Travel > Australia & South Pacific': 144,
              u'Paid in Kindle Store': 416903},
     'url': 'https://www.amazon.com/Pacific-Crossing-Notes-Sailors-Coconut-ebook/dp/B00RAD0W30'}
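To illustrate how such a record might be consumed, here is a small sketch that pulls a few values out of a trimmed copy of the sample above. The helper functions are hypothetical and not part of AmSPy. The sketch assumes that the overall Kindle rank is stored under the key 'Paid in Kindle Store', as in the sample, and that an ASIN is the 10-character code following /dp/ in the URL.

```python
import re

# Assumption: the ASIN is the 10-character code after '/dp/' in the URL.
ASIN_IN_URL = re.compile(r"/dp/([A-Z0-9]{10})")

# Trimmed copy of the sample record shown above.
item = {
    "also_boughts": [
        {"asin": "B00XEWHNYM",
         "url": "https://www.amazon.com/Sailing-Impunity-Adventure-South-Pacific-ebook/dp/B00XEWHNYM"},
        {"asin": "B01BHW58LU",
         "url": "https://www.amazon.com/This-hemispheres-people-Jackie-Parry-ebook/dp/B01BHW58LU"},
    ],
    "asin": "B00RAD0W30",
    "rank": {
        "Books > Travel > Australia & South Pacific > General": 341,
        "Kindle Store > Kindle eBooks > Nonfiction > Sports > Water Sports > Sailing": 206,
        "Kindle Store > Kindle eBooks > Nonfiction > Travel > Australia & South Pacific": 144,
        "Paid in Kindle Store": 416903,
    },
    "url": "https://www.amazon.com/Pacific-Crossing-Notes-Sailors-Coconut-ebook/dp/B00RAD0W30",
}

OVERALL_KEY = "Paid in Kindle Store"


def overall_rank(record):
    """Return the overall Kindle Store rank, or None if it is missing."""
    return record.get("rank", {}).get(OVERALL_KEY)


def best_category(record):
    """Return (category, rank) for the best (lowest) category rank."""
    categories = {name: rank for name, rank in record.get("rank", {}).items()
                  if name != OVERALL_KEY}
    if not categories:
        return None
    name = min(categories, key=categories.get)
    return name, categories[name]


def also_bought_asins(record):
    """Return the ASINs of all also-bought titles, in listing order."""
    return [entry["asin"] for entry in record.get("also_boughts", [])]


def asin_from_url(url):
    """Extract the ASIN from a book page URL, or None if none is found."""
    match = ASIN_IN_URL.search(url)
    return match.group(1) if match else None


print(overall_rank(item))        # 416903
print(best_category(item))
print(also_bought_asins(item))   # ['B00XEWHNYM', 'B01BHW58LU']
print(asin_from_url(item["url"]))  # B00RAD0W30
```

The also-bought ASINs returned here could, for example, be written one per line to a text file and passed back to a spider via -a infile=....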
Currently runs under Python 2.7. Requires Scrapy and Pandas. MIT license. Use responsibly – complying with Amazon's Conditions of Use is your responsibility.