beauty - Crawling business data with a minimum of requests
Background
This project was created to demonstrate how business data can be collected from web portals in an unobtrusive way.
For this purpose, two German beauty portals are scraped with a minimum of requests.
Implementation
The project is written in Python (3.9) and uses the Scrapy (2.5.0) framework. Scrapy offers a CrawlSpider class whose rules and callbacks can be combined to visit each and every page of a portal.
In this project, however, it was crucial to avoid deep crawling!
The individual pages of companies on these portals were never accessed!
All business data, including cosmetic services, were collected from overview pages.
On one portal, each request furnished roughly ten company entries; on the other, the overview pages for larger cities contained considerably more.
For each of these entries, an item was yielded and exported.
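
A minimal sketch of this overview-only approach is given below. The spider name, domain, URL patterns, and CSS selectors are illustrative assumptions rather than the project's actual code; the point is that the crawl rules only follow listing and pagination pages, never company detail pages, while the callback yields one item per entry found on an overview page.

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class OverviewSpider(CrawlSpider):
    """Sketch: collect company data from overview pages only, without deep crawling."""

    name = "overview_sketch"                       # hypothetical spider name
    allowed_domains = ["portal.example"]           # placeholder domain
    start_urls = ["https://portal.example/kosmetik/"]

    # Follow only listing/pagination URLs and deny company detail pages,
    # so a single request already returns a whole batch of entries.
    rules = (
        Rule(
            LinkExtractor(allow=r"/kosmetik/\S*page=\d+", deny=r"/firma/"),
            callback="parse_overview",
            follow=True,
        ),
    )

    def parse_overview(self, response):
        # Yield one item per company entry listed on the overview page.
        for entry in response.css("div.company-entry"):       # placeholder selector
            yield {
                "name": entry.css("h2::text").get(),
                "address": entry.css(".address::text").get(),
                "services": entry.css(".services li::text").getall(),
            }
```

Exported as JSON (e.g. via `scrapy crawl overview_sketch -o companies.json`), every overview request then contributes a whole batch of such items.
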
How to run
- Deploy a Scrapy container
- Deploy the project via scrapyd-deploy
- Start a crawler using curl (example commands below)
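
The commands below sketch these three steps, assuming the container runs a Scrapyd server exposed on port 6800 and that the Scrapy project is named beauty; the image name, deploy target, and spider name are placeholders rather than values taken from this repository.

```bash
# 1. Run a container that provides a Scrapyd server (image name is a placeholder)
docker run -d --name scrapyd -p 6800:6800 <your-scrapyd-image>

# 2. Deploy the project to that Scrapyd instance; the "default" target has to be
#    configured in scrapy.cfg, and "beauty" is the assumed project name
scrapyd-deploy default -p beauty

# 3. Schedule one of the crawlers via Scrapyd's HTTP API
curl http://localhost:6800/schedule.json -d project=beauty -d spider=<spider-name>
```
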
Remarks
This project is a proof of principle. Both crawlers are fully functional; however, there is neither a proxy middleware nor a database connection.
Enjoy, and keep in mind that fewer requests aren't necessarily a bad thing ;-)