Basasuya/Douban_Crawler

Innovation Practice: Douban crawler

Python, GPL-3.0

Douban Crawler

Introduction

Douban Crawler is a Scrapy project that crawls movie and book information from https://douban.com.

Architecture

The architecture of this project is as follows:

* the data required for each movie consists of:

Movie Name
Director
Release Time
Country

* the review data consists of:

Movie Name
Review Title
Review Author
Review Content
Upvote Count
Downvote Count
Rating
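
The two record types listed above could be modelled as follows. This is a minimal sketch, not the project's actual item definitions: the class names and field names are hypothetical, and the field types (for example, release time as a string) are assumptions. Scrapy 2.2 and later accept dataclass objects as items, so classes like these could be yielded directly from a spider.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class MovieItem:
    """One movie record (hypothetical field names)."""
    name: str
    director: str
    release_time: str  # kept as raw text; the date format is an assumption
    country: str


@dataclass
class ReviewItem:
    """One review record, linked to its movie by name (hypothetical field names)."""
    movie_name: str
    title: str
    author: str
    content: str
    upvotes: int = 0
    downvotes: int = 0
    rating: Optional[float] = None  # None when the reviewer gave no rating


# Placeholder values only, to show the shape of a record.
example_review = ReviewItem(
    movie_name="Example Movie",
    title="Example title",
    author="example_user",
    content="Example review text.",
    upvotes=3,
)
example_record = asdict(example_review)  # plain dict, ready for an item pipeline
```

Keeping the movie name on each review mirrors the field list above and lets reviews be joined back to movies without a separate identifier.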

The graph below shows the underlying architecture of scrapy-redis:

Features

Distributed Crawling: Because the volume of data is very large, distributed crawling is necessary for our project.
Robustness: Douban applies anti-robot measures (e.g. Crawl-delay: 5), so strategies such as rotating the User-Agent and proxy IP are important in our crawling practice.
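
The two features above might be configured roughly as follows. This is a sketch under stated assumptions rather than the project's actual configuration: the Redis URL, the user-agent strings, the empty proxy list, and the `RandomUserAgentMiddleware` class are all illustrative. The `SCHEDULER`, `DUPEFILTER_CLASS`, `SCHEDULER_PERSIST`, and `REDIS_URL` settings are the ones scrapy-redis documents for sharing a request queue between workers, and `DOWNLOAD_DELAY` is set to match the Crawl-delay mentioned above.

```python
import random

# --- settings.py fragment (values are assumptions) ---
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared dedup set
SCHEDULER_PERSIST = True                # keep the queue between runs
REDIS_URL = "redis://localhost:6379"    # assumed local Redis instance
DOWNLOAD_DELAY = 5                      # seconds, matching Crawl-delay: 5

# Illustrative pools; real values would be supplied by the operator.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]
PROXIES = []  # e.g. ["http://127.0.0.1:8080"]; empty means no proxy is set


# --- middlewares.py fragment (hypothetical downloader middleware) ---
class RandomUserAgentMiddleware:
    """Pick a random User-Agent (and optionally a proxy) for each request."""

    def __init__(self, user_agents, proxies=()):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler; read both pools
        # from the project settings.
        return cls(
            crawler.settings.getlist("USER_AGENTS"),
            crawler.settings.getlist("PROXIES"),
        )

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware honours this meta key.
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue through the middleware chain
```

To take effect, a middleware like this would need to be registered under `DOWNLOADER_MIDDLEWARES` in the project settings, using whatever module path the project gives it.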

License

GPL-3.0 License.