Douban Crawler
Introduction
Douban Crawler is a Scrapy project that crawls movie and book information from https://douban.com.
Architecture
The architecture of this project is as follows:
* Each movie record contains:
- Movie Name
- Director
- Release Time
- Country
* Each review record contains:
- Movie Name
- Review Title
- Review Author
- Review Content
- Up Number
- Down Number
- Rate
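The two records listed above could be modelled roughly as follows. This is only a sketch: the class names, field names and field types are assumptions chosen to mirror the lists, not the project's actual item definitions. Plain dataclasses are used so the example runs without Scrapy installed; recent Scrapy versions also accept dataclass instances as items.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class MovieItem:
    """One movie record (hypothetical names mirroring the movie list)."""
    name: str
    director: str
    release_time: str  # kept as raw text; date parsing is out of scope here
    country: str


@dataclass
class ReviewItem:
    """One review record (hypothetical names mirroring the review list)."""
    movie_name: str
    title: str
    author: str
    content: str
    up_number: int = 0
    down_number: int = 0
    rate: Optional[float] = None  # None when the reviewer gave no rating


# Placeholder values for illustration only, not real Douban data.
review = ReviewItem(
    movie_name="Example Movie",
    title="Example title",
    author="example_user",
    content="Example review text.",
    up_number=12,
    down_number=3,
    rate=4.0,
)
record = asdict(review)  # plain dict, ready for a pipeline or JSON export
```

Keeping the movie name on the review record, as the list does, lets reviews be joined back to movies without a separate identifier.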
The graph below shows the underlying architecture of scrapy-redis:
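In practice, scrapy-redis moves the request queue and the duplicate filter into a shared Redis instance so that several crawler processes can work from the same queue. A minimal settings sketch is given here; the Redis URL is an assumption (a local instance on the default port), and the project's real configuration may differ.

```python
# settings.py (sketch): route scheduling and de-duplication through Redis

# Use the scrapy-redis scheduler so all workers pull from one shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the set of already-seen request fingerprints across workers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprint set in Redis between runs, so a crawl
# can be paused and resumed.
SCHEDULER_PERSIST = True

# Assumed: a Redis server on localhost with the default port.
REDIS_URL = "redis://localhost:6379"
```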
Features
- Distributed Crawling: Because the volume of data is very large, the project relies on distributed crawling.
- Robustness: Douban applies anti-crawler measures, e.g. Crawl-delay: 5, so strategies such as rotating the User-Agent and the proxy IP are important in practice.
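The User-Agent and proxy rotation mentioned above could be sketched as a Scrapy downloader middleware along these lines. The class name, the User-Agent strings and the proxy addresses are all hypothetical placeholders, and this is not necessarily how the project implements it. The class has no Scrapy import, so it can be exercised on its own.

```python
import random

# Hypothetical pools; a real deployment would load these from configuration.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
PROXIES = [
    "http://127.0.0.1:8001",  # placeholder proxy endpoints
    "http://127.0.0.1:8002",
]


class RotatingIdentityMiddleware:
    """Assign a random User-Agent and proxy to every outgoing request."""

    def process_request(self, request, spider):
        # Overwrite the User-Agent header for this request only.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Scrapy reads the proxy to use from request.meta["proxy"].
        request.meta["proxy"] = random.choice(PROXIES)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

Such a middleware would be enabled through the `DOWNLOADER_MIDDLEWARES` setting, and setting `DOWNLOAD_DELAY = 5` in `settings.py` would match the crawl delay quoted above.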
License
GPL License.