A Scrapy Zhihu Crawler
zhihu-scrapy is a distributed crawler system for the Zhihu website. The data it gathers includes user profiles, followees, and followers. The collected data can be used for various purposes (e.g. finding communities or identifying popular answer posters).
###How does it work?
It combines the following systems:
- scrapy (parsing and logging)
- selenium (downloading and executing javascript)
- redis (queueing and storing results)
The crawler system uses one main redis server to manage crawling records. In addition, each crawling machine runs its own local redis server for storing the user data it collects.
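The division of labor described above can be sketched with in-memory stand-ins. This is only an illustration under assumptions: the class names `MainRecords` and `CrawlingMachine`, the pending queue, the seen set, and the `fetch_profile` callback are all hypothetical, and the real project keeps this state in redis with a key layout that is not described here.

```python
from collections import deque


class MainRecords:
    """In-memory stand-in for the crawling records on the main redis server (assumed layout)."""

    def __init__(self):
        self.pending = deque()  # user ids waiting to be crawled
        self.seen = set()       # user ids that were queued at some point

    def add_user_ids(self, user_ids):
        """Queue only ids that have not been seen before; return how many were added."""
        added = 0
        for user_id in user_ids:
            if user_id not in self.seen:
                self.seen.add(user_id)
                self.pending.append(user_id)
                added += 1
        return added

    def next_user_id(self):
        """Hand out the next id to crawl, or None when the queue is empty."""
        return self.pending.popleft() if self.pending else None


class CrawlingMachine:
    """One crawler with its own local store (a dict standing in for the local redis)."""

    def __init__(self, records):
        self.records = records
        self.local_store = {}

    def crawl_one(self, fetch_profile):
        """Take one id from the shared records, store its data locally, queue its neighbours."""
        user_id = self.records.next_user_id()
        if user_id is None:
            return None
        profile = fetch_profile(user_id)  # hypothetical callback returning a dict
        self.local_store[user_id] = profile
        self.records.add_user_ids(profile["followees"] + profile["followers"])
        return user_id
```

Several `CrawlingMachine` objects can share one `MainRecords` instance, so no user is handed out twice, while each machine keeps the profiles it fetched in its own store.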
###How to get started?
Start a redis server on the main server and on each crawling machine.
Add initial users to the main redis server with Monitor, for example:
>> from zhihu.utils import Monitor
>> init_list = ['first-id',]
>> Monitor.add_user_ids(init_list)
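When the seed list comes from a file or a pasted block of text, it may help to tidy it before seeding. The helper below is a small sketch, not part of the project: `clean_user_ids` is a hypothetical name, and whether `Monitor.add_user_ids` already removes duplicates itself is not stated here.

```python
def clean_user_ids(raw_ids):
    """Strip whitespace, drop empty entries, and remove duplicates, keeping first-seen order."""
    cleaned = []
    seen = set()
    for raw in raw_ids:
        user_id = raw.strip()
        if user_id and user_id not in seen:
            seen.add(user_id)
            cleaned.append(user_id)
    return cleaned


# 'first-id' comes from the example above; 'second-id' is a made-up placeholder.
init_list = clean_user_ids(["first-id", " first-id ", "", "second-id\n"])

# With the project on the import path, the list would then be passed on as shown above:
# from zhihu.utils import Monitor
# Monitor.add_user_ids(init_list)
```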
In `zhihu/settings.py`, set `REDIS_HOST` to the IP address of the main redis server.
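As an illustration, the relevant line in the settings file might look like the fragment below. The address is a placeholder from a range reserved for documentation, not a real server; other settings in the file are left out.

```python
# zhihu/settings.py (fragment)
# Placeholder address: replace it with the IP address of your main redis server.
REDIS_HOST = "192.0.2.10"
```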
Use `scrapy crawl zhihu_people` to start a crawler.
###How to solve captchas?
The Monitor class is provided for monitoring crawlers, which includes solving captchas for them. To solve captchas for all crawlers that are currently waiting on one, use:
>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.solve_captchas()
###How to add accounts?
Each crawler needs to fetch an account from the account pool before it can start. To add an account to the account pool, use:
>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.add_account('username','password')
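Since every crawler needs its own account, you will probably want to add several at once. The sketch below wraps the `add_account` call shown above in a loop. The helper name `add_accounts` and the list-of-pairs format are assumptions made for illustration, not part of the project.

```python
def add_accounts(monitor, credentials):
    """Call monitor.add_account once per (username, password) pair and return the count."""
    count = 0
    for username, password in credentials:
        monitor.add_account(username, password)
        count += 1
    return count


# With the project on the import path, usage could look like this
# (the usernames and passwords are placeholders):
# from zhihu.utils import Monitor
# m = Monitor()
# add_accounts(m, [("username1", "password1"), ("username2", "password2")])
```

The helper only relies on the object having an `add_account(username, password)` method, so it can be tried out with a stand-in object before touching the real account pool.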
###How to check stats?
>> from zhihu.utils import Monitor
>> m = Monitor()
>> m.stats()
###License: GPL v3