A web spider for zhihu.com, used for zhihuquestions.
It scrapes question and topic data from zhihu.com.
This spider is based on zhihu-spider.
- Python 2.7.6 (it may work with other versions)
- MySQL
- BeautifulSoup
- Download the code
- Set up your database using MySQL
- Initialize your database using init.sql
- Find your zhihu.com cookie using your browser's developer tools.
- Modify config.ini
- If your zhihu username and cookie are set correctly, run initDB.py to load the topics you currently follow into the database as seeds. Otherwise, manually insert some topics into the TOPIC table as scrape seeds.
- Run python topic.py to get topics and questions from zhihu.com.
- Run python question.py to analyze questions from zhihu.com.
- Run topic.py and question.py in rotation to make the database grow.
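
The rotation in the last step can be scripted. The following is only a minimal sketch: it assumes each script exits on its own after one pass and returns exit code 0 on success, which this README does not confirm. `run_in_rotation` and its parameters are hypothetical names, not part of this repository.

```python
import subprocess
import sys


def run_in_rotation(scripts, rounds, runner=subprocess.call):
    """Run each script once per round, in order.

    Stops at the first non-zero exit code and returns the list of
    (script, exit_code) pairs executed so far.
    `runner` is injectable so the loop can be tested without
    launching real processes (an assumption of this sketch).
    """
    history = []
    for _ in range(rounds):
        for script in scripts:
            # Launch the script with the same interpreter running this file.
            code = runner([sys.executable, script])
            history.append((script, code))
            if code != 0:
                return history
    return history

# Example call (not executed here):
# run_in_rotation(["topic.py", "question.py"], rounds=3)
```

Stopping on the first failure keeps a broken cookie or a blocked IP from triggering repeated failing runs.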
You can increase the thread count in config.ini to make the spider run faster.
However, zhihu.com may block your IP if you connect too frequently.
Consider using a proxy when running in multi-thread mode.
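
One way to lower the risk of being blocked when several threads are running is to enforce a minimum interval between requests across all of them. The sketch below is an illustration only and is not how this spider is implemented: `Throttle`, `min_interval`, and the injectable `clock` and `sleep` arguments are hypothetical names.

```python
import threading
import time


class Throttle(object):
    """Space out calls so that, across all threads, consecutive
    requests are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock          # injectable for testing
        self._sleep = sleep          # injectable for testing
        self._lock = threading.Lock()
        self._next_allowed = 0.0     # earliest time the next request may start

    def wait(self):
        """Block until this caller's slot arrives; return the delay used."""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed)
            # Reserve the slot before releasing the lock so that
            # concurrent callers queue up behind it.
            self._next_allowed = start + self.min_interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Each worker thread would call `wait()` on a shared `Throttle` instance immediately before sending a request. The sleep happens outside the lock, so waiting threads do not block others from reserving later slots.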
The MIT license.