zhihuquestions

A web spider for zhihu.com, built for zhihuquestions.
It scrapes question and topic data from zhihu.com.

This spider is based on zhihu-spider.

Author

Tian Gao

Run it

What do you need to run it

Python 2.7.6 (it may work with other versions)
MySQL
BeautifulSoup

How to run it

Download the code
Set up your database using MySQL
Initialize your database using init.sql
Find your zhihu.com cookie using your browser's developer tools.
Modify config.ini
If your zhihu username and cookies are set up correctly, you can run initDB.py to load the topics you currently follow into the database as seeds. Otherwise, manually insert some topics into the TOPIC table as scrape seeds.
Use python topic.py to get topics and questions from zhihu.com
Use python question.py to analyze questions from zhihu.com
You have to run topic.py and question.py alternately to make the database grow.
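
The alternation in the last step can be scripted. The following is a minimal sketch and not part of the project: the helper names rotation_commands and run_rotation and the number of rounds are assumptions made for illustration, and the script names are taken from the commands above.

```python
import subprocess
import sys


def rotation_commands(rounds, scripts=("topic.py", "question.py")):
    """Return the commands for `rounds` alternating passes over the scripts."""
    commands = []
    for _ in range(rounds):
        for script in scripts:
            # Reuse the interpreter that is running this helper.
            commands.append([sys.executable, script])
    return commands


def run_rotation(rounds):
    """Run the scripts in turn, stopping at the first failure."""
    for command in rotation_commands(rounds):
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(command)
```

Calling run_rotation(3) from the project directory would run topic.py and question.py three times each, in alternating order.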

Warning

You can increase the number of threads in config.ini to make the spider run faster.
However, zhihu.com may block your IP if you connect too frequently.
Use a proxy when running with multiple threads.
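
Another way to lower the risk of being blocked is to space requests out. The sketch below shows one possible approach and is not taken from the spider's code: the Throttle class and the interval value are hypothetical, and how it would be wired into the spider's request code is left open.

```python
import threading
import time


class Throttle(object):
    """Enforce a minimum delay between requests across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request slot and return the time waited."""
        with self._lock:
            now = time.time()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the slot that follows the one this caller will use.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            # Sleep outside the lock so other threads can reserve later slots.
            time.sleep(delay)
        return delay
```

Each worker thread would share one Throttle instance, for example Throttle(1.0) as an assumed interval, and call wait() before every request to zhihu.com.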

License

The MIT license.