A test crawler for Zhihu with a small set of features.
Thanks to Zhihu-Login, created by zkqiang.
Thanks to Spider_Hub, created by WiseDoge.
Thanks to Zhihu-captcha-crack-auto-login, created by DueToAttitude.
If there are any problems regarding open-source licenses, please contact me so I can correct them.
Please make sure the folder structure of the repository you clone looks like the following:
.
├── private.json
├── result
│   ├── en1
│   ├── en2
│   └── en_cla
├── source
│   ├── captcha_data
│   ├── font_type
│   └── readme_img
├── zheye-crawler
The content of private.json should look like the following:
{
    "user": "unknown",
    "proxy_on": true,
    "proxies": [
        null
    ],
    "flippagenum": 1,
    "mainpage_url": "https://www.zhihu.com",
    "account_url": "https://www.zhihu.com/settings/account",
    "thread_num": 3,
    "sleep": 1,
    "mongodbnet": {
        "host": "127.0.0.1",
        "port": 27017
    },
    "stdlist": [
        "github"
    ],
    "machinenum": 0,
    "onlyapi": false,
    "lightout": false,
    "threshold": 10000,
    "target_name": [
        "github"
    ],
    "target_url": [
        "https://www.zhihu.com/topic/19566035/followers"
    ]
}
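Before running any of the scripts, you may want to confirm that private.json parses and contains every field from the sample above. The snippet below is only an illustrative sketch and is not part of the project; the key list is copied from the sample file, and the crawler itself may expect more.

import json

# Load the crawler configuration (private.json sits in the repository root,
# as shown in the folder structure above).
with open("private.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Keys copied from the sample file above.
required_keys = [
    "user", "proxy_on", "proxies", "flippagenum", "mainpage_url",
    "account_url", "thread_num", "sleep", "mongodbnet", "stdlist",
    "machinenum", "onlyapi", "lightout", "threshold",
    "target_name", "target_url",
]
missing = [k for k in required_keys if k not in config]
if missing:
    raise KeyError(f"private.json is missing keys: {missing}")
print("private.json looks complete")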
There are some settings you should pay attention to:
"mongodbnet": {
"host": "the ip or domain where your mongodb has been established",
"port": the port mongodb used, which default is 27017
},
"machinenum": a num the machine in your quene,such as 0,1
"onlyapi": True is only get the userinfo and not extend the NewUrl, default is false,
"lightout": if True , the program will shutdown automatically on the schedule set before, default is false,
"threshold": the max page number of either followering or followers, which default is 10000,
"target_name": [
"which topic you want to get the information"
],
"target_url": [
"the followers page url of target_name, you can rungaintopicurl.py
to generate it after you fill the target_name"
]
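For reference, the entries in target_url follow the topic-followers URL pattern visible in the sample config. The sketch below only illustrates that pattern with the topic ID from the sample; resolving a topic name such as "github" to its numeric ID is what gaintopicurl.py is for, and its internals are not described here.

# Illustrative only: build the followers-page URL for a Zhihu topic.
# The numeric topic ID (19566035 corresponds to the sample target_name "github")
# is normally produced by gaintopicurl.py; it is hard-coded here.
def topic_followers_url(topic_id: str, base: str = "https://www.zhihu.com") -> str:
    return f"{base}/topic/{topic_id}/followers"

print(topic_followers_url("19566035"))
# -> https://www.zhihu.com/topic/19566035/followers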
The whole project is run in the following order:
1. modify private.json
2. python3 gaintopicurl.py
3. python3 Master.py
4. python3 Slave.py
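Before launching Master.py, it is worth checking that the MongoDB instance configured under mongodbnet is reachable. This is only a sketch assuming pymongo is installed; the database and collection names the crawler uses are not documented in this README, so only the connection itself is tested.

import json

from pymongo import MongoClient
from pymongo.errors import PyMongoError

with open("private.json", "r", encoding="utf-8") as f:
    net = json.load(f)["mongodbnet"]

try:
    # serverSelectionTimeoutMS keeps the check from hanging if the host is down.
    client = MongoClient(net["host"], net["port"], serverSelectionTimeoutMS=3000)
    client.admin.command("ping")
    print(f"MongoDB reachable at {net['host']}:{net['port']}")
except PyMongoError as exc:
    print(f"MongoDB not reachable: {exc}")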