zheye-crawler

a testing crawler of zhihu which have tiny functions

Thanks to Zhihu-Login, created by zkqiang
Thanks to Spider_Hub, created by WiseDoge
Thanks to Zhihu-captcha-crack-auto-login, created by DueToAttitude

If there are any problems with the open-source licenses, please contact me so I can correct them.

Please make sure the folder structure of the cloned repository looks like this:
.
├── private.json
├── result
│   ├── en1
│   ├── en2
│   └── en_cla
├── source
│   ├── captcha_data
│   ├── font_type
│   └── readme_img
├── zheye-crawler
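
The following is a minimal sketch (not part of the repository) that checks whether the entries from the tree above are present before you run the crawler; the list of paths is taken directly from that tree.

import os

# Paths taken from the directory tree above.
EXPECTED = [
    "private.json",
    "result/en1",
    "result/en2",
    "result/en_cla",
    "source/captcha_data",
    "source/font_type",
    "source/readme_img",
    "zheye-crawler",
]

missing = [path for path in EXPECTED if not os.path.exists(path)]
if missing:
    print("Missing entries:", ", ".join(missing))
else:
    print("Folder structure looks complete.")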

The content of private.json should look like the following:

{
    "user": "unknown",
    "proxy_on": true,
    "proxies": [
        null
    ],
    "flippagenum": 1,
    "mainpage_url": "https://www.zhihu.com",
    "account_url": "https://www.zhihu.com/settings/account",
    "thread_num": 3,
    "sleep": 1,
    "mongodbnet": {
        "host": "127.0.0.1",
        "port": 27017
    },
    "stdlist": [
        "github"
    ],
    "machinenum": 0,
    "onlyapi": false,
    "lightout": false,
    "threshold": 10000,
    "target_name": [
        "github"
    ],
    "target_url": [
        "https://www.zhihu.com/topic/19566035/followers"
    ]
}

Some settings need extra care (a short loading sketch follows this list):
"mongodbnet": {
    "host": "the IP or domain where your MongoDB instance is running",
    "port": "the port MongoDB listens on; the default is 27017"
},
"machinenum": the index of this machine in your queue, such as 0 or 1,
"onlyapi": if true, only fetch the user info and do not extend new URLs; the default is false,
"lightout": if true, the program shuts down automatically at the schedule set beforehand; the default is false,
"threshold": the maximum page number for either following or followers; the default is 10000,
"target_name": [
    "the topic you want to collect information about"
],
"target_url": [
    "the followers-page URL of target_name; you can run gaintopicurl.py to generate it after filling in target_name"
]
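
How the crawler consumes these values internally is not documented here, but as a rough sketch, a script could load private.json and open the MongoDB connection from the "mongodbnet" block like this (the database name "zheye" is a made-up placeholder, not taken from the repository):

import json

from pymongo import MongoClient

# Read the configuration file described above.
with open("private.json", encoding="utf-8") as f:
    config = json.load(f)

# Open the MongoDB connection using the "mongodbnet" block.
client = MongoClient(config["mongodbnet"]["host"], config["mongodbnet"]["port"])
db = client["zheye"]  # hypothetical database name

print("user:", config["user"])
print("threads:", config["thread_num"], "sleep:", config["sleep"])
print("collections currently in the database:", db.list_collection_names())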

The overall workflow of the project is:

modify private.json

python3 gaintopicurl.py

python3 Master.py

python3 Slave.py
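
The exact lookup that gaintopicurl.py performs is not described here, but judging from the example config above, a topic's followers page follows the pattern https://www.zhihu.com/topic/<topic_id>/followers. Below is a minimal sketch that builds such URLs once the topic IDs are known; the only ID shown is the one paired with "github" in the example private.json.

# Sketch only: gaintopicurl.py presumably resolves topic names to IDs itself;
# here the ID for "github" is simply taken from the example config above.
TOPIC_IDS = {
    "github": "19566035",
}

def followers_url(topic_name):
    """Build the followers-page URL in the pattern seen in the example config."""
    return "https://www.zhihu.com/topic/{}/followers".format(TOPIC_IDS[topic_name])

if __name__ == "__main__":
    print(followers_url("github"))
    # -> https://www.zhihu.com/topic/19566035/followers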