bowenpay/wechat-spider

After adding an official account, its articles cannot be crawled.

Opened this issue · 4 comments

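I am already using proxies, and the proxy check reports them as working, but articles are never crawled and the log file shows no exceptions. The proxies in use are ones that passed testing, drawn from several free proxy providers.

The proxies were obtained by running getproxies and checkproxies.

Keyword crawling works normally.

Contents of wechatspider_downloader.stderr.log: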

DEBUG 2017-05-16 15:26:54,391 remote_connection 30530 140135752034048 POST http://127.0.0.1:52273/session/4ab0200e-8c44-464c-84d9-10f2e4801ae3/element/b80b891d-182b-42f4-816e-1250f9e95b66/click {"sessionId": "4ab0200e-8c44-464c-84d9-10f2e4801ae3", "id": "b80b891d-182b-42f4-816e-1250f9e95b66"}
DEBUG 2017-05-16 15:26:54,439 remote_connection 30530 140135752034048 Finished Request
DEBUG 2017-05-16 15:26:57,441 remote_connection 30530 140135752034048 GET http://127.0.0.1:52273/session/4ab0200e-8c44-464c-84d9-10f2e4801ae3/window/handles {"sessionId": "4ab0200e-8c44-464c-84d9-10f2e4801ae3"}
DEBUG 2017-05-16 15:26:57,448 remote_connection 30530 140135752034048 Finished Request
DEBUG 2017-05-16 15:26:57,448 remote_connection 30530 140135752034048 POST http://127.0.0.1:52273/session/4ab0200e-8c44-464c-84d9-10f2e4801ae3/window {"sessionId": "4ab0200e-8c44-464c-84d9-10f2e4801ae3", "handle": "2147483654"}
DEBUG 2017-05-16 15:26:57,469 remote_connection 30530 140135752034048 Finished Request
DEBUG 2017-05-16 15:26:59,443 remote_connection 30529 139889887196928 Finished Request
DEBUG 2017-05-16 15:26:59,443 remote_connection 30529 139889887196928 GET http://127.0.0.1:43591/session/e16f9fec-ad62-42b3-9854-5d5049eef676/title {"sessionId": "e16f9fec-ad62-42b3-9854-5d5049eef676"}
DEBUG 2017-05-16 15:26:59,455 remote_connection 30529 139889887196928 Finished Request
DEBUG 2017-05-16 15:26:59,456 remote_connection 30529 139889887196928 POST http://127.0.0.1:43591/session/e16f9fec-ad62-42b3-9854-5d5049eef676/element {"using": "css selector", "sessionId": "e16f9fec-ad62-42b3-9854-5d5049eef676", "value": "[name=\"query\"]"}
DEBUG 2017-05-16 15:26:59,466 remote_connection 30529 139889887196928 Finished Request
DEBUG 2017-05-16 15:26:59,467 remote_connection 30529 139889887196928 POST http://127.0.0.1:43591/session/e16f9fec-ad62-42b3-9854-5d5049eef676/element/bf262294-6f32-4e78-b2cb-b914c7121fef/value {"text": "BigDataDigest\ue015", "sessionId": "e16f9fec-ad62-42b3-9854-5d5049eef676", "id": "bf262294-6f32-4e78-b2cb-b914c7121fef", "value": ["B", "i", "g", "D", "a", "t", "a", "D", "i", "g", "e", "s", "t", "\ue015"]}
DEBUG 2017-05-16 15:26:59,558 remote_connection 30529 139889887196928 Finished Request
DEBUG 2017-05-16 15:26:59,558 remote_connection 30529 139889887196928 POST http://127.0.0.1:43591/session/e16f9fec-ad62-42b3-9854-5d5049eef676/element {"using": "xpath", "sessionId": "e16f9fec-ad62-42b3-9854-5d5049eef676", "value": "//input[@value='\u641c\u516c\u4f17\u53f7']"}
DEBUG 2017-05-16 15:26:59,565 remote_connection 30529 139889887196928 Finished Request
DEBUG 2017-05-16 15:26:59,565 remote_connection 30529 139889887196928 POST http://127.0.0.1:43591/session/e16f9fec-ad62-42b3-9854-5d5049eef676/element/4a7be0b5-2f3f-4e94-add6-d8c84fbb80ab/click {"sessionId": "e16f9fec-ad62-42b3-9854-5d5049eef676", "id": "4a7be0b5-2f3f-4e94-add6-d8c84fbb80ab"}
DEBUG 2017-05-16 15:26:59,623 remote_connection 30529 139889887196928 Finished Request
DEBUG 2017-05-16 15:27:00,470 remote_connection 30530 140135752034048 POST http://127.0.0.1:52273/session/4ab0200e-8c44-464c-84d9-10f2e4801ae3/execute/sync {"sessionId": "4ab0200e-8c44-464c-84d9-10f2e4801ae3", "args": [], "script": " return document.documentElement.innerHTML; "}
DEBUG 2017-05-16 15:27:00,482 remote_connection 30530 140135752034048 Finished Request
DEBUG 2017-05-16 15:27:00,483 remote_connection 30530 140135752034048 DELETE http://127.0.0.1:52273/session/4ab0200e-8c44-464c-84d9-10f2e4801ae3/cookie {"sessionId": "4ab0200e-8c44-464c-84d9-10f2e4801ae3"}
DEBUG 2017-05-16 15:27:00,492 remote_connection 30530 140135752034048 Finished Request
DEBUG 2017-05-16 15:27:00,492 remote_connection 30530 140135752034048 DELETE http://127.0.0.1:52273/session/4ab0200e-8c44-464c-84d9-10f2e4801ae3 {"sessionId": "4ab0200e-8c44-464c-84d9-10f2e4801ae3"}
DEBUG 2017-05-16 15:27:00,516 remote_connection 30530 140135752034048 Finished Request
DEBUG 2017-05-16 15:27:00,517 abstractdisplay 30530 140135752034048 DISPLAY=:0
DEBUG 2017-05-16 15:27:00,517 __init__ 30530 140135752034048 stopping process (pid=32527 cmd="['Xvfb', '-br', '-screen', '0', '1024x768x24', ':1719']")
DEBUG 2017-05-16 15:27:00,517 __init__ 30530 140135752034048 process is active -> sending SIGTERM
DEBUG 2017-05-16 15:27:00,520 __init__ 30530 140135752034048 process has ended
DEBUG 2017-05-16 15:27:00,520 __init__ 30530 140135752034048 return code=0
DEBUG 2017-05-16 15:27:00,520 __init__ 30530 140135752034048 stdout=
DEBUG 2017-05-16 15:27:00,520 __init__ 30530 140135752034048 stderr=

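This is probably the captcha page being returned, and using a proxy does not help either. Sogou WeChat's blocking rules have changed somewhat recently, and I am reworking the approach.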

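Yes, my debugging confirms that the request is redirected to the captcha page. But visiting the same page in a browser does not require a captcha, so where is the problem?

Also, how is your progress so far? Have you considered crawling by recognizing the captcha? Sogou's captcha appears to consist of letters, so recognizing it should not be very hard.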
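Since the log above shows no exception even though the page returned was a captcha, it may help to detect that case explicitly and log it. The following is only a minimal sketch, not code from wechat-spider: the function name `looks_like_captcha` and the marker strings are assumptions (the `antispider` path segment and the "请输入验证码" prompt are guesses at what the Sogou captcha page contains) and should be adjusted to whatever the real page shows.

```python
from urllib.parse import urlparse

# ASSUMPTION: these markers are illustrative guesses, not taken from the
# wechat-spider source. Replace them with strings seen on the real page.
ASSUMED_URL_MARKERS = ("antispider",)
ASSUMED_HTML_MARKERS = ("请输入验证码",)


def looks_like_captcha(final_url, html,
                       url_markers=ASSUMED_URL_MARKERS,
                       html_markers=ASSUMED_HTML_MARKERS):
    """Return True if the final URL or page source looks like a captcha page."""
    # Check the path of the URL the browser ended up on after redirects.
    path = urlparse(final_url).path.lower()
    if any(marker in path for marker in url_markers):
        return True
    # Fall back to looking for a captcha prompt in the page source.
    return any(marker in html for marker in html_markers)
```

With Selenium, the two arguments could come from `driver.current_url` and `driver.page_source`, so that a captcha page produces a warning in the log instead of being parsed as an empty result.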

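@yijingping I ran into this problem too, and checking with curl confirms that it is indeed the captcha. However, accessing the site with curl -x (proxy) from the crawler server works fine, so my guess is that the crawler's downloader is not using the proxy correctly.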
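One way to test that guess is to repeat the curl -x check from Python and compare it with what the browser session is actually given. The sketch below is written under assumptions and is not the downloader's real code: the helper names are hypothetical, it targets Python 3 (the project may run on Python 2), and `manual_proxy_capability` only builds the proxy object in the shape defined by the W3C WebDriver specification. Whether the downloader passes such a capability to its driver is exactly what would need checking.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_final_url(opener, url, timeout=10):
    """Fetch url and return the URL reached after redirects (needs network)."""
    with opener.open(url, timeout=timeout) as response:
        return response.geturl()


def manual_proxy_capability(host, port):
    """Return a W3C WebDriver 'proxy' capability for a manual HTTP(S) proxy."""
    address = "%s:%d" % (host, port)
    return {
        "proxy": {
            "proxyType": "manual",
            "httpProxy": address,
            "sslProxy": address,
        }
    }
```

If `fetch_final_url` through the proxy lands on a normal result page while the Selenium session for the same URL lands on the captcha page, that would suggest the browser is not routed through the proxy.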