What can I do when the crawl skips a lot of posts?
ShixiangWang opened this issue · 5 comments
ShixiangWang commented
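Thank you very much for developing this tool. Following https://github.com/wnma3mz/wechat_articles_spider/blob/master/test/test_ArticlesAPI.py, I found that the crawl skips several consecutive days of posts. Is there any way to fix this?
I checked my code and it looks correct to me, but it fetches several consecutive days of posts and then skips the next several days.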
# Custom crawl: fetch at least 5 articles per request
start = 0
count = 10  # the number actually returned varies between requests
time_delay = 60 * 3
for i in range(100):
    if i != 0:
        start += len(data)
    print("===============")
    print("Query round: " + str(i))
    print("Start set to: %d" % start)
    print()
    data = loop_query(test, nickname, start, count)
    with open('out.csv', 'a') as f:
        for j in range(len(data)):
            print("Writing wechat post: " + data[j]['title'])
            f.write(data[j]['title'] + ',' + data[j]['link'] + '\n')
wnma3mz commented
I don't know how your loop_query function is written, so I can't debug this.
ShixiangWang commented
#!/usr/bin/env python3
# coding: utf-8
# link: <https://github.com/wnma3mz/wechat_articles_spider/blob/master/test/test_ArticlesAPI.py>
import os
import sys
#from pprint import pprint
from wechatarticles import ArticlesAPI
#from wechatarticles import tools
import time

if __name__ == '__main__':
    # Use the official account to get article links, then fetch read and like counts
    official_cookie = ""
    token = ""
    appmsg_token = ""
    wechat_cookie = ""
    nickname = "优雅R"
    # Enter all parameters manually
    test = ArticlesAPI(official_cookie=official_cookie,
                       token=token,
                       appmsg_token=appmsg_token,
                       wechat_cookie=wechat_cookie)

    def loop_query(obj, nickname, begin, count, limits=None):
        # Retry the same query after a delay, at most `limits` more times
        if limits is None:
            limits = 100
        try:
            data = obj.complete_info(nickname=nickname, begin=begin, count=count)
        except Exception as e:
            print("-----")
            print(e)
            print("-----")
            if limits == 0:
                print("Crawling has finished, or some other error occurred.")
                sys.exit(0)
            print("delay %d seconds" % time_delay)
            limits = limits - 1
            print("time limits: %d" % limits)
            time.sleep(time_delay)
            return loop_query(obj, nickname, begin, count, limits=limits)
        return data

    # Custom crawl: fetch at least 5 articles per request
    start = 0
    count = 10  # the number actually returned varies between requests
    time_delay = 60 * 3
    for i in range(100):
        if i != 0:
            start += len(data)
        print("===============")
        print("Query round: " + str(i))
        print("Start set to: %d" % start)
        print()
        data = loop_query(test, nickname, start, count)
        with open('out.csv', 'a') as f:
            for j in range(len(data)):
                print("Writing wechat post: " + data[j]['title'])
                f.write(data[j]['title'] + ',' + data[j]['link'] + '\n')
    #pprint(data)
    # Custom crawl starting from a given offset; keeps crawling until a request fails,
    # at most 40 articles in one go (untested, feel free to try)
    # datas = test.continue_info(nickname=nickname, begin=0)
    #tools.save_json("test.json", data)
ShixiangWang commented
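The official-account and WeChat cookies have probably expired by now. If you have time, you could try it with an official account you normally use for testing.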
wnma3mz commented
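I've removed the private cookies from your comment.
The start here should not be incremented with += len(data).
For example:
with count=5 on every request, the number of items returned is not necessarily 5, but start should still be incremented by 5 (start += 5).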
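To illustrate the point, here is a minimal sketch that does not call wechatarticles at all. It assumes a made-up backend (`SLOTS`, `fake_query`, and `crawl` are all hypothetical names) in which `begin` and `count` address offset positions and each position may yield zero, one, or several articles. Whether the real interface behaves exactly like this is an assumption; the sketch only shows why stepping by `len(data)` can drift away from the offsets that were actually requested.

```python
# Hypothetical stand-in for the real backend: each "slot" is one offset
# position and may hold zero, one, or several article titles.
SLOTS = [["a1"], ["b1", "b2", "b3"], [], ["d1"], ["e1", "e2"], ["f1"], [], ["h1"]]


def fake_query(begin, count):
    """Return every title stored in slots [begin, begin + count)."""
    return [title for slot in SLOTS[begin:begin + count] for title in slot]


def crawl(step_by_count, count=2):
    """Page through SLOTS, advancing start by count or by len(data)."""
    start, seen = 0, []
    while start < len(SLOTS):
        data = fake_query(start, count)
        seen.extend(data)
        if step_by_count:
            start += count  # advance by the requested window size
        else:
            # advance by the number of items returned; max(..., 1) only
            # keeps this variant from looping forever on an empty page
            start += max(len(data), 1)
    return seen


print(crawl(step_by_count=True))
print(crawl(step_by_count=False))
```

With this toy data, stepping by `count` returns every title, while stepping by `len(data)` jumps past the slot holding `d1`, which matches the symptom of whole runs of posts being skipped.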
ShixiangWang commented
OK, thanks. I'll try it when I have time.