wnma3mz/wechat_articles_spider

What can I do when crawling skips many posts?

ShixiangWang opened this issue · 5 comments

Thanks a lot for developing this tool. Following https://github.com/wnma3mz/wechat_articles_spider/blob/master/test/test_ArticlesAPI.py, I found that when crawling, several consecutive days' worth of posts get skipped. Is there any way to fix this?

I checked my code and it looks fine: it fetches posts for several consecutive days, then skips several days' posts...

    # Custom crawl, fetching more than 5 articles per request
    start = 0
    count = 10  # the number actually returned is not always the same
    time_delay = 60 * 3

    for i in range(100):
        if i != 0:
            start += len(data)
        print("===============")
        print("Query round: " + str(i))
        print("Start set to: %d" % start)
        print()
        data = loop_query(test, nickname, start, count)
        with open('out.csv', 'a') as f:
            for j in range(len(data)):
                print("Writing wechat post: " + data[j]['title'])
                f.write(data[j]['title'] + ',' + data[j]['link'] + '\n')

I don't know how your loop_query function is written, so I can't debug it.

#!/usr/bin/env python3
# coding: utf-8
# link: <https://github.com/wnma3mz/wechat_articles_spider/blob/master/test/test_ArticlesAPI.py>
import sys
import time
#from pprint import pprint
from wechatarticles import ArticlesAPI
#from wechatarticles import tools

if __name__ == '__main__':
    # Use the official-account platform to get article links, then fetch read/like counts
    official_cookie = ""
    token = ""
    appmsg_token = ""
    wechat_cookie = ""

    nickname = "优雅R"

    # Enter all parameters manually
    test = ArticlesAPI(official_cookie=official_cookie,
                       token=token,
                       appmsg_token=appmsg_token,
                       wechat_cookie=wechat_cookie)


    def loop_query(obj, nickname, begin, count, limits=None):
        # Retry on failure up to `limits` times, waiting time_delay
        # seconds between attempts.
        if limits is None:
            limits = 100
        try:
            data = obj.complete_info(nickname=nickname, begin=begin, count=count)
        except Exception as e:
            print("-----")
            print(e)
            print("-----")
            if limits == 0:
                print("Crawling has finished, or some other error occurred.")
                sys.exit(0)
            print("delay %d seconds" % time_delay)
            limits -= 1
            print("retries left: %d" % limits)
            time.sleep(time_delay)
            return loop_query(obj, nickname, begin, count, limits=limits)
        return data

    # Custom crawl, fetching more than 5 articles per request
    start = 0
    count = 10  # the number actually returned is not always the same
    time_delay = 60 * 3

    for i in range(100):
        if i != 0:
            start += len(data)
        print("===============")
        print("Query round: " + str(i))
        print("Start set to: %d" % start)
        print()
        data = loop_query(test, nickname, start, count)
        with open('out.csv', 'a') as f:
            for j in range(len(data)):
                print("Writing wechat post: " + data[j]['title'])
                f.write(data[j]['title'] + ',' + data[j]['link'] + '\n')
    
    #pprint(data)
    # Custom crawl from a given offset; keeps going until a fetch fails, at most 40 articles per call (untested feature, feel free to try)
    # datas = test.continue_info(nickname=nickname, begin=0)
    #tools.save_json("test.json", data)

The official-account or WeChat cookie has probably expired. If you have time, you could try it with your own regularly used test official account.

The private cookies have been removed.

The `start` here should not be `+= len(data)`.
For example:
with count=5 each time, the number of items returned is not necessarily 5, but start should still be `+= 5`.
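A minimal sketch of why stepping by `len(data)` can skip posts. The `MESSAGES`, `fetch`, and `crawl` names here are hypothetical stand-ins, assuming that `begin`/`count` index paged entries on the server while the API returns a flat list of articles, so one page entry can contribute several articles and `len(data)` can exceed `count`:

```python
# 10 server-side entries, each carrying 2 articles (hypothetical data).
MESSAGES = [["msg%d-art%d" % (i, j) for j in range(2)] for i in range(10)]

def fetch(begin, count):
    # Simulate the API: page by entry offset, return a flat article list.
    articles = []
    for msg in MESSAGES[begin:begin + count]:
        articles.extend(msg)
    return articles

def crawl(step_by_count):
    seen, start, count = [], 0, 5
    while start < len(MESSAGES):
        data = fetch(start, count)
        seen.extend(data)
        # The fix: advance the offset by `count` (entries requested),
        # not by len(data) (articles returned), which overshoots.
        start += count if step_by_count else len(data)
    return seen

print(len(crawl(step_by_count=True)))   # all 20 articles are collected
print(len(crawl(step_by_count=False)))  # only 10; entries 5-9 are skipped
```

With `step_by_count=False`, the first round returns 10 articles, so `start` jumps from 0 straight to 10 and the crawl ends with half the entries never fetched, which matches the "fetches a few days, then skips a few days" symptom.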

OK, thanks. I'll try it when I have time.