Toutiao API structure partially changed - August 1, 2018

Question

Switch-vov opened this issue 6 years ago · 22 comments

Below is the implementation, updated for the changed API.

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool


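def get_page(offset):
    """Fetch one page of search results as parsed JSON, or None on failure."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # search keyword: "street snap"
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab'
    }
    base_url = 'https://www.toutiao.com/search_content/?'
    url = base_url + urlencode(params)
    try:
        resp = requests.get(url)
        if codes.ok == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        pass
    # Non-200 response or connection failure: tell the caller there is no data.
    return None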


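def get_images(page):
    """Yield one {'image': url, 'title': title} dict per image in a result page."""
    # get_page() returns None on failure, so guard before calling .get().
    if not page or not page.get('data'):
        return
    for item in page.get('data'):
        # Entries carrying a cell_type field are not article results; skip them.
        if item.get('cell_type') is not None:
            continue
        title = item.get('title')
        # image_list may be missing; fall back to an empty list.
        images = item.get('image_list') or []
        for image in images:
            yield {
                'image': 'https:' + image.get('url'),
                'title': title
            }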


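def save_image(item):
    """Download one image into img/<title>/<md5 of content>.jpg."""
    # One folder per article title, under 'img' in the current working directory.
    img_path = 'img' + os.path.sep + item.get('title')
    # exist_ok avoids a race when several worker processes create the same folder.
    os.makedirs(img_path, exist_ok=True)
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            # Name the file after the MD5 of its bytes, so identical images
            # map to the same file and are not written twice.
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image, item %s' % item)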


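def main(offset):
    # Fetch one result page, then download every image it lists.
    page = get_page(offset)
    for item in get_images(page):
        print(item)
        save_image(item)


# Offsets are multiples of 20 (the page size); groups 0..7 give offsets 0..140.
GROUP_START = 0
GROUP_END = 7

if __name__ == '__main__':
    # Each worker process handles one offset.
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()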

lifoxin commented 6 years ago

666

Answer 1 · 2018-08-21T09:15:38.000Z

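I suddenly noticed that the Ajax analysis for crawling Toutiao street-snap photos in your book does not match what the Toutiao site returns now. I spent a long time on it and could not get it working. Luckily I checked here, otherwise I would have thought I had bought a counterfeit book!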

Answer 2 · 2018-08-21T09:19:30.000Z

@guaiguaidashu Some of the examples in the book no longer run correctly, so you need to verify them yourself.

Answer 3 · 2018-08-21T14:11:42.000Z

Thank you for the correction and the contribution, @Switch-vov

Answer 4 · 2018-08-22T09:21:45.000Z

Hello, author. This updated program also fetches only some of the images.

Answer 5 · 2018-08-22T09:24:19.000Z

@FLyingLSJ I had not paid attention to that. When I tested it back then, it downloaded 1000+ images.
You could add some logging and take a look.
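
One way to add such logging, as a rough sketch: count how many entries a result page contains, how many are skipped because they carry cell_type, and how many image URLs remain. The helper name summarize_page and the sample data are made up for illustration; only the field names come from the code above.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s %(message)s')
log = logging.getLogger('toutiao')


def summarize_page(page, offset):
    """Log and return (entries, skipped, images) for one parsed result page.

    Hypothetical helper; `page` is the dict returned by get_page(), or None.
    """
    data = (page or {}).get('data') or []
    # Entries with a cell_type field are the ones get_images() skips.
    skipped = sum(1 for item in data if item.get('cell_type') is not None)
    images = sum(len(item.get('image_list') or [])
                 for item in data if item.get('cell_type') is None)
    log.info('offset=%s entries=%d skipped=%d images=%d',
             offset, len(data), skipped, images)
    return len(data), skipped, images


# Made-up sample that mimics only the fields get_images() reads.
sample = {'data': [
    {'cell_type': 37},
    {'title': 'a', 'image_list': [{'url': '//x/list/1'}, {'url': '//x/list/2'}]},
    {'title': 'b'},
]}
counts = summarize_page(sample, 0)
```

Calling summarize_page(page, offset) inside main() before the download loop would show, per offset, whether images are missing from the response or lost during saving.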

Answer 6 · 2018-08-29T07:05:12.000Z

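Hello, author! I just ran your code and found that some galleries are not downloaded completely. For example, for "街拍杭州：长得好看的女孩，学会了时尚的穿搭技巧，那就是女神！" (https://www.toutiao.com/a6594756719119696397/), only the first four images were fetched. Looking at the page's code, image_list contains only the URLs of the first four images; the links for images 5 and 6 are not in image_list.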

Answer 7 · 2018-08-29T07:22:25.000Z

In get_images(), there is this check:
if item.get('cell_type') is not None:
    continue
What is cell_type? I could not find it in the source.

Answer 8 · 2018-08-29T07:32:50.000Z

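@framelei
Search for "街拍", click the "综合" (General) tab, then inspect the XHR response and find the data field; cell_type appears inside the brackets that follow it.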

Answer 9 · 2018-08-29T08:34:56.000Z

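@YeeChe
Thanks for your reply. I found cell_type under the "综合" (General) tab. It turns out that the data entries containing cell_type have no 'title' or 'image_list' and are not shown on the page, which is why they are skipped with continue.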
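
To illustrate the point with made-up data (real responses carry many more fields, and the cell_type value here is invented), a minimal sketch of the filter looks like this. The helper name article_items is hypothetical.

```python
def article_items(data):
    """Keep only entries without a cell_type field (hypothetical helper)."""
    return [item for item in data if item.get('cell_type') is None]


# Invented sample: the first entry mimics a non-article cell,
# which has neither 'title' nor 'image_list'.
sample_data = [
    {'cell_type': 26, 'display': {}},
    {'title': 'street snap', 'image_list': [{'url': '//example.com/list/abc'}]},
]

kept = article_items(sample_data)
print([item['title'] for item in kept])
```

Without the filter, item.get('image_list') would be None for the first entry and the inner loop in the original get_images() would raise a TypeError.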

Answer 10 · 2018-09-03T06:11:20.000Z

@Switch-vov, so this updated program also fetches only some of the images?

Answer 11 · 2018-09-03T06:12:47.000Z

@goodbad3 Yes.

Answer 12 · 2018-09-03T06:26:31.000Z

@Switch-vov, will you fix it?

Answer 13 · 2018-09-03T09:34:30.000Z

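@goodbad3
The author's code is only an example. I changed it just because I found it no longer ran, and I only adjusted it slightly.
I did not spend much time debugging or optimizing it.

Reading the author's book from cover to cover is very rewarding; there is no need to get hung up on a single point.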

Answer 14 · 2018-09-15T14:20:13.000Z

I open each gallery and take the image addresses from there. It is somewhat slow, but it gets the complete set of images. Feel free to copy:
https://github.com/KunXie/WebCrawlingPractice/blob/master/crawl_touTiaoJiePai/main.py

Answer 15 · 2018-09-20T13:47:08.000Z

Hello, the status code is now 200 but the returned data is empty. Has some kind of encryption been added?

Answer 16 · 2018-10-11T12:20:38.000Z

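The example fetches only the four images that appear on the results page, taken from the image_list field of the JSON. It does not look up the URL of each individual image in the gallery.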

Answer 17 · 2018-12-02T01:42:12.000Z

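@framelei
Search for "街拍", click the "综合" (General) tab, then inspect the XHR response and find the data field; cell_type appears inside the brackets that follow it.

Where is the "综合" (General) tab?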

Answer 18 · 2019-01-05T07:30:15.000Z

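I tried it. Everything it downloads is a small thumbnail rather than the original image, and at the end it raised an error:
OSError: [Errno 22] 文件名、目录名或卷标语法不正确。: 'img\街拍 || 初秋天气多变，传媒美少女演绎潮流穿搭',
(The Windows message says that the filename, directory name, or volume label syntax is incorrect.)

In get_images: image.get("url").replace("list", "large")
For title: if "||" in title: continue, or replace those characters with something else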
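
A minimal sketch of both suggestions, assuming (as the comment above reports, not verified here) that swapping "list" for "large" in the image URL yields the full-size image. The helper names to_large_url and safe_title and the example URL are made up; the character set is the one Windows rejects in file and folder names.

```python
import re

# Characters Windows does not allow in file or folder names.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def to_large_url(url):
    """Swap 'list' for 'large' in the URL, as suggested in the comment above."""
    return url.replace('list', 'large')


def safe_title(title):
    """Replace illegal characters with '_' and trim surrounding whitespace."""
    return _ILLEGAL.sub('_', title).strip()


# Hypothetical URL shaped like the '//host/list/...' values in image_list.
print(to_large_url('//example.com/list/abc123'))
print(safe_title('a || b'))
```

Replacing the characters keeps the article instead of skipping it, which is the second option mentioned above. Note that two different titles can collapse to the same folder name after replacement.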

Answer 19 · 2019-04-07T12:38:20.000Z

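Example title: 小姐姐凹凸有致的身材，完美撑起了这条露腰裙...
I found a small problem while experimenting. If the title ends with "...", the trailing dots are dropped when the folder is created, so saving the image to a path built from the original title (still ending in "...") fails because the folder cannot be found.

save_image: img_path = 'img' + os.path.sep + image.get("title").replace(".", "")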
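
A slightly narrower variant, sketched under the assumption that only the trailing dots and spaces cause the mismatch (Windows strips them from folder names): remove them from the end and leave dots inside the title alone. Inside save_image the dict is named item, so the title would come from item.get('title'). The helper name folder_for and the 'untitled' fallback are made up.

```python
import os


def folder_for(title, root='img'):
    """Build the folder path from a title with trailing dots and spaces removed.

    Hypothetical helper; falls back to 'untitled' if nothing is left.
    """
    cleaned = title.rstrip('. ')
    return os.path.join(root, cleaned or 'untitled')


print(folder_for('some title...'))
print(folder_for('v1.2 street snap...'))
```

Computing the folder path once and reusing the same value for both os.makedirs and the file path keeps the two in agreement.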

Answer 20 · 2019-04-11T09:31:55.000Z

Could you add comments to each part of the code?

Answer 21 · 2019-07-15T08:04:57.000Z

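Hello, author. I have been following Cui's book, but I do not really understand what the save_image function in this exercise does. Could you add comments, or roughly explain what each step does? I have basically gotten your code to run, but I cannot find the folder of downloaded images on my machine, and I am also unclear about file_path in the code.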
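
Reading the code posted at the top of this issue: save_image creates a folder img/<title>, downloads the image, names the file after the MD5 hash of its bytes with a .jpg suffix, and writes it only if that file does not exist yet. The sketch below reproduces just the path-building step so it can be run without network access; build_file_path and the sample values are made up for illustration.

```python
import os
from hashlib import md5


def build_file_path(title, content, root='img'):
    """Return the path save_image would write to (hypothetical helper)."""
    folder = os.path.join(root, title)
    # The file name is the MD5 hex digest of the image bytes plus '.jpg'.
    name = md5(content).hexdigest() + '.jpg'
    return os.path.join(folder, name)


path = build_file_path('demo', b'fake image bytes')
print(path)
# 'img' is a relative path, so it resolves against the current working directory.
print(os.path.abspath(path))
```

Because 'img' is relative, the folder is created in whatever directory the script was started from, which may differ from the directory holding the .py file. Printing os.path.abspath('img') or os.getcwd() from the script shows where to look.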