Sharing a script that combines search-page traversal with gallery scraping

Question

siuszy opened this issue 4 years ago · 23 comments

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool
import re
import random

def get_page(offset):
    headers = {
        'cookie': 'tt_webid=6787304267841324551; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6787304267841324551; csrftoken=6c8d91e61b7db691bfa45021a0e7e511; UM_distinctid=16ffb7f1fcfec-0d6566ad15973e-396a4605-144000-16ffb7f1fd02eb; s_v_web_id=k631j38t_qMPq6VOD_jioN_4lgi_BQB1_GGhAVEKoAmXJ; __tasessionId=ocfmlmswt1580527945483',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
    }
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url = base_url + urlencode(params)
    # print(url)
    try:
        resp = requests.get(url, headers=headers)
        if 200  == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        return None
    
def get_images(json):
    headers = {
        'cookie': 'tt_webid=6787304267841324551; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6787304267841324551; csrftoken=6c8d91e61b7db691bfa45021a0e7e511; UM_distinctid=16ffb7f1fcfec-0d6566ad15973e-396a4605-144000-16ffb7f1fd02eb; s_v_web_id=k631j38t_qMPq6VOD_jioN_4lgi_BQB1_GGhAVEKoAmXJ; __tasessionId=ocfmlmswt1580527945483',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
    }
    if json.get('data'):
        data = json.get('data')
        for item in data:
            if item.get('title') is None:  # skip leading entries that are not articles
                continue
            title = re.sub('[\t]', '', item.get('title'))  # article title, tabs removed
            url = item.get("article_url")  # link to the article page
            if url is None:
                continue
            # Thumbnails from the search result; used for non-gallery pages and as the fallback
            images = item.get('image_list') or []
            try:
                resp = requests.get(url, headers=headers)
                if 200 == resp.status_code:
                    images_pattern = re.compile(r'JSON\.parse\("(.*?)"\),\n', re.S)
                    result = re.search(images_pattern, resp.text)
                    if result is None:  # not a gallery page
                        for image in images:
                            # use origin/pgc-image instead of large/pgc-image for the original image
                            origin_image = re.sub("list.*?pgc-image", "large/pgc-image", image.get('url'))
                            yield {
                                'image': origin_image,
                                'title': title
                            }
                    else:  # gallery page: read the JSON data embedded under gallery
                        url_pattern = re.compile('url(.*?)"width', re.S)
                        result1 = re.findall(url_pattern, result.group(1))
                        for i in range(len(result1)):
                            yield {
                                'image': "http://p%d.pstatp.com/origin/pgc-image/" % (random.randint(1, 10)) +
                                         result1[i][result1[i].rindex("u002F") + 5:result1[i].rindex('\\"')],  # image URL
                                'title': title
                            }
            except requests.ConnectionError:  # article page unreachable: fall back to the search-result thumbnails
                for image in images:
                    origin_image = re.sub("list.*?pgc-image", "large/pgc-image", image.get('url'))
                    yield {
                        'image': origin_image,
                        'title': title
                    }
                
def save_image(item):
    img_path = 'img' + os.path.sep + item.get('title')
    if not os.path.exists(img_path):
        os.makedirs(img_path)  # create the target directory
    try:
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(), 
                file_suffix='jpg')  # path of the individual image file
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content) 
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except Exception as e:
        print(e)
        
def main(offset):
    json = get_page(offset)
    if json is None:  # request failed or returned a non-200 status
        return
    for item in get_images(json):
        save_image(item)

if __name__ == '__main__':
    '''
    for i in range(3):
        main(20*i)
    '''
    pool = Pool()
    groups = ([x * 20 for x in range(0, 3)])
    pool.map(main, groups)

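The bulk of the code is based on @Anodsaber's version and on teacher Cui's video walkthrough! I added a few comments of my own from working through it.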

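The main change is in get_images, which gains a branch. If the article page linked from the search results turns out to be a gallery loaded as JSON, the function goes on to extract the image URLs from that page and yields them (get_images is a generator). If the page shows its images directly, it does what the earlier version did and takes the image URLs from the search results. There is also an exception handler: if opening the article link fails, it falls back to the URLs from the search results.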
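
The branch and fallback described above can be sketched roughly as follows. This is a minimal illustration rather than the code from this post: `collect_image_urls` and the `fetch_gallery` callable are hypothetical names, and `fetch_gallery` is assumed to return a list of image URLs for a gallery page, an empty list for a non-gallery page, and to raise `ConnectionError` when the page cannot be opened.

```python
def collect_image_urls(item, fetch_gallery):
    """Return gallery image URLs if available, else the search-page thumbnails."""
    # URLs already present in the search result; these are the fallback.
    fallback = [img.get('url') for img in item.get('image_list') or []]
    url = item.get('article_url')
    if not url:
        return fallback
    try:
        gallery = fetch_gallery(url)  # hypothetical callable, see the text above
    except ConnectionError:
        # Could not open the article page: keep what the search page gave us.
        return fallback
    # A non-empty result means the page was a JSON-loaded gallery.
    return gallery or fallback
```

Computing the fallback list before the `try` means the exception handler never touches a variable that was not assigned. When the fetch is done with requests, the exception to catch would be `requests.ConnectionError`, which is not a subclass of the built-in `ConnectionError` used in this sketch.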

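While scraping the gallery pages I found that the image links look like http://p1.pstatp.com/origin/pgc-image/path , where the number in the p1 part changes seemingly at random. I don't really understand the mechanism behind it, so to be safe I build each URL with a randomly chosen number (random.randint(1, 10) in the code, i.e. an integer from 1 to 10 inclusive).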
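
The URL construction can be isolated into a small helper, sketched below. `build_origin_url` and its `host_numbers` parameter are hypothetical names, and the default range simply mirrors the 1 to 10 draw in the code; which pN hosts actually serve images is not established in this thread.

```python
import random


def build_origin_url(image_key, host_numbers=range(1, 11)):
    """Build an origin-size image URL on a randomly chosen pN host."""
    # Assumption: any host number in host_numbers serves the same image.
    n = random.choice(list(host_numbers))
    return "http://p%d.pstatp.com/origin/pgc-image/%s" % (n, image_key)
```

Passing a single-element list such as `host_numbers=[3]` pins the host, which is what the later versions in this thread do with a fixed p3.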

This is my first post as a beginner, so any feedback is welcome!

Answer 1 · 2020-02-09T18:49:30.000Z

Thanks for the update, it has been added to the Issue!

Answer 2 · 2020-02-16T09:02:39.000Z

For gallery pages, how did you work out that the number after the p is a random number up to 10? Really impressive.

Answer 3 · 2020-02-16T12:21:32.000Z

Why is it that when I ran it, nothing came out at all?

Answer 4 · 2020-02-17T03:13:00.000Z

Why is it that when I ran it, nothing came out at all?

Change the cookie.

Answer 5 · 2020-02-17T04:00:16.000Z

Thanks, it runs now. I hadn't added a cookie before. One more question: the request parameters also include a timestamp. Can that parameter just be ignored?

Answer 6 · 2020-02-17T06:07:53.000Z

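@knownothingdog Adding timestamp ensures that every crawl sees the galleries as of the same moment, which is handy for testing. In practice it works fine without it; on toutiao the timestamp is simply the current time when the crawler runs.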
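
For anyone who does want to send it, a sketch of adding the parameter is below. `with_timestamp` is a hypothetical helper, and the millisecond unit is an assumption; the thread only says the value is the current time.

```python
import time


def with_timestamp(params, now=None):
    """Return a copy of params with a 'timestamp' entry added."""
    if now is None:
        now = time.time()
    stamped = dict(params)  # leave the caller's dict untouched
    # Assumption: the site expects milliseconds since the Unix epoch.
    stamped['timestamp'] = int(now * 1000)
    return stamped
```

Passing a fixed `now` reproduces the "same moment" behaviour mentioned above, since repeated runs then send an identical value.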

Answer 7 · 2020-02-21T02:48:25.000Z

@siuszy Amazing! I ran it and it works. One thing I don't get: how did you find the http://p1.pstatp.com/origin/pgc-image/path link format? I'd like to learn your method.

Answer 8 · 2020-02-22T09:07:42.000Z

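@hannzhao Thanks. I found it by right-clicking an image and opening it; it is probably an API toutiao uses to store images and display them on the page.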

Answer 9 · 2020-02-22T09:17:34.000Z

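@A1bertY I'm not sure whether you mean the approach below. If the cookies can be obtained directly and stay valid for a fairly long time, putting them straight into headers also seems like a reasonably clean option.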

ses = requests.Session()
c = requests.cookies.RequestsCookieJar()
# cookies: a list of dicts with "name" and "value" keys (not defined in this snippet)
for item in cookies:
    c.set(item["name"], item["value"])
ses.cookies.update(c)

Answer 10 · 2020-02-24T15:13:27.000Z

Same question: how do you get the correct cookies in code? I tried a plain requests.get and a session, but the cookies I get back are clearly incomplete.
{'tt_webid': '6797025835590174215'}
@Germey @siuszy

Answer 11 · 2020-02-25T15:11:54.000Z

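The thing is, this doesn't actually download all the images under an entry. I spent a whole evening checking and it was the case almost everywhere: the images are incomplete. Some posts go by image_list, but that field doesn't contain all the images. When I open an entry and look for the image links in the Network panel, I can't find them; they seem to be encoded. I'm a beginner, so a few pointers would be appreciated.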

Answer 12 · 2020-02-27T03:54:54.000Z

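Update, February 27, 2020, to @siuszy's code. The main changes are:
1. Cookies are fetched automatically with selenium;
2. The code flow is streamlined and some unnecessary code is removed;
3. The path is extracted with a regular expression, which is simpler and cleaner.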
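
The first item boils down to turning the browser's cookies into a single header value. A minimal sketch, assuming the input is a list of dicts with `name` and `value` keys (the shape selenium's `get_cookies()` returns); `cookies_to_header` is a hypothetical name, not part of the update:

```python
def cookies_to_header(cookies):
    """Join cookie dicts with 'name' and 'value' keys into a Cookie header value."""
    pairs = []
    for cookie in cookies:
        name, value = cookie.get('name'), cookie.get('value')
        if name is None or value is None:
            continue  # skip malformed entries instead of failing
        pairs.append('%s=%s' % (name, value))
    return '; '.join(pairs)
```

The result can be placed under the `cookie` key of a headers dict, as the scripts in this thread do.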

import requests,re,os
from hashlib import md5
from selenium import webdriver

def get_cookies(url):
    cookie_str = ''  # avoid shadowing the built-in str
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        for i in browser.get_cookies():
            name = i.get('name')
            value = i.get('value')
            cookie_str = cookie_str + name + '=' + value + ';'
    finally:
        browser.quit()  # close the headless browser once the cookies are read
    return cookie_str

def get_page(offset):
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    url='https://www.toutiao.com/api/search/content/'
    try:
        r=requests.get(url,params=params,headers=headers)
        if r.status_code==200:
            return r.json()
        else:
            print('requests get_page error!')
    except requests.ConnectionError:
        return None

def get_images(json):
    data=json.get('data')
    if data:
        for i in data:
            if i.get('title'):
                title=re.sub('[\t]','',i.get('title'))
                url=i.get('article_url')
                if url:
                    r=requests.get(url,headers=headers)
                    if r.status_code==200:
                        images_pattern = re.compile('JSON.parse\("(.*?)"\),\n', re.S)
                        result = re.search(images_pattern, r.text)
                        if result:
                            b_url='http://p3.pstatp.com/origin/pgc-image/'
                            up=re.compile('url(.*?)"width',re.S)
                            results=re.findall(up,result.group(1))
                            if results:
                                for result in results:
                                    yield {
                                        'title':title,
                                        'image':b_url+re.search('F([^F]*)\\\\",',result).group(1)
                                    }
                        else:
                            images = i.get('image_list')
                            for image in images:
                                origin_image = re.sub("list.*?pgc-image", "large/pgc-image",
                                                      image.get('url'))  # use origin/pgc-image for the original image
                                yield {
                                    'image': origin_image,
                                    'title': title
                                }

def save_image(item):
    img_path = 'img' + os.path.sep + item.get('title')
    if not os.path.exists(img_path):
        os.makedirs(img_path)  # create the target directory
    try:
        resp = requests.get(item.get('image'))
        if requests.codes.ok == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')  # path of the individual image file
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except Exception as e:
        print(e,'none123')

def main(offset):
    a = get_page(offset)
    for i in get_images(a):
        save_image(i)

cookies = get_cookies('https://www.toutiao.com')
headers = {
    'cookie': cookies,
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
}

if __name__=='__main__':
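    #p.map(main,[0]) # Pool is not used because there is not yet a way to share the cookies across processes
    #map(main,[x*20 for x in range(3)]) # produces no output: in Python 3 map() is lazy, so main never runs unless the result is consumed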
    for i in [x*20 for x in range(3)]:
        main(i)

Answer 13 · 2020-02-29T02:32:11.000Z

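Update, February 29, 2020:

1. Downloads now run in multiple processes that share the cookies
2. Related functions are grouped together and the code is tidied up
3. The folder naming logic is changed to strip non-conforming characters
4. The program's run time is now measured

Features: the cookies are obtained by visiting the Toutiao home page, and all later data is fetched by sending the requests directly (simulated form submission), which is somewhat faster than driving a browser for everything.

Limitation: my JS reverse-engineering skills are not good enough to work out how the values in the cookie are generated, so I settled for this approach.

Note: this program uses selenium, so the Chrome browser must be installed.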
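
The cookie sharing in this update comes down to binding the same data to the worker before handing it to the pool. A stripped-down sketch of that idea, with no network access: `describe_page` and `run_pages` are hypothetical stand-ins for the real page worker, and the headers dict is assumed to hold plain strings so it can be pickled.

```python
from functools import partial
from multiprocessing.pool import Pool


def describe_page(offset, headers):
    """Stand-in for a page worker: returns what it would request."""
    return offset, headers['cookie']


def run_pages(offsets, headers, processes=2):
    """Map describe_page over offsets, giving every worker the same headers."""
    # partial of a module-level function plus a plain dict can be pickled,
    # which is what Pool.map needs to ship the work to other processes.
    worker = partial(describe_page, headers=headers)
    with Pool(processes) as pool:
        return pool.map(worker, offsets)
```

`run_pages` should be called from under an `if __name__ == '__main__':` guard, as the script below does for its own pool.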

import requests,re,os
from hashlib import md5
from multiprocessing.pool import Pool
from selenium import webdriver
from functools import partial
import time

class Getcookie(object):
    def __init__(self,url):
        self.cookies = self.get_cookies(url)
        self.headers = {
            'cookie': self.cookies,
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
            'referer': 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D',
        }
    def get_cookies(self,url):
        cookie_str = ''  # avoid shadowing the built-in str
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        browser = webdriver.Chrome(options=options)
        try:
            browser.get(url)
            for i in browser.get_cookies():
                name = i.get('name')
                value = i.get('value')
                cookie_str = cookie_str + name + '=' + value + ';'
        finally:
            browser.quit()  # close the headless browser once the cookies are read
        return cookie_str

    def get_page(self,offset):
        params = {
            'aid': '24',
            'app_name': 'web_search',
            'offset': offset,
            'format': 'json',
            'keyword': '街拍',
            'autoload': 'true',
            'count': '20',
            'en_qc': '1',
            'cur_tab': '1',
            'from': 'search_tab',
            'pd': 'synthesis',
        }
        url='https://www.toutiao.com/api/search/content/'
        try:
            r=requests.get(url,params=params,headers=self.headers)
            if r.status_code==200:
                return r.json()
            else:
                print('requests get_page error!')
        except requests.ConnectionError:
            return None

    def get_images(self,json):
        data=json.get('data')
        if data:
            for i in data:
                if i.get('title'):
                    title=re.sub('[\t]','',i.get('title'))
                    url=i.get('article_url')
                    if url:
                        r=requests.get(url,headers=self.headers)
                        if r.status_code==200:
                            images_pattern = re.compile('JSON.parse\("(.*?)"\),\n', re.S)
                            result = re.search(images_pattern, r.text)
                            if result:
                                b_url='http://p3.pstatp.com/origin/pgc-image/'
                                up=re.compile('url(.*?)"width',re.S)
                                results=re.findall(up,result.group(1))
                                if results:
                                    for result in results:
                                        yield {
                                            'title':title,
                                            'image':b_url+re.search('F([^F]*)\\\\",',result).group(1)
                                        }
                            else:
                                images = i.get('image_list')
                                for image in images:
                                    origin_image = re.sub("list.*?pgc-image", "large/pgc-image",
                                                          image.get('url'))  # use origin/pgc-image for the original image
                                    yield {
                                        'image': origin_image,
                                        'title': title
                                    }

def save_image(item):
    img_path = 'img' + os.path.sep + ''.join(re.findall(r'[\u4e00-\u9fa5a-zA-Z0-9]+',item.get('title'),re.S)) # keep only Chinese characters, letters and digits so the folder name is valid
    if not os.path.exists(img_path):
        os.makedirs(img_path) # create the target directory
    try:
        resp = requests.get(item.get('image'))
        if requests.codes.ok == resp.status_code:
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')  # path of the individual image file
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except Exception as e:
        print(e,'none123')

def main(offset,getcookie):
    a = getcookie.get_page(offset)
    for i in getcookie.get_images(a):
        save_image(i)


if __name__=='__main__':
    start=time.perf_counter()
    getcookie=Getcookie('https://www.toutiao.com')
    p_work=partial(main,getcookie=getcookie)
    p=Pool()
    groups=[x * 20 for x in range(0, 3)]
    p.map(p_work,groups)
    end=time.perf_counter()
    print('Elapsed time: '+str(end-start)+' s')

Answer 14 · 2020-03-10T05:09:51.000Z

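Update, March 10, 2020:
I'm new to crawlers and the program is crude. If any experts are reading, please give me some advice. Thanks!

# Before running, replace the cookie with your own (open a browser, search on Toutiao and log in, then copy your cookie).
# Also check the search keyword and make sure the top-level folder already exists (the folder path appears in 4 places and the keyword in 1).
# The script pages 6 times by default, so confirm that your search has enough results for 6 pages (there is a limit: keep scrolling and see how many JSON responses load; some searches only go up to offset=180, i.e. 9 times).
# For lack of time the keyword and top-level folder are not read as input; edit the code if you need to change them.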
import requests
import re
from urllib import request
import os
import random

headers = {
    'cookie': 'csrftoken=9494ae34e9e75cdfdf6a977b9834f1c2; tt_webid=6801649363082216967; ttcid=b6f21a882d6a422a8bcf2fe3dd044d9a31; WEATHER_CITY=%E5%8C%97%E4%BA%AC; SLARDAR_WEB_ID=f32ac3e9-b215-4984-bcf8-1db3a8a187cf; tt_webid=6801649363082216967; sso_uid_tt=4e51ff47e369b88c05102c33158b3bd5; sso_uid_tt_ss=4e51ff47e369b88c05102c33158b3bd5; toutiao_sso_user=fda05989c553dfe23cb314a273195daf; toutiao_sso_user_ss=fda05989c553dfe23cb314a273195daf; sid_guard=4f1c7eea3c3edd896413f228cde9e367%7C1583633831%7C5184000%7CThu%2C+07-May-2020+02%3A17%3A11+GMT; uid_tt=41a95671a8e5d4f4300b1cf1263a84d9; uid_tt_ss=41a95671a8e5d4f4300b1cf1263a84d9; sid_tt=4f1c7eea3c3edd896413f228cde9e367; sessionid=4f1c7eea3c3edd896413f228cde9e367; sessionid_ss=4f1c7eea3c3edd896413f228cde9e367; s_v_web_id=verify_k7jttqgy_gtUHQcJQ_tg1G_4JaZ_8YOI_dEvuza9I76JH; __tasessionId=o9xyn362l1583745314873; tt_scid=fCtjtxzJ5.oygNq9E6MKnA0Luj.RM5YlI64ZqlBkqzIlApWDsGVVHTOiw-12jTtPa222',
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
}

# Remove characters that are not allowed in file names
def correct_title(title):
    # Better not to keep '.' in file names either; some systems raise an error
    error_set = ['/', '\\', ':', '*', '?', '"', '|', '<', '>','.']
    for c in title:
        if c in error_set:
            title = title.replace(c, '')
    return title


def get_article_url(url,offset):
    article_urls = []
    gallery_article_url =[]
    params = {
        'aid': '24',
        'app_name': 'web_search',
        # parameter that controls paging
        'offset': offset,
        'format': 'json',
        # keyword for the image search
        'keyword': '章若楠',
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    response = requests.get(url=url,headers=headers,params=params).json()
    for i in response['data']:
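        # Keep only entries that have an article link and contain no video
        # dict.get returns None when the key is missing, whereas [] raises an error
        if i.get('article_url') is not None and i.get('has_video') == False:
            # distinguish gallery (manual carousel) articles from ordinary image articles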
            if i.get('has_gallery') == False:
                article_urls.append(i.get('article_url'))
            else:
                gallery_article_url.append(i.get('article_url'))
        else:
            continue
    # Returns a tuple: ([ordinary article links], [gallery article links])
    return article_urls,gallery_article_url


def get_img(urls):
    # If there are ordinary article links
    if len(urls[0]) >= 1:
        for i in urls[0]:
            print(i)
            html = requests.get(url=i, headers=headers).text
            # Group 1 matches the article title; group 2 matches the string that contains the image links
            img_str = re.search('articleInfo.*?title: \'&quot;(.*?)&quot;.*?content(.*?)groupId', html, re.S)
            # Some links cannot be downloaded, e.g. WeChat official-account articles (their images are not great anyway, so they are simply skipped)
            try:
                # Name the second-level folder after the article title
                filename = img_str.group(1)
                # Strip characters that are illegal in file names
                filename = correct_title(filename)
                # Create the folder named after the title
                try:
                    path = './zhangruonan/' + filename
                    # mkdir needs the top-level folder to exist already; only makedirs creates folders recursively
                    os.mkdir(path)
                # If the folder name already exists, append a random number to it
                except FileExistsError:
                    path = './zhangruonan/' + filename + str(random.randint(1, 100))
                    os.mkdir(path)
                # Regular expression that matches the image links
                img_pattern = '.*?http(:.*?)\&quot'
                img_list = re.findall(img_pattern, img_str.group(2), re.S)
                for i in img_list:
                    # Replace the extra escape characters in each link
                    i = i.replace('\\u002F', '/')
                    i = i.replace('\\', '')
                    # Rebuild the full URL
                    i = 'http' + i
                    # Use the last path segment (including the "/") as the image name
                    img_title = i[i.rfind("/"):]
                    request.urlretrieve(i, path + img_title + ".jpg")
                print(filename + ": " + str(len(img_list)) + " images downloaded!")
            except AttributeError:
                print("This link cannot be downloaded!")
                continue
    # If there are gallery (manual carousel) article links
    if len(urls[1]) >=1:
        for i in urls[1]:
            print(i)
            html = requests.get(url=i, headers=headers).text
            # Match the gallery article's title and the string that contains the image links
            img_str = re.search('BASE_DATA.galleryInfo.*?title: \'(.*?)\'.*?gallery(.*?)\)', html,re.S)
            # The steps below are essentially the same as above (I'm inexperienced, so there is some duplicated code; pointers from experts are welcome)
            try:
                text = img_str.group(2)
                filename = img_str.group(1)
                filename = correct_title(filename)
                try:
                    path = './zhangruonan/' + filename
                    os.mkdir(path)
                except FileExistsError:
                    path = './zhangruonan/' + filename + str(random.randint(1, 100))
                    os.mkdir(path)
                # This regex matches the image links; it takes some study to follow
                img_pattern = r'{\\"url\\":\\"http([^,]*?)\\",\\"width.*?}'
                img_list = re.findall(img_pattern,text,re.S)
                for i in img_list:
                    i = i.replace(r'\\\u002F','/')
                    i = "http" + i
                    img_title = i[i.rfind("/"):]
                    request.urlretrieve(i, path + img_title + ".jpg")
                print(filename + ": " + str(len(img_list)) + " images downloaded!")
            except AttributeError:
                print("This link cannot be downloaded!")
                continue


if __name__ == "__main__":
    url = 'https://www.toutiao.com/api/search/content/'
    # paging
    for i in range(0,6):
        print("Starting page "+str(i+1)+"!")
        offset = i * 20
        urls = get_article_url(url,offset)
        get_img(urls)

Answer 15 · 2020-03-30T11:24:15.000Z

@siuszy Sorry to bother you: the parameters include a _signature parameter. Is it OK to leave it out?

Answer 16 · 2020-04-10T06:16:42.000Z

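@siuszy Sorry to bother you: the parameters include a _signature parameter. Is it OK to leave it out?

This signature seems to affect the order of the nodes inside data: with it, the nodes follow the order on the web page; without it, the order is scrambled.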

Answer 17 · 2020-04-21T04:41:29.000Z

What part of the page does images_pattern match?

Answer 18 · 2020-07-28T15:07:27.000Z

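What part of the page does images_pattern match?

There are two kinds of page: one has text and images with all the images laid out; the other shows one image at a time, and you click an arrow to see the next one.
So if 'JSON' is found in the text returned by requests, the page can be taken to be the latter kind.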
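
That check can be written as a small predicate, sketched here. The pattern is the same `JSON.parse("...")` idea as images_pattern in the original post, written as a raw string; `is_gallery_page` and `GALLERY_PATTERN` are hypothetical names, and the sample strings in the usage note are made up rather than real page content.

```python
import re

# The gallery data is embedded in the page as the argument of a
# JSON.parse("...") call that is followed by a comma and a newline.
GALLERY_PATTERN = re.compile(r'JSON\.parse\("(.*?)"\),\n', re.S)


def is_gallery_page(html):
    """True if the page embeds gallery data via JSON.parse, False otherwise."""
    return GALLERY_PATTERN.search(html) is not None
```

For instance, `is_gallery_page('gallery: JSON.parse("{}"),\n')` is true, while a page with only ordinary markup gives false.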

Answer 19 · 2020-08-05T13:47:11.000Z

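Here is my humble attempt.

I borrowed @SofiaCherry's method for avoiding illegal characters in file names, thanks.
I verified that the timestamp and _signature fields can be left out.
I added a check in get_images.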

#!/usr/bin/python3

import requests
import os
from hashlib import md5
from multiprocessing.pool import Pool

headers = {
    'cookie': 'csrftoken=1490f6d92e97ce79f9e52dc4f3222608; ttcid=22698125819f4938826fc916af6b7e7355; SLARDAR_WEB_ID=f754d5f8-83ce-4577-8f77-a232e1708142; tt_webid=6856774172980561421; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6856774172980561421; __tasessionId=v8lvfouyt1596594815875; s_v_web_id=kdgrbwsk_9Ussu5RZ_AUmC_4DO5_8s8w_21Pv7qDIVeeE; tt_scid=AyiNMhl4GyKjhxNFpcm5AWgbRD7dsl-Zu4nBHWPBkHFf6lAynwUzX3zbMRIWr.De95f9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}


def correct_title(title):
    # Remove characters that are not allowed in file names
    # Better not to keep '.' either; some systems raise an error
    error_set = ['/', '\\', ':', '*', '?', '"', '|', '<', '>', '.']
    for c in title:
        if c in error_set:
            title = title.replace(c, '')
    return title


def get_page(offset):

    params = {
        'aid': 24,
        'app_name': 'web_search',
        # parameter that controls paging
        'offset': offset,
        'format': 'json',
        # keyword for the image search
        'keyword': '街拍',
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    url = 'http://www.toutiao.com/api/search/content/'
    try:
        response = requests.get(url, params=params, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print(e)
        return None


def get_images(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_list')
            if title and images:  # entries without a title would break correct_title
                for image in images:
                    yield {'image': image.get('url'),
                           'title': title
                           }
            else:
                print('Skipped')


def save_image(item):
    dir_name = 'day13/' + correct_title(item.get('title'))
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(dir_name,
                                             md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image')


def main(offset):
    json = get_page(offset)
    if json is None:  # request failed or returned a non-200 status
        return
    for item in get_images(json):
        print(item)
        save_image(item)


if __name__ == '__main__':
    # paging
    for i in range(0, 6):
        print("Starting page "+str(i+1)+"!")
        offset = i * 20
        main(offset)

Answer 20 · 2020-11-30T14:02:43.000Z

Your code throws errors when I run it now. Could the OP debug and fix it? I can't find where the problem is.

Answer 21 · 2020-12-02T12:39:26.000Z

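Thanks for the update, it has been added to the Issue!

OSError: [Errno 22] The filename, directory name, or volume label syntax is incorrect: 'img\博主街拍 | 穿出女大佬气质她们把针织大衣搭配的如此霸气'
Hello, this error comes up for a few directories while downloading. Could you explain it?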

Answer 22 · 2021-01-12T08:54:22.000Z

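Thanks for the update, it has been added to the Issue!

OSError: [Errno 22] The filename, directory name, or volume label syntax is incorrect: 'img\博主街拍 | 穿出女大佬气质她们把针织大衣搭配的如此霸气'
Hello, this error comes up for a few directories while downloading. Could you explain it?

Windows directory names cannot contain the | character, which is why it fails.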
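
One way to avoid the error is to clean the title before using it as a folder name. The sketch below is an assumption-laden illustration, not code from the thread: `sanitize_dir_name` is a hypothetical helper that replaces the characters Windows rejects in names and trims trailing spaces and dots, which Windows also handles poorly.

```python
import re

# Characters that Windows does not allow in file or directory names.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def sanitize_dir_name(title, replacement='_'):
    """Return title with illegal characters replaced, safe to use as a folder name."""
    cleaned = _ILLEGAL.sub(replacement, title)
    cleaned = cleaned.rstrip(' .')  # drop trailing spaces and dots
    return cleaned or 'untitled'    # never return an empty name
```

In save_image this would wrap the title, for example `'img' + os.path.sep + sanitize_dir_name(item.get('title'))`.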

Answer 23 · 2021-01-12T08:56:14.000Z

Thanks for your reply. I'll keep digging. If you have a fix for this bug, I hope you can post it under that thread. Thanks.
