
Getting Started with Web Crawlers (Part 1): A First Look at Scrapy


  • Preface

Web crawlers have quite a reputation, but never having actually used one feels a little behind the times. Since I'm in the middle of some research anyway, it's a good excuse to finally play with one.
Let's set a concrete goal first: scrape the current temperature shown on a weather website, and see if we can grab that reading.

  • Technology choice

Without much deliberation I went with the most popular option, Python's Scrapy framework. I don't actually know how good it is, but going by the numbers, something this widely used shouldn't cause too much trouble.

  • Implementation

1. Install
pip install Scrapy
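A quick note of my own: chompjs is used further down to parse the JavaScript payload, so it can be installed in the same step; putting everything in a virtual environment is optional but keeps the experiment tidy.

python -m venv venv
source venv/bin/activate       # on Windows: venv\Scripts\activate
pip install Scrapy chompjs     # chompjs is only needed later, for parsing the JS payload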

2. Analyze the target page

[screenshot of the page source] Looking at the markup, locating the element via class="sk-temp" seems the most convenient approach.
3. "Debug" the scrape from the command line

This is one of Scrapy's more practical tools: without writing a single line of code, you can try the scrape interactively and debug it in real time.
Let's give it a try first and talk about impressions afterwards.

scrapy shell 'http://sh.weather.com.cn/'
>>> response.css('p.sk-temp').get()
'<p class="sk-temp"><span></span><em>℃</em></p>'

The structure came back, but there doesn't seem to be any data in it.
Calling view(response) opens a preview of the downloaded page in the browser.
The problem is that the data is fetched dynamically; the static page simply doesn't contain it.
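As a quick sanity check of my own (same shell session as above), asking for the span's text directly also comes back empty, while the static ℃ label in the em element is there; this confirms the number itself is injected by JavaScript:

>>> response.css('p.sk-temp span::text').get()   # returns None: the span has no static text
>>> response.css('p.sk-temp em::text').get()
'℃'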

4. Hitting a problem
There seem to be only two ways forward at this point:
  • Run the JavaScript locally in a simulated browser and then scrape the rendered result. There is a middleware option for this, scrapy-splash, but it doesn't look cheap to set up either, so I'll shelve it for now.
  • Capture the relevant network requests and take the data straight from the API.

I'll try going straight for the API data first.

5. An instant reality check
A big site like this has presumably put anti-scraping measures in place:

  • The request route is hard to pin down
  • The parameters are hard to pin down
  • The cookies follow no discernible pattern

Trying to brute-force it clearly isn't going to work!

It turns out I'd been fooled!
The cookie values are just a smokescreen!
The reason I couldn't fetch the data was simply that the Referer header hadn't been added.
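Before wiring this into a spider, here is a sketch of my own for confirming the theory in the Scrapy shell (it reuses the sk_2d URL captured later in the post); fetch() accepts a full Request object, so the Referer header can be attached to it:

>>> from scrapy import Request
>>> req = Request('http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284',
...               headers={'Referer': 'http://sh.weather.com.cn/'})
>>> fetch(req)            # re-issue the request with the Referer attached
>>> response.text[:80]    # the JS payload should now come back instead of a block page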

6. Create the crawler project

scrapy startproject temp_api   
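For reference, the skeleton that scrapy startproject generates is the standard Scrapy layout, roughly as below; the spider file itself (temp_api.py) is added under spiders/ afterwards.

temp_api/
├── scrapy.cfg            # project / deploy configuration
└── temp_api/
    ├── __init__.py
    ├── items.py          # item definitions (not used in this experiment)
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py       # ROBOTSTXT_OBEY, feed export encoding, etc.
    └── spiders/
        └── __init__.py
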
7. Find the dynamic request URL
Since the API request URL is embedded in the main HTML and there's no guarantee it stays fixed, for the sake of relative stability it needs to be extracted from the body dynamically.

After a fair bit of trial and error:

>>> response.css('script[src*=sk_2d]').get()
'<script type="text/javascript" src="http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284"></script>'
>>> response.css('script[src*=sk_2d]').attrib["src"].get()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'str' object has no attribute 'get'
>>> response.css('script[src*=sk_2d]').attrib["src"]
'http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284'
>>> quit()

In the end, response.css('script[src*=sk_2d]').attrib["src"] does the extraction.
As you can see, sk_2d is the keyword in this script's URL, which makes it fairly easy to identify.
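A small alternative of my own: the ::attr() pseudo-element extracts the same value but returns None instead of raising a KeyError if the tag or attribute is ever missing, which fails a little more gracefully on a page that changes:

# equivalent extraction that returns None rather than raising when nothing matches
src = response.css('script[src*=sk_2d]::attr(src)').get()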

8. Parse the data from the second request

The data returned by the second request is actually JavaScript.
The official Scrapy documentation recommends chompjs for this.
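To illustrate what chompjs does (the variable name and values below are invented for the example, not the real payload): parse_js_object finds the first object literal inside a string of JavaScript and turns it into a Python dict, which is exactly what this response needs.

import chompjs

# hypothetical payload shaped like the real one: a JS assignment wrapping a JSON-like object
js = 'var data = {"cityname": "上海", "temp": "14.2"};'
print(chompjs.parse_js_object(js))   # -> {'cityname': '上海', 'temp': '14.2'}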

9. Complete the Spider
import scrapy
import chompjs

class MySpider(scrapy.Spider):
    name = "temp_api"
    start_urls = [
        "http://sh.weather.com.cn/",
    ]

    def parse(self, response):
        # Pull the dynamically generated script URL (the sk_2d request) out of the page body
        next_url = response.css('script[src*=sk_2d]').attrib["src"]
        self.log(next_url)
        # The Referer header is required, otherwise the API refuses to serve the data
        headers = {'Referer': 'http://sh.weather.com.cn/'}
        yield scrapy.Request(next_url, callback=self.parse_script, headers=headers)

    def parse_script(self, response):
        # The response is a JS snippet; chompjs extracts the embedded object as a dict
        data = chompjs.parse_js_object(response.body.decode('utf-8'))
        yield data

Note that response.body is a bytes object here and has to be decoded before it can be used.
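As an aside of my own: Scrapy also provides response.text, which performs this decoding using the response's declared encoding, so the same line could arguably be written as:

data = chompjs.parse_js_object(response.text)   # response.text is the already-decoded body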

10. Verify

Run the spider and export the output to data.json:

guhui@guhuideMacBook-Pro temp_api % scrapy crawl temp_api -o data.json  
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: temp_api)
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Versions: lxml 5.2.0.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.0, w3lib 2.1.2, Twisted 24.3.0, Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)], pyOpenSSL 24.1.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform macOS-14.2.1-arm64-arm-64bit
2024-04-03 22:13:54 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-03 22:13:54 [asyncio] DEBUG: Using selector: KqueueSelector
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet Password: c00d3fa735530f6c
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-04-03 22:13:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'temp_api',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'temp_api.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['temp_api.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-03 22:13:54 [scrapy.core.engine] INFO: Spider opened
2024-04-03 22:13:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2024-04-03 22:13:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> from <GET http://sh.weather.com.cn/robots.txt>
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> (referer: None)
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 13 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 19 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 29 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 30 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 31 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 32 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 33 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 34 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 35 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 36 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 40 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 44 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 45 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 46 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 50 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 51 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 52 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 54 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 55 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 57 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 58 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 59 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 60 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 61 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 62 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 71 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 72 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 73 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 75 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 76 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 78 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 79 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 80 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 81 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 82 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 99 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 106 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 110 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 112 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 114 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 119 without any user agent to enforce it on.
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.weather.com.cn/> (referer: None)
2024-04-03 22:13:55 [temp_api] DEBUG: http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284
2024-04-03 22:13:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.weather.com.cn/contacts_api.html> from <GET http://d1.weather.com.cn/robots.txt>
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/contacts_api.html> (referer: None)
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 7 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284> (referer: http://sh.weather.com.cn/)
2024-04-03 22:13:56 [scrapy.core.scraper] DEBUG: Scraped from <200 http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284>
{'nameen': 'shanghai', 'cityname': '上海', 'city': '101020100', 'temp': '14.2', 'tempf': '57.6', 'WD': '西北风', 'wde': 'NW', 'WS': '1级', 'wse': '1km/h', 'SD': '84%', 'sd': '84%', 'qy': '1015', 'njd': '8km', 'time': '22:00', 'rain': '0', 'rain24h': '0', 'aqi': '26', 'aqi_pm25': '26', 'weather': '多云', 'weathere': 'Cloudy', 'weathercode': 'd01', 'limitnumber': '', 'date': '04月03日(星期三)'}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-03 22:13:56 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2024-04-03 22:13:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1485,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 65486,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 4,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 1.636783,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 4, 3, 14, 13, 56, 279235, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 11343,
 'httpcompression/response_count': 3,
 'item_scraped_count': 1,
 'log_count/DEBUG': 69,
 'log_count/INFO': 11,
 'memusage/max': 64684032,
 'memusage/startup': 64684032,
 'request_depth_max': 1,
 'response_received_count': 4,
 'robotstxt/request_count': 2,
 'robotstxt/response_count': 2,
 'robotstxt/response_status_count/200': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 4, 3, 14, 13, 54, 642452, tzinfo=datetime.timezone.utc)}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Spider closed (finished)
guhui@guhuideMacBook-Pro temp_api % 

Let's look at the data:

[
{"nameen": "shanghai", "cityname": "上海", "city": "101020100", "temp": "14.2", "tempf": "57.6", "WD": "西北风", "wde": "NW", "WS": "1级", "wse": "1km/h", "SD": "84%", "sd": "84%", "qy": "1015", "njd": "8km", "time": "22:00", "rain": "0", "rain24h": "0", "aqi": "26", "aqi_pm25": "26", "weather": "多云", "weathere": "Cloudy", "weathercode": "d01", "limitnumber": "", "date": "04月03日(星期三)"}
]

It matches what the page shows. Perfect.

  • Summary

First impressions of Scrapy are pretty good.
When it comes to how a crawler gets at its data, there seem to be two schools of thought.

  • The analytical, trace-it-back school


The essence of this school is tracing things back to their source: finding the request that actually produces the data.
Pros: "every grievance has its culprit, every debt its debtor"; tracing the data back to its origin is clear and rather satisfying.
Cons: the investigation is expensive, and large sites with anti-scraping defences make it especially challenging.

  • The "let it happen naturally" school

This school simulates the browser's natural behaviour: however a user would get the data, that's how the crawler gets it.
Pros: low investigation cost; you only need to search the DOM tree.
Cons: building a local browser environment is somewhat more expensive.

Whether the "natural" approach is actually any good, we won't know until we've tried it, so that's the topic for the next chapter.