Getting Started with Web Crawlers (Part 1): A First Look at Scrapy
soapgu commented
- Preface

Web crawlers have quite a reputation, and never having actually used one felt a bit behind the times, so I took the opportunity of a research task to play around with one.
First, a concrete goal: scrape the current temperature from a weather website.
- Technology selection

I went straight for Scrapy, currently the most popular Python crawling framework, without really knowing how good it is. Going by the numbers, with that many users the problems can't be too severe.
- Implementation
1. Install Scrapy
pip install Scrapy
2. Analyze the target page

Inspecting the page, the current temperature looks easiest to locate via class="sk-temp".

3. "Debug" the scrape from the command line

This is one of Scrapy's more practical tools: without writing a single line of code, you can fetch a page and experiment with extractions interactively, as a kind of live debugging.
Let's try it first, then talk about how it feels.
scrapy shell 'http://sh.weather.com.cn/'
>>> response.css('p.sk-temp').get()
'<p class="sk-temp"><span></span><em>℃</em></p>'
The structure is there, but where is the data?
Calling view(response) opens the downloaded page in a browser for a preview.
The problem: the data is fetched dynamically, and the static page does not contain it.
4. Hit a problem

There seem to be only two ways forward:
- Execute the page's JavaScript in a locally simulated browser and then scrape the result. The scrapy-splash middleware covers this, but the setup looks non-trivial, so shelve it for now.
- Capture the underlying network requests and hit the API directly.
I chose to try the API route first, and immediately ran into obstacles:
- the request route is hard to pin down
- the parameters are hard to pin down
- the cookies follow no obvious pattern
Brute force clearly is not going to work!
5. I was fooled!

The cookie values were a red herring!
The real reason I could not fetch the data outside the page was the missing Referer header.
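To sanity-check the Referer hypothesis outside Scrapy, a request can be built with the header attached. This is just a sketch using the standard library; the URL is the one observed in the page source and the trailing timestamp parameter may vary:

```python
import urllib.request

# The data endpoint observed in the page source (the "_" timestamp query may change)
url = "http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284"

# Without this header the server returns nothing usable; with it, the dataSK payload comes back
req = urllib.request.Request(url, headers={"Referer": "http://sh.weather.com.cn/"})

# urllib.request.urlopen(req) would now fetch the JavaScript payload
print(req.get_header("Referer"))  # → http://sh.weather.com.cn/
```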
6. Create the crawler project
scrapy startproject temp_api
7. Find the dynamic request URL

The API request URL is embedded in the main HTML, and there is no guarantee it stays fixed, so for stability it should be extracted from the body dynamically.
After a round of trial and error:
>>> response.css('script[src*=sk_2d]').get()
'<script type="text/javascript" src="http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284"></script>'
>>> response.css('script[src*=sk_2d]').attrib["src"].get()
Traceback (most recent call last):
File "<console>", line 1, in <module>
AttributeError: 'str' object has no attribute 'get'
>>> response.css('script[src*=sk_2d]').attrib["src"]
'http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284'
>>> quit()
So the working extraction is response.css('script[src*=sk_2d]').attrib["src"].
The sk_2d token in the script URL is the key: it makes the tag relatively easy to identify.
8. Parse the data from the second request and complete the Spider
import scrapy
import chompjs

class MySpider(scrapy.Spider):
    name = "temp_api"
    start_urls = [
        "http://sh.weather.com.cn/",
    ]

    def parse(self, response):
        next_url = response.css('script[src*=sk_2d]').attrib["src"]
        self.log(next_url)
        headers = {'Referer': 'http://sh.weather.com.cn/'}
        yield scrapy.Request(next_url, callback=self.parse_script, headers=headers)

    def parse_script(self, response):
        data = chompjs.parse_js_object(response.body.decode('utf-8'))
        yield data
Note that response.body is a bytes object and must be decoded before use.
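What chompjs does here is pull a JavaScript object literal out of the response text. For this particular payload, a single var dataSK = {...} assignment of plain JSON, a stdlib-only stand-in illustrates the idea (chompjs remains the more robust choice, since it also handles JavaScript literals that are not valid JSON):

```python
import json

# A shortened stand-in for the endpoint's response body (bytes, like response.body)
body = b'var dataSK = {"cityname":"\xe4\xb8\x8a\xe6\xb5\xb7","temp":"14.2","time":"22:00"};'

text = body.decode("utf-8")          # decode the bytes first
start = text.index("{")              # locate the object literal...
end = text.rindex("}") + 1
data = json.loads(text[start:end])   # ...and parse it as JSON
print(data["temp"])                  # → 14.2
```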
9. Verify

Run the spider and export the result to data.json:
guhui@guhuideMacBook-Pro temp_api % scrapy crawl temp_api -o data.json
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: temp_api)
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Versions: lxml 5.2.0.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.0, w3lib 2.1.2, Twisted 24.3.0, Python 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)], pyOpenSSL 24.1.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform macOS-14.2.1-arm64-arm-64bit
2024-04-03 22:13:54 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-03 22:13:54 [asyncio] DEBUG: Using selector: KqueueSelector
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet Password: c00d3fa735530f6c
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2024-04-03 22:13:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'temp_api',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'temp_api.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['temp_api.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-03 22:13:54 [scrapy.core.engine] INFO: Spider opened
2024-04-03 22:13:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2024-04-03 22:13:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> from <GET http://sh.weather.com.cn/robots.txt>
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> (referer: None)
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
... (some fifty more protego "Rule at line N without any user agent" DEBUG lines omitted) ...
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.weather.com.cn/> (referer: None)
2024-04-03 22:13:55 [temp_api] DEBUG: http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284
2024-04-03 22:13:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.weather.com.cn/contacts_api.html> from <GET http://d1.weather.com.cn/robots.txt>
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/contacts_api.html> (referer: None)
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 7 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284> (referer: http://sh.weather.com.cn/)
2024-04-03 22:13:56 [scrapy.core.scraper] DEBUG: Scraped from <200 http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284>
{'nameen': 'shanghai', 'cityname': '上海', 'city': '101020100', 'temp': '14.2', 'tempf': '57.6', 'WD': '西北风', 'wde': 'NW', 'WS': '1级', 'wse': '1km/h', 'SD': '84%', 'sd': '84%', 'qy': '1015', 'njd': '8km', 'time': '22:00', 'rain': '0', 'rain24h': '0', 'aqi': '26', 'aqi_pm25': '26', 'weather': '多云', 'weathere': 'Cloudy', 'weathercode': 'd01', 'limitnumber': '', 'date': '04月03日(星期三)'}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-03 22:13:56 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2024-04-03 22:13:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1485,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 65486,
'downloader/response_count': 6,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 1.636783,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 4, 3, 14, 13, 56, 279235, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 11343,
'httpcompression/response_count': 3,
'item_scraped_count': 1,
'log_count/DEBUG': 69,
'log_count/INFO': 11,
'memusage/max': 64684032,
'memusage/startup': 64684032,
'request_depth_max': 1,
'response_received_count': 4,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2024, 4, 3, 14, 13, 54, 642452, tzinfo=datetime.timezone.utc)}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Spider closed (finished)
guhui@guhuideMacBook-Pro temp_api %
Looking at the data:
[
{"nameen": "shanghai", "cityname": "上海", "city": "101020100", "temp": "14.2", "tempf": "57.6", "WD": "西北风", "wde": "NW", "WS": "1级", "wse": "1km/h", "SD": "84%", "sd": "84%", "qy": "1015", "njd": "8km", "time": "22:00", "rain": "0", "rain24h": "0", "aqi": "26", "aqi_pm25": "26", "weather": "多云", "weathere": "Cloudy", "weathercode": "d01", "limitnumber": "", "date": "04月03日(星期三)"}
]
Consistent with what the page shows. Perfect.
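Consuming the exported feed is then a one-liner per field. A sketch, with the data.json content produced above inlined for self-containment:

```python
import json

# Contents of data.json as produced by `scrapy crawl temp_api -o data.json` (abridged)
feed = '[{"cityname": "上海", "temp": "14.2", "weather": "多云", "time": "22:00"}]'

items = json.loads(feed)
current = items[0]
print(f'{current["cityname"]} {current["time"]}: {current["temp"]}℃, {current["weather"]}')
```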
- Summary

First impressions of Scrapy: pretty good.
On how to get at the data, crawling seems to split into two schools of thought.
- The analytical school
The essence of this school is tracing things back to the source: finding where the data really comes from.
Pros: in the spirit of "every wrong has its author and every debt its debtor", provenance is clear and the data trail is easy to audit.
Cons: high investigation cost, especially challenging on large sites with anti-crawling measures.
- The "let it happen naturally" school
This school "simulates" what a browser does: the data is fetched exactly the way a real user's browser would fetch it.
Pros: low investigation cost; you only need to search the DOM tree.
Cons: standing up a local browser environment is more expensive.
Whether the "natural" approach is actually any good can only be known by trying it, so that is the subject of the next installment.