Scrapy Crawler Practice
`xxx` marks a placeholder that you need to replace with your own value.
`projectName` is the project name; change it according to your own needs. Once created, the project directory is named after the project.
scrapy startproject projectName
Create a spider from the default template. Here `example` is the spider name and `example.com` seeds `start_urls`:
cd projectName
scrapy genspider example example.com
Note: `example` must not be the same as the project name (`projectName`).
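After `genspider` runs, the generated example.py looks roughly like the sketch below (the exact template varies slightly between Scrapy versions); the examples later in this document simply fill in the `parse()` method:

```python
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'                      # spider name used by `scrapy crawl`
    allowed_domains = ['example.com']     # domains the spider is allowed to crawl
    start_urls = ['http://example.com/']  # first pages to request

    def parse(self, response):
        # Called with the response of each start URL; yield items or new requests here
        pass
```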
Edit items.py
Edit example.py
Edit settings.py
Edit pipelines.py
Run the following command in the terminal:
scrapy crawl example --nolog
Note:
- Be sure to run this from inside the projectName directory, i.e., the spider project directory created above!
- **The name after `crawl` is the spider name defined inside the class (its `name` attribute), not the project name and not the class name!**
Reference article: 编写第一个scrapy蜘蛛
Create the project:
scrapy startproject dictum
Create the spider:
cd .\dictum\
scrapy genspider dictum_spider lab.scrapyd.cn
Edit the dictum_spider.py file; see the project for the code.
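The actual code lives in the project; for orientation only, a minimal sketch of what dictum_spider.py might look like is shown below. It assumes the quote pages of lab.scrapyd.cn use a `div.quote` / `span.text` / `small.author` structure similar to quotes.toscrape.com; treat the selectors as placeholders and check the real markup:

```python
# -*- coding: utf-8 -*-
import scrapy


class DictumSpider(scrapy.Spider):
    name = 'dictum_spider'
    allowed_domains = ['lab.scrapyd.cn']
    start_urls = ['http://lab.scrapyd.cn/']

    def parse(self, response):
        # Selectors below are assumptions; adjust them to the actual page markup
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Follow the "next page" link if one exists (assumed selector)
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```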
Reference article: Scrapy简单入门及实例讲解
scrapy startproject meiju_new100
cd .\meiju_new100\
scrapy genspider meiju meijutt.com
import scrapy


class MeijuItem(scrapy.Item):
    rank = scrapy.Field()
    name = scrapy.Field()
# -*- coding: utf-8 -*-
import scrapy
from ..items import MeijuItem


class MeijuSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['meijutt.com']
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        # Each entry of the "new 100" list is an <li> inside the top-list <ul>
        movies = response.xpath('//ul[@class="top-list fn-clear"]/li')
        rank = 1
        for each_movie in movies:
            item = MeijuItem()
            item['rank'] = str(rank)
            item['name'] = each_movie.xpath('./h5/a/@title').extract()[0]
            rank = rank + 1
            yield item
ITEM_PIPELINES = {
    'meiju_new100.pipelines.MeijuNew100Pipeline': 100
}
class MeijuNew100Pipeline(object):
    def process_item(self, item, spider):
        # Append each item as a tab-separated line
        with open("meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(item['rank'] + '\t' + item['name'] + '\n')
        return item
scrapy crawl meiju
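As an aside, output this simple does not strictly need a custom pipeline: Scrapy's built-in feed exports can dump all yielded items directly, e.g. `scrapy crawl meiju -o meiju.json` or `scrapy crawl meiju -o meiju.csv`.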
Reference article: Scrapy简单入门及实例讲解
scrapy startproject xiaohua_picture
cd .\xiaohua_picture\
scrapy genspider xiaohua xiaohuar.com
import scrapy


class XiaohuaPictureItem(scrapy.Item):
    addr = scrapy.Field()
    name = scrapy.Field()
# -*- coding: utf-8 -*-
import scrapy
from ..items import XiaohuaPictureItem


class XiaohuaSpider(scrapy.Spider):
    # Spider name, must be unique
    name = 'xiaohua'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xiaohuar.com']
    # Initial URL
    start_urls = ['http://www.xiaohuar.com/list-1-1.html']

    def parse(self, response):
        # Each picture sits inside a div with class "item_t"
        all_pics = response.xpath('//div[@class="item_t"]')
        for pic in all_pics:
            # For each picture, extract its title and image address
            item = XiaohuaPictureItem()
            name = pic.xpath('./div[@class="title"]/span/a/text()').extract()[0]
            addr = pic.xpath('./div[@class="img"]/a/img/@src').extract()[0]
            addr = 'http://www.xiaohuar.com' + addr
            item['name'] = name
            item['addr'] = addr
            yield item
ITEM_PIPELINES = {
    'xiaohua_picture.pipelines.XiaohuaPicturePipeline': 100
}
# -*- coding: utf-8 -*-
import os
import urllib.request


class XiaohuaPicturePipeline(object):
    def process_item(self, item, spider):
        # Download the image with a browser-like User-Agent and save it to disk
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
        req = urllib.request.Request(url=item['addr'], headers=headers)
        res = urllib.request.urlopen(req)
        file_name = os.path.join(r'F:\xiaohua-pictures', item['name'] + '.jpg')
        with open(file_name, 'wb') as fp:
            fp.write(res.read())
        return item
scrapy crawl xiaohua
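Downloading each image synchronously with urllib inside the pipeline works, but it blocks item processing during every download and assumes the target directory (F:\xiaohua-pictures) already exists. Scrapy also ships an ImagesPipeline that downloads images asynchronously; a rough sketch of switching to it (it requires the Pillow package, and the field names `image_urls`/`images` are the defaults it expects):

```python
# settings.py (sketch): enable the built-in images pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = r'F:\xiaohua-pictures'  # where downloaded images are stored (path is illustrative)

# items.py (sketch): ImagesPipeline reads image_urls and fills in images
# class XiaohuaPictureItem(scrapy.Item):
#     name = scrapy.Field()
#     image_urls = scrapy.Field()  # list of image URLs to download
#     images = scrapy.Field()      # populated by the pipeline with download results
```

In the spider, the item would then carry `item['image_urls'] = [addr]` instead of `item['addr'] = addr`.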
# -*- coding: utf-8 -*-
import scrapy
from ..items import XiaohuaPictureItem


class XiaohuaSpider(scrapy.Spider):
    # Spider name, must be unique
    name = 'xiaohua'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xiaohuar.com']
    # Initial URL
    start_urls = ['http://www.xiaohuar.com/list-1-1.html']
    # Set used to remember list-page URLs that have already been scheduled
    url_set = set()

    def parse(self, response):
        if response.url.startswith("http://www.xiaohuar.com/list-"):
            # Each picture sits inside a div with class "item_t"
            all_pics = response.xpath('//div[@class="item_t"]')
            for pic in all_pics:
                # For each picture, extract its title and image address
                item = XiaohuaPictureItem()
                name = pic.xpath('./div[@class="title"]/span/a/text()').extract()[0]
                addr = pic.xpath('./div[@class="img"]/a/img/@src').extract()[0]
                addr = 'http://www.xiaohuar.com' + addr
                item['name'] = name
                item['addr'] = addr
                yield item

        # Collect every link on the page
        urls = response.xpath("//a/@href").extract()
        for url in urls:
            # Follow a link only if it starts with http://www.xiaohuar.com/list-
            # and has not been seen before
            if url.startswith("http://www.xiaohuar.com/list-"):
                if url in XiaohuaSpider.url_set:
                    pass
                else:
                    XiaohuaSpider.url_set.add(url)
                    yield scrapy.Request(url, callback=self.parse)
                    # yield self.make_requests_from_url(url)  # make_requests_from_url() is deprecated
            else:
                pass
scrapy crawl xiaohua
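The url_set above de-duplicates the list-page links by hand (Scrapy's scheduler also filters duplicate requests on its own by default). An alternative way to express "follow every /list- link and parse each page" is a CrawlSpider with a LinkExtractor rule; a rough sketch, reusing the selectors from the spider above (the spider name `xiaohua_crawl` is made up to avoid clashing with `xiaohua`):

```python
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import XiaohuaPictureItem


class XiaohuaCrawlSpider(CrawlSpider):
    name = 'xiaohua_crawl'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/list-1-1.html']

    # Follow every link whose URL contains /list- and pass each page to parse_list
    rules = (
        Rule(LinkExtractor(allow=r'/list-'), callback='parse_list', follow=True),
    )

    def parse_list(self, response):
        for pic in response.xpath('//div[@class="item_t"]'):
            item = XiaohuaPictureItem()
            item['name'] = pic.xpath('./div[@class="title"]/span/a/text()').extract()[0]
            item['addr'] = 'http://www.xiaohuar.com' + pic.xpath('./div[@class="img"]/a/img/@src').extract()[0]
            yield item
```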
The initial steps are the same as above; below are just the points that differ.
Edit the pipelines.py file:
import json


class LastampaPipeline(object):
    def process_item(self, item, spider):
        # Append each item as one JSON line to a file named after the spider
        with open('../lastampa-data/lastampa_it_' + spider.name + '.json', 'a', encoding='utf-8') as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            f.flush()
        return item
Edit the settings.py file and enable this pipeline in the settings:
# Enable the item pipeline that writes items to file
ITEM_PIPELINES = {
    'lastampa.pipelines.LastampaPipeline': 100
}
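Building on the `-o` note from the meiju example, newer Scrapy versions can also produce this JSON-lines file through the FEEDS setting instead of a custom pipeline; a sketch (the exact setting syntax is version-dependent, so check the docs for the version in use):

```python
# settings.py (sketch, newer Scrapy): write items as JSON lines, one file per spider
FEEDS = {
    '../lastampa-data/lastampa_it_%(name)s.json': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    },
}
```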
Edit the settings.py file and add the following code:
# Pick a User-Agent header at random
import random

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
USER_AGENT = random.choice(USER_AGENT_LIST)
Note: the random module must be imported in settings.py (import random), as shown at the top of the snippet above.
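Also note that USER_AGENT is evaluated only once, when settings.py is loaded, so this picks one random User-Agent per crawl run rather than per request. If a different header per request is wanted, a custom downloader middleware is the usual approach; a minimal sketch (the class name RandomUserAgentMiddleware and the module path are made up for illustration):

```python
# middlewares.py (sketch): set a random User-Agent on every outgoing request
import random


class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENT_LIST defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
```

It would then be enabled in settings.py via `DOWNLOADER_MIDDLEWARES = {'lastampa.middlewares.RandomUserAgentMiddleware': 400}` (the priority value 400 is arbitrary).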
cd .\lastampa\
cd .\lastampa\
scrapy crawl xxx
Note:
- You need to `cd` twice so that the relative path the pipeline writes to is correct; if your layout differs, adjust the path accordingly (ignore this if you use an absolute path).
- Here `xxx` is the `name` defined in the spider file.