美图日网站图片爬虫。嘿嘿嘿,你懂的!
截止 2019-08-03,已累计爬取 88 个专业摄影机构,28832 个妹纸图册,1656140 张小姐姐的图片!!
截止 2019-06-15,更新至 27864
截止 2019-06-21,更新至 28467
(新增 612)截止 2019-08-03,更新至 28879
(新增 412)
注意:近日侦测到访问美图日域名报 443:
遂开始逐步寻找使用 http://www.tujidao.com/ 替代:
哼哼,先给目标网站来个简单的介绍:
美女写真机构 - 国内外各类美女写真机构厂商及写真集大全 - 美图日
1.1 网站中随意选取一张图片进行,发现图片真实路径为 https://ii.hywly.com/a/1/
+ 图册编号
+ /
+ 图片编号
+ .jpg
https://ii.hywly.com/a/1/27691/2.jpg
1.2 图册总数很多,不同分类下的图册可能重复,那么实际数量究竟有多少呢?由于图册编号是全局唯一的,因此可以通过请求一遍封面图来获取整个网站的图册数。
之所以选择封面图的另一个原因是因为封面图的分辨率为 249 * 375
,相比于高清图(1800 * 2700
),非常的小巧,高效!
@PostMapping("/step1")
public String step1() {
final String LOCAL_FOLDER = "F:/图片爬虫/封面图/";
for (int i = 28468 i <= 35000; i++) {
String onlinePath = MEITURI_IMG_URL_PREFIX + i + "/0.jpg";
String localPath = i + "-0.jpg";
String filePath = LOCAL_FOLDER + localPath;
DownloadUtil.downloadPicture(onlinePath, filePath);
}
return "success";
}
封面图数量:
可以看到,截止到 2019-06-15,图册总数为 27859
!由于网站的图片不定期地进行更新迭代,因此随着时间的增长,图册的数量会进一步增加。
1.3 图册问题解决了后,下一个问题是各图册对应的图片数问题。我们以相册为模型构造实体类。
@Entity
@Table(name = "tbl_meituri_album2")
public class AlbumDO {
/**
* ID 主键
*/
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
@Column(name = "id")
private Long id;
/**
* 编号
* eg.27691
*/
@Column(name = "number")
private Integer number;
/**
* 总数
* eg.58(0~58)
*/
@Column(name = "total")
private Integer total;
/**
* 标题(文件夹名的一部分)
* eg.杨晨晨sugar《蕾丝控福利》 [语画界XIAOYU] Vol.051 写真集
*/
@Column(name = "title")
private String title;
/**
* 所属机构
*
* @see InstitutionTypeEnum
*/
@Column(name = "institution_type")
private Integer institutionType;
// Getter and Setter……
}
Step2: 获取真正能访问的图册,并使用(Jsoup)获取各相册的图片数量,持久化到数据库(MySQL)中。
Step1 中获得的 27859 个相册并不都能正常访问。且由于下载业务需要知道各相册的图片数量以便进行遍历访问(通过暴力请求需要处理异常,效率低下,且容易由于网络波动而导致循环断掉)。
@PostMapping("/step2")
public String step2() {
// 当前相册最小编号
final int ALBUM_MIN = 27865;
// 当前相册最大编号
final int ALBUM_MAX = 28467;
// 用于提取单个相册图片总数的正则
final Pattern p = Pattern.compile("\\d+P");
for (int i = ALBUM_MIN; i <= ALBUM_MAX; i++) {
if (albumJpaDAO.findByNumberEquals(i) != null) {
logger.warn("==>step2() i={} 记录已存在,不再重复记录", i);
continue;
}
Document document;
try {
document = Jsoup.connect(MEITURI_URL_PREFIX + i).get();
} catch (IOException e) {
// 打印 ERROR 级别 log 以便人工介入确认不能访问的真正原因。
logger.error("==>step2() i={} 需人工介入确认不能访问的真正原因", i);
// 请求失败直接进入下个循环
continue;
}
String title = document.title();
Elements elements = document.getElementsByTag("p");
String elementsStr = elements.text();
// 分析获取单个相册的图片总数
Matcher matcher = p.matcher(elementsStr);
int total = 0;
if (matcher.find()) {
total = Integer.parseInt(matcher.group(0).replace("P", ""));
}
// 持久化到数据库,注意此处并没有根据 InstitutionTypeEnum 枚举进行 institutionType 的赋值,
// 这一步将涉及数据清洗,较为繁琐,代码未给出
AlbumDO albumDO = new AlbumDO();
albumDO.setNumber(i);
albumDO.setTotal(total);
albumDO.setTitle(title);
albumJpaDAO.save(albumDO);
// 打印 log,可省略
logger.info("==>step2() albumDO.getNumber={} albumDO.getTotal={} albumDO.getTitle={}", albumDO.getNumber(), albumDO.getTotal(), albumDO.getTitle());
}
return "success";
}
结果:{555, 2567, 2568, 2578, 2684, 4359, 4375, 4398, 5237, 5254,
5259, 7244, 7457, 7489, 8188, 8279, 8350, 8375, 12101, 12118,
12160, 12930, 13074, 14944, 15613, 16559, 16683, 17728, 19385, 21688,
22449, 22565, 23376, 23427, 23983, 24063, 24083, 24197, 24290, 27271,
27272, 27273}; 共 42 个相册不能正常访问。
不能正常访问的相册:
存在两个机构的相册:
MySQL 截图
SQL 数据清洗:
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '57' WHERE `title` like '%Beautyleg%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '49' WHERE `title` like '%丽柜%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '31' WHERE `title` like '%克拉女神%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '62' WHERE `title` like '%尤果圈爱尤物%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '63' WHERE `title` like '%尤果%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '73' WHERE `title` like '%尤物馆%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '12' WHERE `title` like '%异思趣向%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '18' WHERE `title` like '%LOVEPOP%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '17' WHERE `title` like '%Digi%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '35' WHERE `title` like '%Minisuka%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '85' WHERE `title` like '%语画界%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '23' WHERE `title` like '%花漾%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '78' WHERE `title` like '%**正妹%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '78' WHERE `title` like '%**女神%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '78' WHERE `title` like '%**美女%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '67' WHERE `title` like '%嗲囡囡%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '45' WHERE `title` like '%头条女神%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '58' WHERE `title` like '%尤蜜荟%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '59' WHERE `title` like '%秀人%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '87' WHERE `title` like '%萝莉COS%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '86' WHERE `title` like '%风之领域%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '72' WHERE `title` like '%魅妍社%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '80' WHERE `title` like '%Cosdoki%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '28' WHERE `title` like '%Sabra%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '33' WHERE `title` like '%WPB-net%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '70' WHERE `title` like '%模范学院%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '88' WHERE `title` like '%丝意%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '74' WHERE `title` like '%美媛馆%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '82' WHERE `title` like '%森萝财团%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '55' WHERE `title` like '%御女郎%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '15' WHERE `title` like '%星颜社%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '68' WHERE `title` like '%爱蜜社%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '30' WHERE `title` like '%瑞丝馆%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '34' WHERE `title` like '%网红馆%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '46' WHERE `title` like '%爱秀ISHOW%' and institution_type is null;
UPDATE `dev`.`tbl_meituri_album2` SET `institution_type` = '88' WHERE `title` like '%阳光宝贝%' and institution_type is null;
通过对应的 GET 请求参数下载对应的图册。达到手动调节线程数的效果。为了避免重复请求导致重复下载,该接口做了幂等处理。
private void doBatchDownload(List<AlbumDO> albumDOList) {
for (AlbumDO albumDO : albumDOList) {
int total = albumDO.getTotal();
int num = albumDO.getNumber();
String title = albumDO.getTitle();
String fileFolder = InstitutionTypeEnum.getEnumBySeq(albumDO.getInstitutionType()).getDesc();
String localFolder = MEITURI_LOCAL_PREFIX + fileFolder + "/" + num + "-" + title;
// 若文件夹路径不存在,则新建
File file = new File(localFolder);
if (!file.exists()) {
if (!file.mkdirs()) {
logger.error("==>number={} title={} 创建文件路径失败", num, title);
}
}
for (int i = 0; i <= total; i++) {
String onlinePath = MEITURI_IMG_URL_PREFIX + num + "/" + i + ".jpg";
String localPath = localFolder + "/" + i + ".jpg";
// 幂等,若当前文件未下载,则进行下载
File file2 = new File(localPath);
if (!file2.exists()) {
DownloadUtil.downloadPicture(onlinePath, localPath);
}
}
}
}
spring boot 配置文件参考:
server.port=8088
spring.datasource.url=jdbc:mysql://localhost:3306/dev?useUnicode=true&characterEncoding=UTF-8
spring.datasource.username=root
spring.datasource.password=Mysql@2019
spring.datasource.driverClassName=com.mysql.cj.jdbc.Driver
MySQL 统计:
Step2 统计得出能正常访问的相册共 27817 个,图片总数为 1602076 张。