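Today I spent some time scraping the popular folk (民谣) playlists on NetEase Cloud Music (网易云音乐), 1,500 playlists in total. I will crawl the other categories later when I have time. The Java crawling process is recorded below.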
1. First, scrape the URL and title of each playlist
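// Requires Apache HttpClient 4.x (org.apache.http.*) and jsoup (org.jsoup.*).
import java.io.IOException;
import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch one playlist listing page and print each playlist's title and URL.
public static void DoPachong(String url_str, String charset) throws ClientProtocolException, IOException {
    HttpClient hc = new DefaultHttpClient();
    HttpGet hg = new HttpGet(url_str);
    HttpResponse response = hc.execute(hg);
    HttpEntity entity = response.getEntity();
    InputStream htm_in = null;
    if (entity != null) {
        htm_in = entity.getContent();
        String htm_str = InputStream2String(htm_in, charset);
        Document doc = Jsoup.parse(htm_str);
        // Narrow down to the cover <div> of every playlist in the listing grid.
        Elements links = doc.select("div[class=g-bd]").select("div[class=g-wrap p-pl f-pr]").select("ul[class=m-cvrlst f-cb]").select("div[class=u-cover u-cover-1]");
        for (Element link : links) {
            Elements lin = link.select("a");
            String re_url = lin.attr("href");     // relative path of the playlist page
            String re_title = lin.attr("title");  // playlist title
            re_url = "http://music.163.com" + re_url;
            System.out.print(re_title + " ");
            System.out.print(re_url + " ");
            // Visit the playlist page itself to read its play count.
            SecondPaChong(re_url, charset);
        }
    }
}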
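The method above calls `InputStream2String`, which this post does not show. Below is a minimal sketch of what such a helper might look like, assuming it only needs to read the whole stream and decode it with the given charset. The class name `StreamUtil` and the method body are hypothetical stand-ins, not the original implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamUtil {
    // Hypothetical stand-in for the InputStream2String helper used in this post.
    // Reads every byte from the stream, closes it, and decodes the bytes
    // with the named charset (for example "utf-8").
    static String InputStream2String(InputStream in, String charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        try {
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } finally {
            in.close();
        }
        // Decode once at the end so a multi-byte character is never split
        // across two chunks.
        return buffer.toString(charset);
    }
}
```

Collecting raw bytes first and decoding them in one step is a deliberate choice here: decoding chunk by chunk could corrupt Chinese text whenever a character straddles a chunk boundary.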
2. Using each scraped URL, parse the playlist's play count with jsoup
// Fetch a single playlist page and print its play count.
public static void SecondPaChong(String url_str, String charset) throws ClientProtocolException, IOException {
    HttpClient hc = new DefaultHttpClient();
    HttpGet hg = new HttpGet(url_str);
    HttpResponse response = hc.execute(hg);
    HttpEntity entity = response.getEntity();
    InputStream htm_in = null;
    if (entity != null) {
        htm_in = entity.getContent();
        String htm_str = InputStream2String(htm_in, charset);
        Document doc = Jsoup.parse(htm_str);
        // The play count is the text of the <strong> element in the title bar.
        String links = doc.select("div[class=u-title u-title-1 f-cb]").select("div[class=more s-fc3]").select("strong").text();
        System.out.println(links);
    }
}
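Each call to `DoPachong` handles one listing page, so covering all 1,500 playlists needs a driver that walks through the pages. The sketch below only builds the page URLs. It assumes the listing is paginated with `limit` and `offset` query parameters and 35 playlists per page; that URL pattern, the page size, and the names `PlaylistPages` and `pageUrls` are assumptions for illustration and should be checked against the live site.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class PlaylistPages {
    // Build one listing URL per page for the given category.
    // ASSUMPTION: the query parameters (order, cat, limit, offset) are a guess
    // at the site's pagination scheme, not something stated in this post.
    static List<String> pageUrls(String category, int total, int pageSize)
            throws UnsupportedEncodingException {
        // Non-ASCII category names such as "民谣" must be percent-encoded.
        String cat = URLEncoder.encode(category, "UTF-8");
        List<String> urls = new ArrayList<>();
        for (int offset = 0; offset < total; offset += pageSize) {
            urls.add("http://music.163.com/discover/playlist/?order=hot&cat=" + cat
                    + "&limit=" + pageSize + "&offset=" + offset);
        }
        return urls;
    }
}
```

Under these assumptions, `pageUrls("民谣", 1500, 35)` yields 43 URLs, each of which could be passed to `DoPachong(url, "utf-8")` in a loop, ideally with a short pause between requests.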
Top 10 folk playlists by play count:
- 如果你想听民谣,可以从这些歌曲开始。 Play count: 11548417
- 民谣是最安静的角落 Play count: 10727168
- 孤独旅人配民谣。 Play count: 9946952
- 你若听过他的歌,此生便有了挂念 Play count: 7551374
- ♬女生嘛,污一点才可爱 Play count: 6260712
- 阅尽沧桑,洗却铅华:聆听那些沧桑之声 Play count: 5793889
- 民谣,成长中的情绪共谋者 Play count: 5368672
- 华语女声‖那些入耳入心的代表曲 Play count: 4535668
- 啤酒邂逅音乐之华语摇滚 Play count: 4449337
- **民谣精选集 Play count: 4423420
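The crawler code in this post only prints titles and play counts; the sorting step behind a ranking like the one above is not shown. One way it could be done, as a rough sketch: collect each title and count into a list, sort by count in descending order, and keep the first ten. The names `PlaylistRank`, `Entry`, `parsePlays`, and `topN` are hypothetical, and the sketch assumes the scraped play-count text contains only digits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PlaylistRank {
    // One scraped playlist: its title and its play count.
    static class Entry {
        final String title;
        final long plays;
        Entry(String title, long plays) {
            this.title = title;
            this.plays = plays;
        }
    }

    // Convert the scraped play-count text to a number.
    // ASSUMPTION: the text is plain digits, possibly with surrounding spaces.
    static long parsePlays(String text) {
        return Long.parseLong(text.trim());
    }

    // Return the n entries with the highest play counts, highest first.
    // The input list is left unmodified.
    static List<Entry> topN(List<Entry> all, int n) {
        List<Entry> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.plays).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

To use it, `SecondPaChong` would need to return the play-count string instead of printing it, so that `DoPachong` can add a `new Entry(re_title, parsePlays(...))` to a shared list before calling `topN(list, 10)`.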