Question about search-keyword-based news crawling

Question

cogud0908 opened this issue 7 years ago · 9 comments

I am crawling with the code from https://github.com/forkonlp/N2H4/wiki/%EA%B2%80%EC%83%89%EC%96%B4-%EA%B8%B0%EB%B0%98-%EB%89%B4%EC%8A%A4-%EC%88%98%EC%A7%91
CSV files are created as the crawl runs, but every file that gets created
has the same content, so I am asking here.
What could be going wrong?

Answer 1 · 2018-02-08T15:24:42.000Z

When you say the content is the same each time, do you mean that files with different names keep being created but their contents are identical?

Answer 2 · 2018-02-09T00:39:16.000Z

Yes. Also, I want to crawl data from earlier dates, but it seems to keep fetching only the latest news relative to the current date.

Answer 3 · 2018-02-09T01:15:22.000Z

I'll look into it. Thank you.

Answer 4 · 2018-02-09T02:28:05.000Z

Thank you.

Answer 5 · 2018-02-10T05:49:50.000Z

It looks like the search query does not support spaces, so that may be the cause.
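
If spaces in the query are in fact the problem, one thing to try is percent-encoding the query before pasting it into the search URL. The following is a minimal sketch using base R's utils::URLencode. The helper name encode_query is hypothetical, and whether the Naver search endpoint then returns the expected results is an assumption that this thread does not confirm.

```r
# Hypothetical helper: percent-encode a search query for use in a URL.
encode_query <- function(q) {
  # Convert to UTF-8 first so non-ASCII text is encoded as UTF-8 bytes,
  # matching the ie=utf8 parameter used in the search URL.
  utils::URLencode(enc2utf8(q), reserved = TRUE)
}

encode_query("elderly suicide")   # "elderly%20suicide"

# A Korean query containing a space is encoded the same way.
query <- encode_query("노인 자살")
```

The encoded value would then replace qlist[i] when the search URL is assembled with paste0().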

Answer 6 · 2018-02-10T07:23:28.000Z

Could you show the entire code that you ran?

Answer 7 · 2018-02-10T08:26:45.000Z

install.packages("selectr")
install.packages("xml2")
library(curl)
library(rvest)

if (!require("devtools")) install.packages("devtools")
devtools::install_github("forkonlp/N2H4")
library(N2H4)

options(stringsAsFactors = F)

success <- function(res){
  cat("Request done! Status:", res$status, "\n")
  #res$content<-iconv(rawToChar(res$content),from="CP949",to="UTF-8")
  res$content<-rawToChar(res$content)
  data <<- c(data, list(res))
}
failure <- function(msg){
  cat("Oh noes! Request failed!", msg, "\n")
}

strDate<-as.Date("2001-01-02")
endDate<-as.Date("2005-12-31")
strTime<-Sys.time()
midTime<-Sys.time()

qlist<-c("노인자살")
for (i in 1:length(qlist)){
  dir.create("./data",showWarnings=F)
  dir.create(paste0("./data/news_",qlist[i]),showWarnings=F)

  for (date in strDate:endDate){
    date<-as.character(as.Date(date,origin = "1970-01-01"))
    dateo<-gsub("-",".",date)
    dated<-gsub("-","",date)
    print(paste0(date," / ",qlist[i], "/ start Time: ", strTime," // spent Time at first: ", Sys.time()-strTime))
    midTime<-Sys.time()
    pageUrli<-paste0("https://search.naver.com/search.naver?where=news&query=",qlist[i],"&ie=utf8&sm=tab_srt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=",dateo,"&de=",dateo,"&docid=&nso=so%3Ar%2Cp%3Afrom",dated,"to",dated,"%2Ca%3Aall&mynews=0&mson=0&refresh_start=0&related=0")
    trym<-0
    max<-try(getMaxPageNum(pageUrli, search=T), silent = T)
    while(trym<=5&&class(max)=="try-error"){
      max<-try(getMaxPageNum(pageUrli, search=T), silent = T)
      Sys.sleep(abs(rnorm(1)))
      trym<-trym+1
      print(paste0("try again max num: ",pageUrli))
    }
    if(max=="no result"){
      print("no naver news links this time")
      next
    }
    for (pageNum in 1:max){
      start<-(pageNum-1)*10+1
      print(paste0(date," / ",qlist[i], "/ start Time: ", strTime," / spent Time at first: ", Sys.time()-strTime))
      midTime<-Sys.time()
      pageUrl<-paste0(pageUrli,"&start=",start)
      tryp<-0
      newsList<-try(getUrlListByQuery(pageUrl), silent = T)
      while(tryp<=5&&class(newsList)=="try-error"){
        newsList<-try(getUrlListByQuery(pageUrl), silent = T)
        Sys.sleep(abs(rnorm(1)))
        tryp<-tryp+1
        print(paste0("try again max num: ",pageUrl))
      }
      if(newsList$news_links[1]=="no naver news"){
        print("no naver news links this time")
        next
      }

      pool <- new_pool()
      data <- list()
      sapply(newsList$news_links, function(x) curl_fetch_multi(x,success,failure))
      res <- multi_run()

      if( identical(data, list()) ){
        pool <- new_pool()
        data <- list()
        sapply(newsList$news_links, function(x) curl_fetch_multi(x,success,failure))
        res <- multi_run()
      }

      closeAllConnections()

      loc<-sapply(data, function(x) grepl("^http://news.naver",x$url))
      cont<-sapply(data, function(x) x$content)
      cont<-cont[loc]

      if(identical(cont,character(0))){
        print("no naver news links this time")
        next
      }

      titles<-unlist(lapply(cont,function(x) getContentTitle(read_html(x))))
      bodies<-unlist(lapply(cont,function(x) getContentBody(read_html(x))))
      presses<-unlist(lapply(cont,function(x) getContentPress(read_html(x))))
      datetime<-lapply(cont,function(x) getContentDatetime(read_html(x))[1])
      datetime<-sapply(datetime, function(x) (as.character(x)[1]))
      edittime<-lapply(cont,function(x) getContentDatetime(read_html(x))[2])
      edittime<-sapply(edittime, function(x) (as.character(x)[1]))

      urls<-sapply(data, function(x) x$url)
      urls<-urls[loc]

      datC<-data.frame(titles,urls,presses,datetime,edittime,bodies)

      write.csv(datC, file=paste0("./data/news_",qlist[i],"/news_",date,"_",pageNum,".csv"),row.names = F, fileEncoding="euc-kr")

    }

  }
}
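
The two retry loops in the script above follow the same pattern, so they could be factored into one small helper. This is only a sketch: with_retry and flaky are hypothetical names, and the helper has not been run against the N2H4 functions. It checks for failure with inherits(), which stays safe when class() returns more than one value.

```r
# Hypothetical helper: call f() up to max_tries times, pausing between failures.
with_retry <- function(f, max_tries = 5, wait = function() abs(rnorm(1))) {
  for (attempt in seq_len(max_tries)) {
    result <- try(f(), silent = TRUE)
    # inherits() is safer than class(x) == "try-error", because class()
    # can return a vector with more than one element.
    if (!inherits(result, "try-error")) return(result)
    if (attempt < max_tries) Sys.sleep(wait())
  }
  result  # still a "try-error" object; the caller must check for it
}

# Usage with a stand-in function that fails twice and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
out <- with_retry(flaky, wait = function() 0)
```

In the script, a call such as getMaxPageNum(pageUrli, search=T) could be wrapped as with_retry(function() getMaxPageNum(pageUrli, search=T)). The caller would still need to test the result with inherits(result, "try-error") before comparing it to "no result".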

Answer 8 · 2018-02-10T08:27:12.000Z

I would like to get data for 2000 to 2005, but I keep getting errors.
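
For the date range itself, strDate:endDate in the script above yields plain day counts instead of Date objects, which is why each value is converted back with as.Date(..., origin = "1970-01-01"). The sketch below shows an alternative that keeps Date objects throughout. The helper date_params is hypothetical and only mirrors the ds, de and nso pieces of the URL in the script. Whether the search page still honors these parameters is an assumption, and the sketch does not address the unspecified errors mentioned here.

```r
# Build a sequence of Date objects instead of looping over day counts.
str_date <- as.Date("2001-01-02")
end_date <- as.Date("2001-01-04")   # short range for illustration

dates <- seq(str_date, end_date, by = "day")

# Hypothetical helper: build the date-related part of the search URL,
# mirroring the ds/de/nso pieces of the script above.
date_params <- function(d) {
  dotted  <- format(d, "%Y.%m.%d")   # e.g. 2001.01.02
  compact <- format(d, "%Y%m%d")     # e.g. 20010102
  paste0("&pd=3&ds=", dotted, "&de=", dotted,
         "&nso=so%3Ar%2Cp%3Afrom", compact, "to", compact, "%2Ca%3Aall")
}

# format() and paste0() are vectorized, so one call covers every date.
params <- date_params(dates)
```

Printing params[1] is a quick way to confirm that the requested day, not the current date, ends up in the URL.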

Answer 9 · 2018-05-16T01:47:10.000Z

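The environment has changed in a way that makes it difficult for getMaxPageNum() to work for search-based crawling, so support for this feature has been discontinued.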