ehsong/navernewscrawler

performance checking

Opened this issue · 4 comments

jaeyk commented

Eunho: very nice code! I would just like to point out that it would be great if you could add a line to your function (likely the output function) that helps a user see whether the script is running without an error. Specifically, something that indicates progress status would be very helpful.

Hi Jae! Thank you for your suggestion. If you have seen a good case somewhere, could you direct me to an example I can look at? Also, the uploaded script is for building a package, so I will see if I can upload a separate one that could include the output function you are referring to.

If you are worried about whether the script works, for now you can try running it with a short start-to-end date period and a small max page limit, which will return the output pretty quickly (within 5 min).

jaeyk commented

I added two print statements to your output function. Their output is more informative than a progress bar.

    for date in date_range:
        start_page = page
        s_date = date.replace(".","")
        while start_page < max_page:
            url = "https://search.naver.com/search.naver?where=news&query=" + query + "&sort=0&ds=" + date + "&de=" + date + "&nso=so%3Ar%2Cp%3Afrom" + s_date + "to" + s_date + "%2Ca%3A&start=" + str(start_page)
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
            req = requests.get(url,headers=header)
            cont = req.content
            soup = BeautifulSoup(cont, 'html.parser')
            print("I am webscraping: " + str(date))
            for urls in soup.select("._sp_each_url"):
                try:
                    if urls["href"].startswith("https://news.naver.com"):
                        news_detail = get_news(urls["href"])
                        adict = dict()
                        adict["title"] = news_detail[0]
                        adict["date"] = news_detail[1]
                        adict["company"] = news_detail[3]
                        # adict["text"] = news_detail[2]
                        news_dicts.append(adict)
                        print(urls["href"])
                except Exception as e:
                    continue
            start_page += 10
    return news_dicts
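
For reference, here is a minimal sketch of the same idea with a counter added, so each message also shows how far along the run is. The names `format_progress` and `crawl_with_progress`, the sample dates, and the fake crawler are all hypothetical and not part of the package; `crawl_one_date` stands in for the per-date scraping loop above.

```python
def format_progress(done, total, label):
    """Return a one-line progress message for item `done` out of `total`."""
    percent = 100.0 * done / total
    return f"[{done}/{total} {percent:5.1f}%] {label}"


def crawl_with_progress(date_range, crawl_one_date):
    """Call crawl_one_date for every date and print a progress line after each.

    crawl_one_date is a hypothetical stand-in for the per-date scraping loop;
    it should return a list of records for the given date.
    """
    results = []
    total = len(date_range)
    for done, date in enumerate(date_range, start=1):
        results.extend(crawl_one_date(date))
        # flush=True so the line appears immediately, even when output is piped
        print(format_progress(done, total, date), flush=True)
    return results


if __name__ == "__main__":
    # Hypothetical dates and a fake crawler that returns one record per date.
    dates = ["2020.01.01", "2020.01.02", "2020.01.03"]
    crawl_with_progress(dates, lambda d: [{"date": d}])
```

Because the progress line is printed once per date rather than once per page, the output stays short even for a large max page limit.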

jaeyk commented

Also, you can easily run a benchmark: using a for loop instead of a while loop is much faster =)
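
One way such a benchmark might look, as a rough sketch using the standard-library `timeit` module. The function names `pages_while` and `pages_for` and the page limits are made up for illustration; only the paging pattern (start page, max page, step of 10) comes from the snippet above, and no network requests are made.

```python
import timeit


def pages_while(start_page, max_page, step=10):
    """Collect page offsets with a manually incremented while loop."""
    pages = []
    page = start_page
    while page < max_page:
        pages.append(page)
        page += step
    return pages


def pages_for(start_page, max_page, step=10):
    """Collect the same page offsets with a for loop over range()."""
    pages = []
    for page in range(start_page, max_page, step):
        pages.append(page)
    return pages


if __name__ == "__main__":
    # Both versions must visit identical offsets before timing means anything.
    assert pages_while(1, 100_000) == pages_for(1, 100_000)
    for fn in (pages_while, pages_for):
        seconds = timeit.timeit(lambda: fn(1, 100_000), number=200)
        print(f"{fn.__name__}: {seconds:.3f} s for 200 runs")
```

Timings depend on the machine and Python version. In the actual crawler, the time spent waiting on `requests.get` is likely to outweigh any difference between the two loop styles, so it is worth measuring the full script as well.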

I get what you mean now! I'll update the script with the print statement.