rstudio/webinars

zero list

ehsannu opened this issue · 5 comments

I want to download all the links and titles of papers from the web using rvest. I used the following script, but the resulting list is empty (zero nodes). Any suggestions?

library(rvest)

# Download the HTML and turn it into an XML document with read_html()
Papers <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

# Extract specific nodes with html_nodes()
Titles <- html_nodes(Papers, "span.optClickTitle")
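
A quick check confirms the selector matched nothing:

length(Titles)
#> [1] 0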

The optClickTitle class appears to sit on the page's <a> elements (that is where the href lives), not on <span>s, which is why span.optClickTitle matches nothing. Select by class alone:

library(rvest)

webpage <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")


title <- webpage %>%
	html_nodes(".optClickTitle") %>%  # every node with class optClickTitle
	html_text()                       # the visible paper titles

links <- webpage %>%
	html_nodes(".optClickTitle") %>%
	html_attr("href")                 # the URL each title points to

info <- data.frame(title, links)
info

Thanks a lot! It works, but it only scrapes the records from the first page. Any suggestions?

You want everything for all 219 pages?

Yes!

Be aware, the code is going to run for quite a while. I recommend you export the resulting data frame right away as a CSV file or whatever format you prefer (see the snippet after the script).

library(rvest)
library(purrr)

scrape_paper_info <- function(link){

	# Grab every title node on one results page
	node_of_interest <- read_html(link) %>%
		html_nodes(".optClickTitle")

	data.frame(
		title = html_text(node_of_interest),
		link = html_attr(node_of_interest, "href")
	)
}

# Build the result-page URLs; npage runs over all 219 pages
links <- paste0("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=",
					 1:219,
					 "&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

# Scrape each page and row-bind the results into one data frame
paper_info <- map_dfr(links, scrape_paper_info)
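
As recommended above, write the result to disk as soon as the scrape finishes; a minimal example with base R (the file name is just a placeholder):

write.csv(paper_info, "ssrn_paper_info.csv", row.names = FALSE)

If SSRN starts rejecting the rapid-fire requests, a short pause between pages helps, for example a Sys.sleep(1) at the top of scrape_paper_info(); the one-second value is an assumption, not a tested limit.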