rstudio/webinars

zero list

ehsannu opened this issue · 5 comments

I want to download all the links and titles of papers from the web using rvest. I used the following script, but the resulting list is empty (zero nodes). Any suggestions?

library(rvest)

# Download the HTML and turn it into an XML document with read_html()
Papers <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

# Extract specific nodes with html_nodes()
Titles <- html_nodes(Papers, "span.optClickTitle")
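
A quick check confirms the selector matched nothing:

length(Titles)
#> [1] 0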

The optClickTitle class appears to sit on the page's <a> elements (that is where the href lives), not on <span>s, which is why span.optClickTitle matches nothing. Select by class alone:

library(rvest)

webpage <- read_html("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=1&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")


title <- webpage %>%
	html_nodes(".optClickTitle") %>%  # every node with class optClickTitle
	html_text()                       # the visible paper titles

links <- webpage %>%
	html_nodes(".optClickTitle") %>%
	html_attr("href")                 # the URL each title points to

info <- data.frame(title, links)
info

Thanks a lot! It works, but it only scrapes the records from the first page. Any suggestions?

You want everything for all 219 pages?

Yes!

Be aware, the code is going to run for quite a while. I recommend you export the resulting data frame right away as a CSV file or whatever format you prefer (see the snippet after the script).

library(rvest)
library(purrr)

scrape_paper_info <- function(link){

	# Grab every title node on one results page
	node_of_interest <- read_html(link) %>%
		html_nodes(".optClickTitle")

	data.frame(
		title = html_text(node_of_interest),
		link = html_attr(node_of_interest, "href")
	)
}

# Build the result-page URLs; npage runs over all 219 pages
links <- paste0("https://papers.ssrn.com/sol3/JELJOUR_Results.cfm?npage=",
					 1:219,
					 "&form_name=journalBrowse&journal_id=1475407&Network=no&lim=false")

# Scrape each page and row-bind the results into one data frame
paper_info <- map_dfr(links, scrape_paper_info)
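
As recommended above, write the result to disk as soon as the scrape finishes; a minimal example with base R (the file name is just a placeholder):

write.csv(paper_info, "ssrn_paper_info.csv", row.names = FALSE)

If SSRN starts rejecting the rapid-fire requests, a short pause between pages helps, for example a Sys.sleep(1) at the top of scrape_paper_info(); the one-second value is an assumption, not a tested limit.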