An R package to tidy tree-shaped data (XML, HTML)

tidyweb

Use tidy data principles to interact with HTML files!

This package is meant to ease web scraping with Selenium by “tidying” the HTML structure. To do so, it recursively walks web elements down to a given depth and returns a tibble, with the child elements nested in list-columns. Tidy data tools can then be used to identify specific elements and, eventually, interact with them.
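The recursive idea described above can be sketched in a few lines. This is only an illustration, not tidyweb's actual implementation: the helper `tidy_node_sketch()`, its column names, and the underscore-separated id scheme are assumptions made for the example, and it relies on the xml2, tibble, and dplyr packages.

```r
library(xml2)
library(tibble)

# Hypothetical helper (not part of tidyweb): turn one node into a one-row
# tibble and recurse into its children until `depth` is used up.
tidy_node_sketch <- function(node, depth = 2, id = "1") {
  kids <- xml_children(node)
  children <- if (depth > 1 && length(kids) > 0) {
    # Child ids extend the parent id, e.g. "1_2" is the second child of "1".
    dplyr::bind_rows(lapply(seq_along(kids), function(i) {
      tidy_node_sketch(kids[[i]], depth - 1, paste(id, i, sep = "_"))
    }))
  } else {
    NULL
  }
  tibble(
    .id = id,
    name = xml_name(node),
    class = xml_attr(node, "class"),  # NA when the attribute is absent
    href = xml_attr(node, "href"),
    text = xml_text(node),
    children = list(children)         # nested tibble, or NULL at the leaves
  )
}

# Toy document, so the sketch runs without network access
doc <- read_html('<div class="box"><a href="/a">A</a><p>text</p></div>')
root <- xml_find_first(doc, "//div")
out <- tidy_node_sketch(root, depth = 2)

# The direct children of the <div> live in the nested list-column
out$children[[1]]
```

Because every level is an ordinary tibble, the usual dplyr verbs can be applied to `out` or to any nested `children` tibble.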

Install

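# remotes::install_github("benjaminguinaudeau/tidyweb")
library(tidyweb)
library(dplyr)
library(tidyr)    # separate_rows()
library(stringr)  # str_count()
library(ggplot2)  # ggplot(), geom_histogram()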

How to use with rvest?

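# Download and parse the page
page <- xml2::read_html("https://www.nytimes.com/")

# Select every <article> node
art <- page %>%
  rvest::html_nodes("article")

# Tidy each article, descending up to 10 levels into its children
parsed_art <- art %>% tidy_element(depth = 10)

parsed_art %>% glimpse()

# Keep only the elements that carry a link
parsed_art %>% filter(!is.na(href)) %>% glimpse()

# Count how often each CSS class occurs (one row per class)
parsed_art %>%
  separate_rows(class, sep = "\\s+") %>%
  count(class, sort = TRUE) %>%
  glimpse()

# Distribution of element depth, derived from the number of "_" in .id
parsed_art %>%
  mutate(depth = str_count(.id, "_") + 1) %>%
  ggplot(aes(x = depth)) +
  geom_histogram(binwidth = 1)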
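To make the depth calculation in the last step concrete, here is a small toy example. It assumes that `.id` encodes the path to each element with underscores (for instance, `"1_2"` for the second child of element `"1"`), which is what the `str_count(.id, "_") + 1` formula suggests; the ids below are made up for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy ids, assumed to follow a "parent_child" path pattern
toy <- tibble(.id = c("1", "1_1", "1_2", "1_2_1"))

# One underscore per level below the root, so depth = underscores + 1
toy_depth <- toy %>%
  mutate(depth = str_count(.id, "_") + 1) %>%
  count(depth)

toy_depth
```

Here the root has depth 1, its two children have depth 2, and the single grandchild has depth 3.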
  

Thanks

A huge thank you to Favstats for designing each of the hex-stickers.