Simple scraping and tidy webpage summaries

htmldf

The htmldf package contains a single function, html_df(), which accepts a vector of URLs as input. For each URL it attempts to download the page, then extract and parse the HTML. The result is returned as a tibble in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including:

  • page title
  • inferred language
  • RSS feeds
  • hyperlinks
  • image links
  • Twitter, GitHub and LinkedIn profiles
  • the inferred programming language of any text inside code tags
  • page size, generator and server
  • page accessed date
  • page published or last updated dates
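To make the idea of turning parsed HTML into tidy columns concrete, here is a minimal sketch using the xml2 package on an inline HTML string. It is an illustration under stated assumptions, not htmldf's actual implementation: the sample page, the XPath queries and the `summary_row` object are all hypothetical, and only a few of the columns listed above are covered.

```r
library(xml2)

# Hypothetical stand-in for a downloaded page; html_df() fetches real URLs instead.
page <- '<html><head><title>Example page</title>
<link rel="alternate" type="application/rss+xml" href="https://example.com/feed.xml">
</head><body>
<a href="https://example.com/about">About</a>
<a href="https://twitter.com/example">Twitter</a>
<img src="https://example.com/logo.png">
</body></html>'

doc <- read_html(page)

# Build a one-row summary: scalar attributes become ordinary columns.
summary_row <- data.frame(
  title = xml_text(xml_find_first(doc, "//title")),
  rss   = xml_attr(xml_find_first(doc, "//link[@type='application/rss+xml']"), "href"),
  stringsAsFactors = FALSE
)

# Attributes with many values per page are stored as list-columns.
summary_row$links  <- list(xml_attr(xml_find_all(doc, "//a"), "href"))
summary_row$images <- list(xml_attr(xml_find_all(doc, "//img"), "src"))
```

Storing links and images as list-columns keeps one row per page, which matches the shape of the tibble that html_df() is described as returning.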

Installation and usage

To install the package:

remotes::install_github('alastairrushworth/htmldf')

To use html_df():

library(htmldf)
library(dplyr)

urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn", 
          "https://www.robertmylesmcdonnell.com/content/posts/mtcars/")
z <- html_df(urlx, show_progress = FALSE)
z
## # A tibble: 3 x 15
##   url   title lang  url2  links rss   images social code_lang   size server
##   <chr> <chr> <chr> <chr> <lis> <chr> <list> <list> <chr>      <int> <chr> 
## 1 http… Visu… en    http… <tib… http… <tibb… <tibb… r          38198 GitHu…
## 2 http… Conv… en    http… <tib… <NA>  <tibb… <tibb… py         96758 Googl…
## 3 http… Robe… en    http… <tib… <NA>  <tibb… <tibb… r         291099 Netli…
## # … with 4 more variables: accessed <dttm>, published <dttm>, generator <chr>,
## #   source <list>

Page titles

z %>% select(title, url2)
## # A tibble: 3 x 2
##   title                              url2                                       
##   <chr>                              <chr>                                      
## 1 Visualising Tour De France Data I… https://alastairrushworth.github.io/Visual…
## 2 Convolutional Neural Network (CNN… https://www.tensorflow.org/tutorials/image…
## 3 Robert Myles McDonnell             https://www.robertmylesmcdonnell.com/conte…

RSS feeds

z$rss
## [1] "https://alastairrushworth.github.io/feed.xml"
## [2] NA                                            
## [3] NA

Social profiles

z$social
## [[1]]
## # A tibble: 2 x 3
##   site    handle             profile                             
##   <chr>   <chr>              <chr>                               
## 1 twitter @rushworth_a       https://twitter.com/rushworth_a     
## 2 github  @alastairrushworth https://github.com/alastairrushworth
## 
## [[2]]
## # A tibble: 1 x 3
##   site    handle      profile                       
##   <chr>   <chr>       <chr>                         
## 1 twitter @tensorflow https://twitter.com/tensorflow
## 
## [[3]]
## # A tibble: 4 x 3
##   site     handle                   profile                                     
##   <chr>    <chr>                    <chr>                                       
## 1 twitter  @robertmylesmc           https://twitter.com/robertmylesmc           
## 2 linkedin @robert-mcdonnell-7475b… https://linkedin.com/in/robert-mcdonnell-74…
## 3 github   @coolbutuseless          https://github.com/coolbutuseless           
## 4 github   @robertmyles             https://github.com/robertmyles
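Because social is a list-column with one small table per page, it is often handier to stack those tables into a single long table. The sketch below does this in base R on a hand-built stand-in for two of the pages shown above, so it runs without network access; the object names `social`, `urls`, `social_long` and `github_only` are illustrative only.

```r
# Hypothetical stand-in for z$social and z$url2, copied from the printed output.
social <- list(
  data.frame(site   = c("twitter", "github"),
             handle = c("@rushworth_a", "@alastairrushworth"),
             stringsAsFactors = FALSE),
  data.frame(site   = "twitter",
             handle = "@tensorflow",
             stringsAsFactors = FALSE)
)
urls <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn")

# Tag each per-page table with its source URL, then stack them row-wise.
tagged <- Map(function(tbl, u) cbind(url = u, tbl, stringsAsFactors = FALSE),
              social, urls)
social_long <- do.call(rbind, tagged)

# Keep only the GitHub profiles.
github_only <- social_long[social_long$site == "github", ]
```

On the real tibble, assuming tidyr is installed, `z %>% select(url2, social) %>% tidyr::unnest(social)` should give a comparable long table in one step.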

Inferred code language

z %>% select(code_lang, url2)
## # A tibble: 3 x 2
##   code_lang url2                                                                
##   <chr>     <chr>                                                               
## 1 r         https://alastairrushworth.github.io/Visualising-Tour-de-France-data…
## 2 py        https://www.tensorflow.org/tutorials/images/cnn                     
## 3 r         https://www.robertmylesmcdonnell.com/content/posts/mtcars/

Comments? Suggestions? Issues?

Any feedback is welcome! Feel free to open a GitHub issue or send me a message on Twitter.