Simple scraping and tidy webpage summaries

htmldf

Overview

The htmldf package contains a single function, html_df(), which accepts a vector of URLs as input. For each URL it attempts to download the page, then extract and parse the HTML. The result is returned as a tibble in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including:

  • page title
  • inferred language
  • RSS feeds
  • hyperlinks
  • image links
  • Twitter, GitHub and LinkedIn profiles
  • the inferred programming language of any text inside code tags
  • page size, generator and server
  • page accessed date
  • page published or last updated dates
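
To give a feel for the kind of extraction listed above, here is a minimal sketch using the rvest package on an inline HTML string. This is not htmldf's internal code: the sample page and the variable names are hypothetical, and html_df() does considerably more (downloading, language inference, dates and so on).

```r
library(rvest)

# Hypothetical sample page, used here in place of a downloaded document
html_txt <- '<html><head><title>Demo page</title></head>
<body>
  <a href="https://example.com/a">A</a>
  <a href="https://example.com/b">B</a>
  <img src="https://example.com/plot.png">
</body></html>'

# Parse the string into an HTML document
page <- read_html(html_txt)

# Page title, hyperlink targets and image sources
page_title <- page %>% html_element("title") %>% html_text2()
links      <- page %>% html_elements("a")    %>% html_attr("href")
images     <- page %>% html_elements("img")  %>% html_attr("src")
```

With html_df() these pieces arrive already collected, one row per page, so hand-written selectors like the ones above should not be needed for the common cases.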

Installation and usage

To install the package:

remotes::install_github('alastairrushworth/htmldf')

To use html_df():

library(htmldf)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.2
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn", 
          "https://www.robertmylesmcdonnell.com/content/posts/mtcars/")
z <- html_df(urlx, show_progress = FALSE)
z
## # A tibble: 3 x 15
##   url   title lang  url2  links rss   images social code_lang   size server
##   <chr> <chr> <chr> <chr> <lis> <chr> <list> <list>     <dbl>  <int> <chr> 
## 1 http… Visu… en    http… <tib… http… <tibb… <tibb…     1      38445 GitHu…
## 2 http… Conv… en    http… <tib… <NA>  <tibb… <tibb…    -0.936 110231 Googl…
## 3 http… Robe… en    http… <tib… <NA>  <tibb… <tibb…     1     291099 Netli…
## # … with 4 more variables: accessed <dttm>, published <dttm>, generator <chr>,
## #   source <chr>

Page titles

z %>% select(title, url2)
## # A tibble: 3 x 2
##   title                              url2                                       
##   <chr>                              <chr>                                      
## 1 Visualising Tour De France Data I… https://alastairrushworth.github.io/Visual…
## 2 Convolutional Neural Network (CNN… https://www.tensorflow.org/tutorials/image…
## 3 Robert Myles McDonnell             https://www.robertmylesmcdonnell.com/conte…

RSS feeds

z$rss
## [1] "https://alastairrushworth.github.io/feed.xml"
## [2] NA                                            
## [3] NA
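
To keep only the pages that advertise a feed, the rss column can be filtered for non-missing values. The sketch below uses a small stand-in tibble (z_mock, a hypothetical name) built from the values shown above so that it runs without network access; with a real result the same filter() would be applied to z.

```r
library(dplyr)

# Stand-in for the html_df() result: only the two columns needed here,
# with values copied from the example output
z_mock <- tibble(
  url = c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.robertmylesmcdonnell.com/content/posts/mtcars/"),
  rss = c("https://alastairrushworth.github.io/feed.xml", NA, NA)
)

# Keep rows where a feed was found
with_feed <- z_mock %>% filter(!is.na(rss))
```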

Social profiles

z$social
## [[1]]
## # A tibble: 3 x 3
##   site     handle                    profile                                    
##   <chr>    <chr>                     <chr>                                      
## 1 twitter  @rushworth_a              https://twitter.com/rushworth_a            
## 2 linkedin @alastair-rushworth-2531… https://linkedin.com/in/alastair-rushworth…
## 3 github   @alastairrushworth        https://github.com/alastairrushworth       
## 
## [[2]]
## # A tibble: 1 x 3
##   site    handle      profile                       
##   <chr>   <chr>       <chr>                         
## 1 twitter @tensorflow https://twitter.com/tensorflow
## 
## [[3]]
## # A tibble: 4 x 3
##   site     handle                   profile                                     
##   <chr>    <chr>                    <chr>                                       
## 1 twitter  @robertmylesmc           https://twitter.com/robertmylesmc           
## 2 linkedin @robert-mcdonnell-7475b… https://linkedin.com/in/robert-mcdonnell-74…
## 3 github   @coolbutuseless          https://github.com/coolbutuseless           
## 4 github   @robertmyles             https://github.com/robertmyles
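
Because social is a list-column holding one tibble per page, it can be awkward to search across pages. One option is to flatten it with tidyr::unnest(), as in the sketch below. It assumes the site, handle and profile columns shown above and uses a hypothetical stand-in (z_mock) so that it runs offline; the LinkedIn rows are left out because their values are truncated in the printed output.

```r
library(dplyr)
library(tidyr)

# Stand-in for the html_df() result, using Twitter and GitHub rows
# copied from the example output
z_mock <- tibble(
  url = c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.robertmylesmcdonnell.com/content/posts/mtcars/"),
  social = list(
    tibble(site    = c("twitter", "github"),
           handle  = c("@rushworth_a", "@alastairrushworth"),
           profile = c("https://twitter.com/rushworth_a",
                       "https://github.com/alastairrushworth")),
    tibble(site    = "twitter",
           handle  = "@tensorflow",
           profile = "https://twitter.com/tensorflow"),
    tibble(site    = c("twitter", "github", "github"),
           handle  = c("@robertmylesmc", "@coolbutuseless", "@robertmyles"),
           profile = c("https://twitter.com/robertmylesmc",
                       "https://github.com/coolbutuseless",
                       "https://github.com/robertmyles"))
  )
)

# One row per (page, profile) pair
social_long <- z_mock %>%
  select(url, social) %>%
  unnest(cols = social)

# Example query: every GitHub profile found on any page
github_only <- social_long %>% filter(site == "github")
```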

Inferred code language (near 1 = R; near -1 = Python)

z %>% select(code_lang, url2)
## # A tibble: 3 x 2
##   code_lang url2                                                                
##       <dbl> <chr>                                                               
## 1     1     https://alastairrushworth.github.io/Visualising-Tour-de-France-data…
## 2    -0.936 https://www.tensorflow.org/tutorials/images/cnn                     
## 3     1     https://www.robertmylesmcdonnell.com/content/posts/mtcars/
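
If a label is easier to work with than a score, code_lang can be bucketed with dplyr::case_when(). The cut-offs of 0.5 and -0.5 below are arbitrary assumptions chosen for illustration, not thresholds defined by htmldf, and z_mock is a hypothetical stand-in holding the scores shown above.

```r
library(dplyr)

# Stand-in for the html_df() result, with scores copied from the example output
z_mock <- tibble(
  url = c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.robertmylesmcdonnell.com/content/posts/mtcars/"),
  code_lang = c(1, -0.936, 1)
)

# Map the score to a label; the 0.5 / -0.5 thresholds are assumed, not official
labelled <- z_mock %>%
  mutate(code_label = case_when(
    is.na(code_lang)  ~ NA_character_,  # no score available
    code_lang >=  0.5 ~ "R",
    code_lang <= -0.5 ~ "Python",
    TRUE              ~ "unclear"
  ))
```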

Comments? Suggestions? Issues?

Any feedback is welcome! Feel free to open a GitHub issue or send me a message on Twitter.