The package htmldf contains a single function, html_df(), which accepts a vector of URLs and, for each one, attempts to download the page and then extract and parse its HTML. The result is returned as a tibble in which each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including:
- page title
- inferred language
- RSS feeds
- hyperlinks
- image links
- Twitter, GitHub and LinkedIn profiles
- the inferred programming language of any text inside code tags
- page size, generator and server
- page accessed date
- page published or last updated dates
To install the package:
remotes::install_github('alastairrushworth/htmldf')
To use html_df():
library(htmldf)
library(dplyr)
# URLs of the pages to download and parse
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.robertmylesmcdonnell.com/content/posts/mtcars/")
# One row per URL; show_progress = FALSE suppresses the progress display
z <- html_df(urlx, show_progress = FALSE)
z
## # A tibble: 3 x 15
##   url   title lang  url2  links rss   images social code_lang   size server
##   <chr> <chr> <chr> <chr> <lis> <chr> <list> <list> <chr>      <int> <chr>
## 1 http… Visu… en    http… <tib… http… <tibb… <tibb… r          38198 GitHu…
## 2 http… Conv… en    http… <tib… <NA>  <tibb… <tibb… py         96758 Googl…
## 3 http… Robe… en    http… <tib… <NA>  <tibb… <tibb… r         291099 Netli…
## # … with 4 more variables: accessed <dttm>, published <dttm>, generator <chr>,
## #   source <list>
Page titles
z %>% select(title, url2)
## # A tibble: 3 x 2
##   title                              url2
##   <chr>                              <chr>
## 1 Visualising Tour De France Data I… https://alastairrushworth.github.io/Visual…
## 2 Convolutional Neural Network (CNN… https://www.tensorflow.org/tutorials/image…
## 3 Robert Myles McDonnell             https://www.robertmylesmcdonnell.com/conte…
RSS feeds
z$rss
## [1] "https://alastairrushworth.github.io/feed.xml"
## [2] NA
## [3] NA
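To keep only the pages that advertise a feed, the rss column can be filtered for non-missing values. The sketch below runs on a small stand-in tibble, z_demo (a hypothetical name), built from the values printed above so that it works offline; url2 is assumed here to match the input URLs. With real results, substitute z for z_demo.

```r
library(dplyr)

# Stand-in for the result of html_df(): only the two columns used here.
# Values are copied from the output shown above; url2 is assumed to equal
# the input URL.
z_demo <- tibble(
  url2 = c(
    "https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
    "https://www.tensorflow.org/tutorials/images/cnn",
    "https://www.robertmylesmcdonnell.com/content/posts/mtcars/"
  ),
  rss = c("https://alastairrushworth.github.io/feed.xml", NA, NA)
)

# Keep only the pages where an RSS feed was found
with_feed <- z_demo %>%
  filter(!is.na(rss)) %>%
  select(url2, rss)

with_feed
```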
Social profiles
z$social
## [[1]]
## # A tibble: 2 x 3
##   site    handle             profile
##   <chr>   <chr>              <chr>
## 1 twitter @rushworth_a       https://twitter.com/rushworth_a
## 2 github  @alastairrushworth https://github.com/alastairrushworth
##
## [[2]]
## # A tibble: 1 x 3
##   site    handle      profile
##   <chr>   <chr>       <chr>
## 1 twitter @tensorflow https://twitter.com/tensorflow
##
## [[3]]
## # A tibble: 4 x 3
##   site     handle                   profile
##   <chr>    <chr>                    <chr>
## 1 twitter  @robertmylesmc           https://twitter.com/robertmylesmc
## 2 linkedin @robert-mcdonnell-7475b… https://linkedin.com/in/robert-mcdonnell-74…
## 3 github   @coolbutuseless          https://github.com/coolbutuseless
## 4 github   @robertmyles             https://github.com/robertmyles
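Because social is a list column holding one tibble per page, it can be flattened into a single long table with one row per profile. This is a minimal sketch, assuming the tidyr package is installed (it is not loaded in the examples above). It uses a hypothetical stand-in, z_demo, built from the first two pages printed above so that it runs offline; url2 is assumed to match the input URLs. With real results, substitute z for z_demo.

```r
library(dplyr)
library(tidyr)

# Stand-in for the result of html_df(): the url2 and social columns only,
# with values copied from the output shown above.
z_demo <- tibble(
  url2 = c(
    "https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
    "https://www.tensorflow.org/tutorials/images/cnn"
  ),
  social = list(
    tibble(
      site    = c("twitter", "github"),
      handle  = c("@rushworth_a", "@alastairrushworth"),
      profile = c("https://twitter.com/rushworth_a",
                  "https://github.com/alastairrushworth")
    ),
    tibble(
      site    = "twitter",
      handle  = "@tensorflow",
      profile = "https://twitter.com/tensorflow"
    )
  )
)

# Expand the list column: one row per (page, profile) pair
social_long <- z_demo %>%
  select(url2, social) %>%
  unnest(social)

social_long
```

The long form makes it straightforward to filter by site or to join the profiles back to other page attributes by url2.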
Inferred code language
z %>% select(code_lang, url2)
## # A tibble: 3 x 2
##   code_lang url2
##   <chr>     <chr>
## 1 r         https://alastairrushworth.github.io/Visualising-Tour-de-France-data…
## 2 py        https://www.tensorflow.org/tutorials/images/cnn
## 3 r         https://www.robertmylesmcdonnell.com/content/posts/mtcars/
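With a longer list of URLs, a quick tally of the inferred languages can be more useful than the row-by-row view. The sketch below counts pages per language on a hypothetical stand-in, z_demo, whose code_lang values are copied from the output above; with real results, substitute z for z_demo.

```r
library(dplyr)

# Stand-in for the result of html_df(): only the code_lang column,
# with values copied from the output shown above.
z_demo <- tibble(code_lang = c("r", "py", "r"))

# Number of pages per inferred language, most common first
lang_counts <- z_demo %>%
  count(code_lang, sort = TRUE)

lang_counts
```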
Any feedback is welcome! Feel free to open a GitHub issue or send me a message on Twitter.