Java Archive Wrapper Supporting the ‘htmlunit’ Package
Contents of the ‘HtmlUnit’ & supporting Java archives (https://htmlunit.sourceforge.net/). Version number reflects the version number of the included ‘JAR’ file.
HtmlUnit is a “GUI-Less browser for Java programs”. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc., just like you do in your “normal” browser. It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating Chrome, Firefox or Internet Explorer depending on the configuration used.
It is typically used for testing purposes or to retrieve information from web sites.
HtmlUnit is not a generic unit testing framework. It is specifically a way to simulate a browser.
This package contains everything necessary to use the HtmlUnit library directly via rJava.
HtmlUnit Library JavaDoc: https://htmlunit.sourceforge.net/apidocs/index.html
install.packages("htmlunitjars", repos = c("https://cinc.rud.is", "https://cloud.r-project.org/"))
# or
remotes::install_git("https://git.rud.is/hrbrmstr/htmlunitjars.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/htmlunitjars")
# or
remotes::install_gitlab("hrbrmstr/htmlunitjars")
# or
remotes::install_bitbucket("hrbrmstr/htmlunitjars")
# or
remotes::install_github("hrbrmstr/htmlunitjars")
NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.
library(htmlunitjars)
# current version
packageVersion("htmlunitjars")
## [1] '2.40.0'
xml2::read_html() cannot execute JavaScript, so the traditional approach won’t work:
library(rvest)
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"
doc <- read_html(test_url)
html_table(doc)
## list()
We can do this with the classes from HtmlUnit provided by this JAR wrapper package:
library(htmlunitjars)
Tell HtmlUnit to work like Chrome:
browsers <- J("com.gargoylesoftware.htmlunit.BrowserVersion")
wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"), browsers$CHROME)
Tell it to wait for JavaScript to execute and not to throw exceptions on failing status codes or script errors:
invisible(wc$waitForBackgroundJavaScriptStartingBefore(.jlong(2000L)))
wc_opts <- wc$getOptions()
wc_opts$setThrowExceptionOnFailingStatusCode(FALSE)
wc_opts$setThrowExceptionOnScriptError(FALSE)
Now, access the site again and get the table:
pg <- wc$getPage(test_url)
doc <- read_html(pg$asXml())
html_table(doc)
## [[1]]
## X1 X2
## 1 One Two
## 2 Three Four
## 3 Five Six
No need for Selenium or Splash!
The ultimate goal is to have an htmlunit package that provides a nicer API than needing to know how to work with rJava directly.
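Until such a package exists, the rJava calls shown earlier can be wrapped in a small helper. The following is only a sketch: the function name `hu_read_html` and its `js_wait_ms` argument are hypothetical and not part of this package. It assumes {rJava}, {htmlunitjars} and {xml2} are installed, and it waits for background JavaScript after the page has loaded rather than before.

```r
library(rJava)
library(htmlunitjars)
library(xml2)

# Hypothetical helper (not part of htmlunitjars): render a page with
# HtmlUnit, then hand the resulting markup to xml2.
hu_read_html <- function(url, js_wait_ms = 2000) {
  browsers <- J("com.gargoylesoftware.htmlunit.BrowserVersion")
  wc <- new(J("com.gargoylesoftware.htmlunit.WebClient"), browsers$CHROME)
  on.exit(wc$close(), add = TRUE)  # release the client when the function exits

  # Do not abort on failing HTTP status codes or JavaScript errors
  wc_opts <- wc$getOptions()
  wc_opts$setThrowExceptionOnFailingStatusCode(FALSE)
  wc_opts$setThrowExceptionOnScriptError(FALSE)

  pg <- wc$getPage(url)

  # Give background JavaScript up to `js_wait_ms` milliseconds to finish
  invisible(wc$waitForBackgroundJavaScript(.jlong(js_wait_ms)))

  # Parse the rendered DOM with xml2
  read_html(pg$asXml())
}
```

With {rvest} loaded, `html_table(hu_read_html(test_url))` should then return the same table as the step-by-step version, without any rJava calls in user code.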
Lang | # Files | (%) | LoC | (%) | Blank lines | (%) | # Lines | (%) |
---|---|---|---|---|---|---|---|---|
XML | 1 | 0.09 | 69 | 0.41 | 0 | 0.00 | 0 | 0.00 |
Java | 2 | 0.18 | 28 | 0.17 | 5 | 0.11 | 18 | 0.17 |
Maven | 1 | 0.09 | 23 | 0.14 | 1 | 0.02 | 2 | 0.02 |
Rmd | 1 | 0.09 | 21 | 0.12 | 35 | 0.74 | 50 | 0.47 |
R | 5 | 0.45 | 15 | 0.09 | 1 | 0.02 | 36 | 0.34 |
make | 1 | 0.09 | 13 | 0.08 | 5 | 0.11 | 0 | 0.00 |