Parse and Test Robots Exclusion Protocol Files and Rules
The ‘Robots Exclusion Protocol’ (https://www.robotstxt.org/orig.html) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the ‘rep-cpp’ (https://github.com/seomoz/rep-cpp) C++ library for processing these ‘robots.txt’ files.
The following functions are implemented:
- `can_fetch`: Test URL paths against a ‘robxp’ robots.txt object
- `crawl_delays`: Retrieve all agent crawl delay values in a ‘robxp’ robots.txt object
- `print.robxp`: Custom printer for ‘robxp’ objects
- `robxp`: Parse a ‘robots.txt’ file & create a ‘robxp’ object
- `sitemaps`: Retrieve a character vector of sitemaps from a parsed robots.txt object
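All of the above can be exercised without touching the network, since a ‘robxp’ object can be built from robots.txt content already in memory (as the examples further down do with `robotstxt::get_robotstxt()`). A minimal end-to-end sketch; the ruleset and URLs below are invented for illustration:

``` r
library(spiderbar)

# a toy robots.txt ruleset, collapsed into a single string
rules <- paste(
  c(
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /private/",
    "Sitemap: https://example.com/sitemap.xml"
  ),
  collapse = "\n"
)

rt <- robxp(rules)

can_fetch(rt, "/private/secret.html", "*") # expect FALSE
crawl_delays(rt)                           # expect a delay of 10 for agent "*"
sitemaps(rt)                               # expect the example.com sitemap URL
```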
install.packages("spiderbar", repos = c("https://cinc.rud.is", "https://cloud.r-project.org/"))
# or
remotes::install_git("https://git.rud.is/hrbrmstr/spiderbar.git")
# or
remotes::install_git("https://git.sr.ht/~hrbrmstr/spiderbar")
# or
remotes::install_gitlab("hrbrmstr/spiderbar")
# or
remotes::install_bitbucket("hrbrmstr/spiderbar")
# or
remotes::install_github("hrbrmstr/spiderbar")
NOTE: To use the ‘remotes’ install options you will need to have the {remotes} package installed.
library(spiderbar)
library(robotstxt)
# current version
packageVersion("spiderbar")
## [1] '0.2.3'
# use helpers from the robotstxt package
rt <- robxp(get_robotstxt("https://cdc.gov"))
print(rt)
## <Robots Exclusion Protocol Object>
# or
rt <- robxp(url("https://cdc.gov/robots.txt"))
can_fetch(rt, "/asthma/asthma_stats/default.htm", "*")
## [1] TRUE
can_fetch(rt, "/_borders", "*")
## [1] FALSE
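If you need to gate live requests on these checks, `can_fetch()` composes naturally with an HTTP client. The helper below is a hypothetical sketch that assumes the {httr} package, which this package neither provides nor requires:

``` r
library(spiderbar)
library(httr)

# hypothetical helper: fetch a path only if the site's robots.txt allows it
polite_get <- function(base_url, path, agent = "*") {
  rt <- robxp(url(paste0(base_url, "/robots.txt")))
  if (!can_fetch(rt, path, agent)) {
    return(NULL) # disallowed for this agent; skip the request
  }
  GET(paste0(base_url, path), user_agent(agent))
}

res <- polite_get("https://cdc.gov", "/asthma/asthma_stats/default.htm")
```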
gh_rt <- robxp(robotstxt::get_robotstxt("github.com"))
can_fetch(gh_rt, "/humans.txt", "*") # TRUE
## [1] TRUE
can_fetch(gh_rt, "/login", "*") # FALSE
## [1] TRUE
can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
## [1] TRUE
can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed"))
## [1] TRUE TRUE TRUE
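Since `can_fetch()` is vectorised over paths, the logical vector it returns can drive a filter directly. A one-line sketch using the `gh_rt` object from above:

``` r
paths <- c("/humans.txt", "/login", "/oembed")
paths[can_fetch(gh_rt, paths)] # given the results above: "/humans.txt"
```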
crawl_delays(gh_rt)
agent | crawl_delay
---|---
baidu | 1
* | -1
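These values can feed a simple rate limiter; the -1 for `*` suggests that agents with no declared delay are reported as a negative value. A sketch under that assumption:

``` r
# pause between requests according to the published crawl delay
# (assumption: a value <= 0 means no delay was declared for that agent)
delays <- crawl_delays(gh_rt)
wait <- delays$crawl_delay[delays$agent == "baidu"]
if (length(wait) == 1 && wait > 0) Sys.sleep(wait)
```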
imdb_rt <- robxp(robotstxt::get_robotstxt("imdb.com"))
crawl_delays(imdb_rt)
agent | crawl_delay
---|---
* | -1
sitemaps(imdb_rt)
## character(0)
Lang | # Files | (%) | LoC | (%) | Blank lines | (%) | # Lines | (%)
---|---|---|---|---|---|---|---|---
C++ | 9 | 0.39 | 1763 | 0.79 | 257 | 0.56 | 258 | 0.38
C/C++ Header | 7 | 0.30 | 395 | 0.18 | 152 | 0.33 | 280 | 0.42
R | 6 | 0.26 | 47 | 0.02 | 18 | 0.04 | 101 | 0.15
Rmd | 1 | 0.04 | 23 | 0.01 | 31 | 0.07 | 33 | 0.05
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.