PRArulebook
is a package to scrape the PRA (Prudential Regulation
Authority) Rulebook (the website
containing the rules made and enforced by the PRA under powers conferred
by the Financial Services and Markets Act 2000 (FSMA)).
The input to this package is the PRA Rulebook website. Outputs from this package are the rules published on the PRA Rulebook website in a format more amenable to text and network analysis.
PRArulebook
was developed while preparing:
Amadxarif, Z., Brookes, J., Garbarino, N., Patel, R., Walczak, E. (2019) The Language of Rules: Textual Complexity in Banking Reforms. Staff Working Paper No. 834. Bank of England.
Any use of this package with the PRA Rulebook must comply with the PRA Rulebook’s Terms of Use. These include, but are not limited to, restrictions on using content from the PRA Rulebook for commercial purposes without obtaining a licence from the PRA.
Bank of England Data Classification: OFFICIAL BLUE
You can install the development version of PRArulebook
from GitHub
with:
install.packages("devtools")
devtools::install_github("erzk/PRArulebook")
PRArulebook
scrapes two types of data: structure and content.
-
Structure - hierarchy of the PRA Rulebook. Includes URLs and names.
-
Content
- Text - can be used for text analysis
- Network - can be used for network analysis.
The next section shows how to extract these types of data.
Load the package
library(PRArulebook)
The simplest way to extract a rulebook structure is to use
get_structure
function
# get the structure of the rulebook down to the part-level
parts <-
get_structure("16-11-2007",
layer = "part")
# or chapter-level
# warnings (410) are displayed for inactive sites
chapters <-
get_structure("18-06-2019",
layer = "chapter")
This will start scraping the PRA rulebook. Warnings
(410 code)
will be displayed when a page is no longer active. Pulling data will
take longer if you decide to pull more granular data. The rulebook has
several layers and each of them can be passed to the layer
argument of
get_structure
(in descending order):
sector
part
chapter
The output will be a data frame with information about the structure (i.e. URLs and names).
Scraping individual rules is much slower so another function should be used
# extract all rules from the first three chapters
rules <- scrape_rule_structure(chapters[1:3,], "18-06-2019")
Once the structure URLs are scraped, they can be used to extract content.
To get content of the rulebook (text or links) use get_content
function with a URL of a given chapter.
# scrape text from a single chapter
chapter_text <- get_content(chapters$chapter_url[1])
# or single rule
rule_text <- get_content(rules$rule_url[3], "text", "yes")
This function can be applied on the entire rulebook in the following way:
library(purrr)
# text: chapter-level
# exception handling might be needed
chapters_text <-
map_df(chapters$chapter_url[1:5],
get_content)
# text
rules_text <-
map_df(rules$rule_url[1:5],
get_content,
type = "text",
single_rule_selector = "yes")
# links
rules_links <-
map_df(rules$rule_url[1:5],
get_content,
type = "links",
single_rule_selector = "yes")
The output can be then joined to the information about the rulebook structure and aggregated at a higher level.
To scrape the links and create data set for network analysis
get_content
function can be used but with a type
argument set to
"links"
. Like in the previous example, this call can also be
parallelised.
chapter_link <- get_content(chapters$chapter_url[1], "links")
# sequential
parts_links <-
purrr::map_df(parts$part_url[1:5],
get_content,
"links")
The code above will return a data frame with from/to url, text used in a link, and a type of a link.
Scraped data containing information about the links can be used for network analysis (warning: further cleaning might be required).
This package is an outcome of a research project. All errors are mine. All views expressed are personal views, not those of any employer.