The dbcooper package turns a database connection into a collection of functions, handling logic for keeping track of connections and letting you take advantage of autocompletion when exploring a database.
It's especially helpful to use when authoring database-specific R packages, for instance in an internal company package or one wrapping a public data source.
The package's name is a reference to the bandit D.B. Cooper.
- For the Python version of the package, see machow/dbcooper-py.
- For an example of database packages created with dbcooper, see stackbigquery or lahmancooper
- For some slides about the package, see here
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("pipeline-tools/dbcooper")
The dbcooper package asks you to create the connection first. As an example, we'll use the Lahman baseball database packaged with dbplyr.
library(dplyr)
lahman_db <- dbplyr::lahman_sqlite()
lahman_db
#> <SQLiteConnection>
#> Path: /tmp/RtmpeApR4v/lahman.sqlite
#> Extensions: TRUE
You set up dbcooper with the dbc_init
function, passing it a prefix lahman
that will apply to all the functions it creates.
library(dbcooper)
dbc_init(lahman_db, "lahman")
dbc_init
then creates user-friendly accessor functions in your global environment. (You could also pass it an environment in which the functions will be created).
dbc_init
adds several functions when it initializes a database source. In this case, each will start with the lahman_
prefix.
_list
: Get a list of tables_tbl
: Access a table that can be worked with in dbplyr_query
: Perform of a SQL query and work with the result_execute
: Execute a query (such as aCREATE
orDROP
)_src
: Retrieve adbi_src
for the database
For instance, we could start by finding the names of the tables in the Lahman database.
lahman_list()
#> [1] "AllstarFull" "Appearances" "AwardsManagers" "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers" "Batting" "BattingPost"
#> [9] "CollegePlaying" "Fielding" "FieldingOF" "FieldingOFsplit" "FieldingPost" "HallOfFame" "HomeGames" "LahmanData"
#> [17] "Managers" "ManagersHalf" "Parks" "People" "Pitching" "PitchingPost" "Salaries" "Schools"
#> [25] "SeriesPost" "Teams" "TeamsFranchises" "TeamsHalf" "sqlite_stat1" "sqlite_stat4"
We can access one of these tables with lahman_tbl()
, then put it through any kind of dplyr operation.
lahman_tbl("Batting")
#> # Source: SQL [?? x 22]
#> # Database: sqlite 3.39.2 [/tmp/RtmpeApR4v/lahman.sqlite]
#> playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
#> <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA 0
#> 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA 0
#> 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA 1
#> 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA 0
#> 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA 0
#> 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA 0
#> 7 barkeal01 1871 1 RC1 NA 1 4 0 1 0 0 0 2 0 0 1 0 NA NA NA NA 0
#> 8 barnero01 1871 1 BS1 NA 31 157 66 63 10 9 0 34 11 6 13 1 NA NA NA NA 1
#> 9 barrebi01 1871 1 FW1 NA 1 5 1 1 1 0 0 1 0 0 0 0 NA NA NA NA 0
#> 10 barrofr01 1871 1 BS1 NA 18 86 13 13 2 1 0 11 1 0 0 0 NA NA NA NA 0
#> # … with more rows
lahman_tbl("Batting") %>%
count(teamID, sort = TRUE)
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.39.2 [/tmp/RtmpeApR4v/lahman.sqlite]
#> # Ordered by: desc(n)
#> teamID n
#> <chr> <int>
#> 1 CHN 5129
#> 2 PHI 5026
#> 3 PIT 4984
#> 4 SLN 4904
#> 5 CIN 4786
#> 6 CLE 4731
#> 7 BOS 4571
#> 8 NYA 4530
#> 9 CHA 4523
#> 10 DET 4462
#> # … with more rows
If we'd rather write SQL than dplyr, we could also run lahman_query()
(which can also take a filename).
lahman_query("SELECT
playerID,
sum(AB) as AB
FROM Batting
GROUP BY playerID")
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.39.2 [/tmp/RtmpeApR4v/lahman.sqlite]
#> playerID AB
#> <chr> <int>
#> 1 aardsda01 4
#> 2 aaronha01 12364
#> 3 aaronto01 944
#> 4 aasedo01 5
#> 5 abadan01 21
#> 6 abadfe01 9
#> 7 abadijo01 49
#> 8 abbated01 3044
#> 9 abbeybe01 225
#> 10 abbeych01 1756
#> # … with more rows
Finally, lahman_execute()
is for commands like CREATE
and DROP
that don't return a table, but rather execute a command on the database.
lahman_execute("CREATE TABLE Players AS
SELECT playerID, SUM(AB) AS AB
FROM Batting
GROUP BY playerID")
#> [1] 0
lahman_tbl("Players")
#> # Source: SQL [?? x 2]
#> # Database: sqlite 3.39.2 [/tmp/RtmpeApR4v/lahman.sqlite]
#> playerID AB
#> <chr> <int>
#> 1 aardsda01 4
#> 2 aaronha01 12364
#> 3 aaronto01 944
#> 4 aasedo01 5
#> 5 abadan01 21
#> 6 abadfe01 9
#> 7 abadijo01 49
#> 8 abbated01 3044
#> 9 abbeybe01 225
#> 10 abbeych01 1756
#> # … with more rows
lahman_execute("DROP TABLE Players")
#> [1] 0
Besides the _list
, _tbl
, _query
, and _execute
functions, the package also creates auto-completed table accessors.
# Same result as lahman_tbl("Batting")
lahman_batting()
#> # Source: SQL [?? x 22]
#> # Database: sqlite 3.39.2 [/tmp/RtmpeApR4v/lahman.sqlite]
#> playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
#> <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0 NA NA NA NA 0
#> 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0 NA NA NA NA 0
#> 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5 NA NA NA NA 1
#> 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2 NA NA NA NA 0
#> 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA 0
#> 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1 NA NA NA NA 0
#> 7 barkeal01 1871 1 RC1 NA 1 4 0 1 0 0 0 2 0 0 1 0 NA NA NA NA 0
#> 8 barnero01 1871 1 BS1 NA 31 157 66 63 10 9 0 34 11 6 13 1 NA NA NA NA 1
#> 9 barrebi01 1871 1 FW1 NA 1 5 1 1 1 0 0 1 0 0 0 0 NA NA NA NA 0
#> 10 barrofr01 1871 1 BS1 NA 18 86 13 13 2 1 0 11 1 0 0 0 NA NA NA NA 0
#> # … with more rows
# Same result as lahman_tbl("Managers") %>% count()
lahman_managers() %>%
count()
#> # Source: SQL [1 x 1]
#> # Database: sqlite 3.39.2 [/tmp/RtmpeApR4v/lahman.sqlite]
#> n
#> <int>
#> 1 3684
These are useful because they let you use auto-complete to complete table names as you're exploring a data source.
Please note that the 'dbcooper' project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.