The goal of {dm} is to provide tools for working with multiple tables. Skip to the Features section if you are familiar with relational data models.
- Why? gives a short motivation
- Features gives a one-page overview over the scope of this package
- Example outlines some of the features in a short example
- More information offers links to more detailed articles
- Standing on the shoulders of giants shows related work
- Installation describes how to install the package
As an example, we consider the
nycflights13 dataset. This
dataset contains five tables: the main flights
table with links into
the airlines
, planes
and airports
tables, and a weather
table
without an explicit link.
The separation into multiple tables achieves several goals:
- Avoid repetition, conserve memory: the facts about each airline,
airport, and airplane are stored only once
- name of each airline
- name, location and altitude of each airport
- manufacturer and number of seats for each airplane
- Improve consistency: if facts (e.g. the name of an airport) need to be updated, they need to be updated in only one place
- Segmentation: facts are organized by topic, individual tables are smaller and easier to handle
The links are established through primary and foreign keys: a primary key identifies rows/tuples/observations of this table, foreign keys link to a row/tuple/observation in another table. See the Simple English Wikipedia article on database normalization for more detail.
The separation into multiple tables helps data quality but poses a different challenge: for each flight, the location of the origin airport, or the details on the airplane, are not available immediately but must be joined/merged:
library(tidyverse)
library(nycflights13)
flights %>%
select(month, day, origin, tailnum) %>%
left_join(airports, by = c("origin" = "faa")) %>%
left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 19
#> month day origin tailnum name lat lon alt tz dst tzone
#> <int> <int> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
#> 1 1 1 EWR N14228 Newa… 40.7 -74.2 18 -5 A Amer…
#> 2 1 1 LGA N24211 La G… 40.8 -73.9 22 -5 A Amer…
#> 3 1 1 JFK N619AA John… 40.6 -73.8 13 -5 A Amer…
#> 4 1 1 JFK N804JB John… 40.6 -73.8 13 -5 A Amer…
#> 5 1 1 LGA N668DN La G… 40.8 -73.9 22 -5 A Amer…
#> 6 1 1 EWR N39463 Newa… 40.7 -74.2 18 -5 A Amer…
#> 7 1 1 EWR N516JB Newa… 40.7 -74.2 18 -5 A Amer…
#> 8 1 1 LGA N829AS La G… 40.8 -73.9 22 -5 A Amer…
#> 9 1 1 JFK N593JB John… 40.6 -73.8 13 -5 A Amer…
#> 10 1 1 LGA N3ALAA La G… 40.8 -73.9 22 -5 A Amer…
#> # … with 336,766 more rows, and 8 more variables: year <int>, type <chr>,
#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> # speed <int>, engine <chr>
This package helps with many challenges that arise when working with relational data models.
The dm
class manages several related tables. It stores both the
data and the metadata in a compound object, and defines
operations on that object. These operations either affect the data
(e.g., a filter), or the metadata (e.g., definition of keys or creation
of a new table), or both.
- data: a table source storing all tables
- metadata: table names, column names, primary and foreign keys
This concept helps separating the join logic from the code: declare your relationships once, as part of your data, then use them in your code without repeating yourself.
The {dm} package augments {dplyr}/{dbplyr} workflows. Generally, if you can use {dplyr} on your data, it’s likely that you can use {dm} too. This includes local data frames, relational database systems, and many more.
A battery of utilities helps with creating a tidy relational data model.
- Splitting and rejoining tables
- Determining key candidates
- Checking keys and cardinalities
A readymade dm
object with preset keys is included in the package:
library(dm)
cdm_nycflights13()
#> ── Table source ───────────────────────────────────────────────────────────
#> src: <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#> 5 tables: airlines, airports, flights, planes ...
#> 53 columns
#> 3 primary keys
#> 4 references
#> ── Rows ───────────────────────────────────────────────────────────────────
#> Total: 367687
#> airlines: 16, airports: 1458, flights: 336776, planes: 3322, weather: 26115
The cdm_draw()
function creates a visualization of the entity
relationship model:
cdm_nycflights13(cycle = TRUE) %>%
cdm_draw()
Similarly to dplyr::filter()
, a filtering function cdm_filter()
is
available for dm
objects. You need to provide the dm
object, the
table whose rows you want to filter, and the filter expression. A dm
object is returned whose tables only contain rows that are related to
the reduced rows in the filtered table. This currently only works for
cycle-free relationships between the tables.
cdm_nycflights13(cycle = FALSE) %>%
cdm_get_tables() %>%
map_int(nrow)
#> airlines airports flights planes weather
#> 16 1458 336776 3322 26115
cdm_nycflights13(cycle = FALSE) %>%
cdm_filter(planes, year == 2000, manufacturer == "BOEING") %>%
cdm_get_tables() %>%
map_int(nrow)
#> airlines airports flights planes weather
#> 4 3 7301 134 26115
For joining two tables using their relationship defined in the dm
, you
can use cdm_join_tbl()
:
cdm_nycflights13(cycle = FALSE) %>%
cdm_join_tbl(airports, flights, join = semi_join)
#> # A tibble: 3 x 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
#> 1 EWR Newark Liberty Intl 40.7 -74.2 18 -5 A America/New_York
#> 2 JFK John F Kennedy Intl 40.6 -73.8 13 -5 A America/New_York
#> 3 LGA La Guardia 40.8 -73.9 22 -5 A America/New_York
In our dm
, the origin
column of the flights
table points to the
airports
table. Since all nycflights13
-flights depart from New York,
only these airports are included in the semi-join.
In order to transfer an existing dm
object to a DB, you can call
cdm_copy_to()
with the target DB and the dm
object:
src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src: sqlite 3.25.3 [:memory:]
#> tbls:
nycflights13_remote <- cdm_copy_to(src_sqlite, cdm_nycflights13(cycle = TRUE))
nycflights13_remote
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.25.3 [:memory:]
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#> 5 tables: airlines, airports, flights, planes ...
#> 53 columns
#> 3 primary keys
#> 4 references
#> ── Rows ───────────────────────────────────────────────────────────────────
#> Total: 367687
#> airlines: 16, airports: 1458, flights: 336776, planes: 3322, weather: 26115
The key constraints from the original object are also copied to the
newly created object. With the default setting set_key_constraints = TRUE
for cdm_copy_to()
, key constraints are also established on the
target DB. Currently this feature is only supported for MSSQL and
Postgres database management systems (DBMS).
It is also possible to automatically create a dm
object from the
permanent tables of a DB. Again, for now just MSSQL and Postgres are
supported for this feature, so the next chunk is not evaluated. The
support for other DBMS will be implemented in a future update.
src_postgres <- src_postgres()
nycflights13_from_remote <- cdm_learn_from_db(src_postgres)
If you would like to learn more about {dm}, the Intro article is a good place to start. Further resources:
- Function reference
- Class ‘dm’ and basic operations article
- Visualizing ‘dm’ objects article
- Low-level operations article
This package follows the tidyverse principles:
dm
objects are immutable (your data will never be overwritten in place)- many functions used on
dm
objects are pipeable (i.e., return newdm
objects) - tidy evaluation is used (unquoted function parameters are supported)
The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!
The {polyply} package has a similar intent with a slightly different interface.
The {data.cube} package has
quite the same intent using array
-like interface.
Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.
The {tidygraph} package stores
a network as two related tables of nodes
and edges
, compatible with
{dplyr} workflows.
In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.
Once on CRAN, the package can be installed with
install.packages("dm")
Install the latest development version with
# install.packages("devtools")
devtools::install_github("krlmlr/dm")
License: MIT © cynkra GmbH.
Funded by:
Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.