/dm

Relational data models

Primary LanguageROtherNOASSERTION

Lifecycle: experimental Travis build status Codecov test coverage CRAN status

dm

The goal of {dm} is to provide tools for working with multiple tables. Skip to the Features section if you are familiar with relational data models.

Why?

As an example, we consider the nycflights13 dataset. This dataset contains five tables: the main flights table with links into the airlines, planes and airports tables, and a weather table without an explicit link.

The separation into multiple tables achieves several goals:

  • Avoid repetition, conserve memory: the facts about each airline, airport, and airplane are stored only once
    • name of each airline
    • name, location and altitude of each airport
    • manufacturer and number of seats for each airplane
  • Improve consistency: if facts (e.g. the name of an airport) need to be updated, they need to be updated in only one place
  • Segmentation: facts are organized by topic, individual tables are smaller and easier to handle

The links are established through primary and foreign keys: a primary key identifies rows/tuples/observations of this table, foreign keys link to a row/tuple/observation in another table. See the Simple English Wikipedia article on database normalization for more detail.

Joining

The separation into multiple tables helps data quality but poses a different challenge: for each flight, the location of the origin airport, or the details on the airplane, are not available immediately but must be joined/merged:

library(tidyverse)
library(nycflights13)

flights %>% 
  select(month, day, origin, tailnum) %>% 
  left_join(airports, by = c("origin" = "faa")) %>% 
  left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 19
#>    month   day origin tailnum name    lat   lon   alt    tz dst   tzone
#>    <int> <int> <chr>  <chr>   <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
#>  1     1     1 EWR    N14228  Newa…  40.7 -74.2    18    -5 A     Amer…
#>  2     1     1 LGA    N24211  La G…  40.8 -73.9    22    -5 A     Amer…
#>  3     1     1 JFK    N619AA  John…  40.6 -73.8    13    -5 A     Amer…
#>  4     1     1 JFK    N804JB  John…  40.6 -73.8    13    -5 A     Amer…
#>  5     1     1 LGA    N668DN  La G…  40.8 -73.9    22    -5 A     Amer…
#>  6     1     1 EWR    N39463  Newa…  40.7 -74.2    18    -5 A     Amer…
#>  7     1     1 EWR    N516JB  Newa…  40.7 -74.2    18    -5 A     Amer…
#>  8     1     1 LGA    N829AS  La G…  40.8 -73.9    22    -5 A     Amer…
#>  9     1     1 JFK    N593JB  John…  40.6 -73.8    13    -5 A     Amer…
#> 10     1     1 LGA    N3ALAA  La G…  40.8 -73.9    22    -5 A     Amer…
#> # … with 336,766 more rows, and 8 more variables: year <int>, type <chr>,
#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> #   speed <int>, engine <chr>

Features

This package helps with many challenges that arise when working with relational data models.

Compound object

The dm class manages several related tables. It stores both the data and the metadata in a compound object, and defines operations on that object. These operations either affect the data (e.g., a filter), or the metadata (e.g., definition of keys or creation of a new table), or both.

  • data: a table source storing all tables
  • metadata: table names, column names, primary and foreign keys

This concept helps separating the join logic from the code: declare your relationships once, as part of your data, then use them in your code without repeating yourself.

Storage agnostic

The {dm} package augments {dplyr}/{dbplyr} workflows. Generally, if you can use {dplyr} on your data, it’s likely that you can use {dm} too. This includes local data frames, relational database systems, and many more.

Data preparation

A battery of utilities helps with creating a tidy relational data model.

  • Splitting and rejoining tables
  • Determining key candidates
  • Checking keys and cardinalities

Example

A readymade dm object with preset keys is included in the package:

library(dm)

cdm_nycflights13()
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <package: nycflights13>
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   3 primary keys
#>   4 references
#> ── Rows ───────────────────────────────────────────────────────────────────
#> Total: 367687
#> airlines: 16, airports: 1458, flights: 336776, planes: 3322, weather: 26115

The cdm_draw() function creates a visualization of the entity relationship model:

cdm_nycflights13(cycle = TRUE) %>% 
  cdm_draw()

Filtering and joining

Similarly to dplyr::filter(), a filtering function cdm_filter() is available for dm objects. You need to provide the dm object, the table whose rows you want to filter, and the filter expression. A dm object is returned whose tables only contain rows that are related to the reduced rows in the filtered table. This currently only works for cycle-free relationships between the tables.

cdm_nycflights13(cycle = FALSE) %>%
  cdm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>       16     1458   336776     3322    26115

cdm_nycflights13(cycle = FALSE) %>% 
  cdm_filter(planes, year == 2000, manufacturer == "BOEING") %>%
  cdm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>        4        3     7301      134    26115

For joining two tables using their relationship defined in the dm, you can use cdm_join_tbl():

cdm_nycflights13(cycle = FALSE) %>%
  cdm_join_tbl(airports, flights, join = semi_join)
#> # A tibble: 3 x 8
#>   faa   name                  lat   lon   alt    tz dst   tzone           
#>   <chr> <chr>               <dbl> <dbl> <int> <dbl> <chr> <chr>           
#> 1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
#> 2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York
#> 3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York

In our dm, the origin column of the flights table points to the airports table. Since all nycflights13-flights depart from New York, only these airports are included in the semi-join.

From and to databases

In order to transfer an existing dm object to a DB, you can call cdm_copy_to() with the target DB and the dm object:

src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src:  sqlite 3.25.3 [:memory:]
#> tbls:
nycflights13_remote <- cdm_copy_to(src_sqlite, cdm_nycflights13(cycle = TRUE))
nycflights13_remote
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  sqlite 3.25.3 [:memory:]
#> ── Data model ─────────────────────────────────────────────────────────────
#> Data model object:
#>   5 tables:  airlines, airports, flights, planes ... 
#>   53 columns
#>   3 primary keys
#>   4 references
#> ── Rows ───────────────────────────────────────────────────────────────────
#> Total: 367687
#> airlines: 16, airports: 1458, flights: 336776, planes: 3322, weather: 26115

The key constraints from the original object are also copied to the newly created object. With the default setting set_key_constraints = TRUE for cdm_copy_to(), key constraints are also established on the target DB. Currently this feature is only supported for MSSQL and Postgres database management systems (DBMS).

It is also possible to automatically create a dm object from the permanent tables of a DB. Again, for now just MSSQL and Postgres are supported for this feature, so the next chunk is not evaluated. The support for other DBMS will be implemented in a future update.

src_postgres <- src_postgres()
nycflights13_from_remote <- cdm_learn_from_db(src_postgres)

More information

If you would like to learn more about {dm}, the Intro article is a good place to start. Further resources:

Standing on the shoulders of giants

This package follows the tidyverse principles:

  • dm objects are immutable (your data will never be overwritten in place)
  • many functions used on dm objects are pipeable (i.e., return new dm objects)
  • tidy evaluation is used (unquoted function parameters are supported)

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!

The {polyply} package has a similar intent with a slightly different interface.

The {data.cube} package has quite the same intent using array-like interface.

Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.

The {tidygraph} package stores a network as two related tables of nodes and edges, compatible with {dplyr} workflows.

In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.

Installation

Once on CRAN, the package can be installed with

install.packages("dm")

Install the latest development version with

# install.packages("devtools")
devtools::install_github("krlmlr/dm")

License: MIT © cynkra GmbH.

Funded by:

energie360° cynkra


Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.