/dm

Relational data models

Primary LanguageROtherNOASSERTION

Lifecycle: maturing Travis build status Codecov test coverage CRAN status Launch rstudio.cloud

dm

{dm} is an R package that provides tools for working with multiple related tables, stored as data frames or in a relational database.

Contents

Background

Relational databases and flat tables, like data frames or spreadsheets, present data in fundamentally different ways.

In data frames and spreadsheets, all data is presented together in one large table with many rows and columns. This means that the data is accessible in one location but has the disadvantage that the same values may be repeated multiple times, resulting in bloated tables with redundant data. In the worst case scenario, a data frame may have many rows and columns but only a single value different in each row.

Relational databases, on the other hand, do not keep all data together but split it into multiple smaller tables. That separation into sub-tables has several advantages:

  • all information is stored only once, avoiding repetition and conserving memory
  • all information is updated only once and in one place, improving consistency and avoiding errors that may result from updating the same value in multiple locations
  • all information is organized by topic and segmented into smaller tables that are easier to handle

Separation of data, thus, helps with data quality, and explains the continuing popularity of relational databases in production-level data management.

The downside of this approach is that it is harder to merge together information from different data sources and to identify which entities refer to the same object, a common task when modelling or plotting data. To be mapped uniquely, the entities would need to be designated as keys, and the separate tables collated together through a process called joining.

In R, there already exist packages that support handling inter-linked tables but the code is complex and requires multiple command sequences. The goal of the {dm} package is to simplify the data management processes in R while keeping the advantages of relational data models and the core concept of splitting one table into multiple tables. In this way, you can have the best of both worlds: manage your data as a collection of linked tables, then flatten multiple tables into one for an analysis with {dplyr} or other packages, on an as-needed basis.

Although {dm} is built upon relational data models, it is not a database itself. It can work transparently with both relational database systems and in-memory data, and copy data from and to databases.

Example

As an example, consider the nycflights13 dataset about the flights that departed New York City airports in 2013. The dataset contains five tables: the main flights table with links to the airlines, planes and airports tables, and the weather table without explicit links.

Assume that your task is to merge all tables, except the weather table.

In the classical {dplyr} notation, you would need three left_join() calls to merge the flights table gradually to the airlines, planes and airports tables to create one wide data frame:

library(tidyverse)
library(nycflights13)

flights %>%
  left_join(airlines, by = "carrier") %>%
  left_join(planes, by = "tailnum") %>%
  left_join(airports, by = c("origin" = "faa"))
#> # A tibble: 336,776 x 35
#>    year.x month   day dep_time sched_dep_time dep_delay arr_time
#>     <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1   2013     1     1      517            515         2      830
#>  2   2013     1     1      533            529         4      850
#>  3   2013     1     1      542            540         2      923
#>  4   2013     1     1      544            545        -1     1004
#>  5   2013     1     1      554            600        -6      812
#>  6   2013     1     1      554            558        -4      740
#>  7   2013     1     1      555            600        -5      913
#>  8   2013     1     1      557            600        -3      709
#>  9   2013     1     1      557            600        -3      838
#> 10   2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 28 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name.x <chr>, year.y <int>,
#> #   type <chr>, manufacturer <chr>, model <chr>, engines <int>,
#> #   seats <int>, speed <int>, engine <chr>, name.y <chr>, lat <dbl>,
#> #   lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

With the {dm} package, you would create a dm object. After that you would be able to use the links between the tables as often as you wish, without explicitly referring to the relations ever again.

For the example data set, you can use dm_nycflights13() to create the dm object, and a single command for merging the tables. The task of joining the four flights, airlines, planes and airports tables then boils down to:

library(dm)

dm_nycflights13() %>%
  dm_flatten_to_tbl(start = flights)
#> Renamed columns:
#> * year -> flights$flights.year, planes$planes.year
#> * name -> airlines$airlines.name, airports$airports.name
#> # A tibble: 11,227 x 35
#>    flights.year month   day dep_time sched_dep_time dep_delay arr_time
#>           <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1         2013     1    10        3           2359         4      426
#>  2         2013     1    10       16           2359        17      447
#>  3         2013     1    10      450            500       -10      634
#>  4         2013     1    10      520            525        -5      813
#>  5         2013     1    10      530            530         0      824
#>  6         2013     1    10      531            540        -9      832
#>  7         2013     1    10      535            540        -5     1015
#>  8         2013     1    10      546            600       -14      645
#>  9         2013     1    10      549            600       -11      652
#> 10         2013     1    10      550            600       -10      649
#> # … with 11,217 more rows, and 28 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, airlines.name <chr>,
#> #   airports.name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>,
#> #   dst <chr>, tzone <chr>, planes.year <int>, type <chr>,
#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> #   speed <int>, engine <chr>

The example data model for {nycflights13} is integrated in {dm} and defines primary and foreign keys to identify the common points between the tables. For data other than the example data, the dm object would need to be created by using the dm() constructor and by adding keys using dm_add_pk() and dm_add_fk().

Features

The {dm} package helps with the challenges that arise with working with relational data models.

Compound object

The dm class manages several related tables. It stores both the data and the metadata in a compound object.

  • data: a table source storing all tables
  • metadata: table names, column names, primary and foreign keys

This concept helps separating the join logic from the code: declare your relationships once, as part of your data, then use them in your code without repeating yourself.

Various operations on dm objects are implemented. They either affect the data (e.g., a filter), or the metadata (e.g., definition of keys), or both (e.g., creation of a new table).

Storage agnostic

The {dm} package augments {dplyr}/{dbplyr} workflows. Generally, if you can use {dplyr} on your data, it’s likely that you can use {dm} too. This includes local data frames, relational database systems, and many more.

Data preparation

A battery of utilities helps with creating a tidy relational data model.

  • Splitting and rejoining tables
  • Determining key candidates
  • Checking keys and cardinalities

Filtering and joining

Similarly to dplyr::filter(), a filtering function dm_filter() is available for dm objects. You need to provide the dm object, the table whose rows you want to filter, and the filter expression. The actual effect of the filtering will only be realized once you use dm_apply_filters. Before that, the filter conditions are merely stored within the dm. After using dm_apply_filters() a dm object is returned whose tables only contain rows that are related to the reduced rows in the filtered table. This currently only works for cycle-free relationships between the tables.

dm_nycflights13(cycle = FALSE) %>%
  dm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>       16     1458    11227     3322    26115

dm_nycflights13(cycle = FALSE) %>%
  dm_filter(planes, year == 2000, manufacturer == "BOEING") %>%
  dm_apply_filters() %>%
  dm_get_tables() %>%
  map_int(nrow)
#> airlines airports  flights   planes  weather 
#>        4        3      220      134    26115

For joining two tables using their relationship defined in the dm, you can use dm_join_to_tbl():

dm_nycflights13(cycle = FALSE) %>%
  dm_join_to_tbl(airports, flights, join = semi_join)
#> # A tibble: 11,227 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1    10        3           2359         4      426
#>  2  2013     1    10       16           2359        17      447
#>  3  2013     1    10      450            500       -10      634
#>  4  2013     1    10      520            525        -5      813
#>  5  2013     1    10      530            530         0      824
#>  6  2013     1    10      531            540        -9      832
#>  7  2013     1    10      535            540        -5     1015
#>  8  2013     1    10      546            600       -14      645
#>  9  2013     1    10      549            600       -11      652
#> 10  2013     1    10      550            600       -10      649
#> # … with 11,217 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

In our dm, the origin column of the flights table points to the airports table. Since all nycflights13-flights depart from New York, only these airports are included in the semi-join.

From and to databases

In order to transfer an existing dm object to a DB, you can call dm_copy_to() with the target DB and the dm object:

src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src:  sqlite 3.30.1 [:memory:]
#> tbls:
nycflights13_remote <- copy_dm_to(src_sqlite, dm_nycflights13(cycle = TRUE))
nycflights13_remote
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  sqlite 3.30.1 [:memory:]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 4

The key constraints from the original object are also copied to the newly created object. With the default setting set_key_constraints = TRUE for dm_copy_to(), key constraints are also established on the target DB. Currently this feature is only supported for MSSQL and Postgres database management systems (DBMS).

It is also possible to automatically create a dm object from the permanent tables of a DB. Again, for now just MSSQL and Postgres are supported for this feature, so the next chunk is not evaluated. The support for other DBMS will be implemented in a future update.

src_postgres <- src_postgres()
nycflights13_from_remote <- dm_learn_from_db(src_postgres)

Installation

The latest (development) version of {dm} can be installed from GitHub.

# install.packages("devtools")
devtools::install_github("krlmlr/dm")

The {dm} package will also be made available on CRAN, from where it can be installed with the command

install.packages("dm")

More information

For an introduction into relational data models and to jump-start working with {dm}, please see the article “Introduction to Relational Data Models”.

If you would like to learn more about {dm}, the Intro article is a good place to start.

Further resources:

Standing on the shoulders of giants

The {dm} package follows the tidyverse principles:

  • dm objects are immutable (your data will never be overwritten in place)
  • many functions used on dm objects are pipeable (i.e., return new dm objects)
  • tidy evaluation is used (unquoted function parameters are supported)

The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!

The {polyply} package has a similar intent with a slightly different interface.

The {data.cube} package has quite the same intent using array-like interface.

Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.

The {tidygraph} package stores a network as two related tables of nodes and edges, compatible with {dplyr} workflows.

In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.


License: MIT © cynkra GmbH.

Funded by:

energie360° cynkra


Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.