{dm} is an R package that provides tools for working with multiple related tables, stored as data frames or in a relational database.
Contents
- Background explains the philosophy behind the package and the problem that it tries to solve
- Example illustrates the general concept with a simple example
- Features gives an overview of the scope of the package
- Installation describes how to install the package
- More information lists more detailed documentation sources
- Standing on the shoulders of giants credits related work
Relational databases and flat tables, like data frames or spreadsheets, present data in fundamentally different ways.
In data frames and spreadsheets, all data is presented together in one large table with many rows and columns. This means that the data is accessible in one location but has the disadvantage that the same values may be repeated multiple times, resulting in bloated tables with redundant data. In the worst case scenario, a data frame may have many rows and columns but only a single value different in each row.
Relational databases, on the other hand, do not keep all data together but split it into multiple smaller tables. That separation into sub-tables has several advantages:
- all information is stored only once, avoiding repetition and conserving memory
- all information is updated only once and in one place, improving consistency and avoiding errors that may result from updating the same value in multiple locations
- all information is organized by topic and segmented into smaller tables that are easier to handle
Separation of data, thus, helps with data quality, and explains the continuing popularity of relational databases in production-level data management.
The downside of this approach is that it is harder to merge together information from different data sources and to identify which entities refer to the same object, a common task when modelling or plotting data. To be mapped uniquely, the entities would need to be designated as keys, and the separate tables collated together through a process called joining.
In R, there already exist packages that support handling inter-linked tables but the code is complex and requires multiple command sequences. The goal of the {dm} package is to simplify the data management processes in R while keeping the advantages of relational data models and the core concept of splitting one table into multiple tables. In this way, you can have the best of both worlds: manage your data as a collection of linked tables, then flatten multiple tables into one for an analysis with {dplyr} or other packages, on an as-needed basis.
Although {dm} is built upon relational data models, it is not a database itself. It can work transparently with both relational database systems and in-memory data, and copy data from and to databases.
As an example, consider the
nycflights13
dataset about
the flights that departed New York City airports in 2013. The dataset
contains five tables: the main flights
table with links to the
airlines
, planes
and airports
tables, and the weather
table
without explicit links.
Assume that your task is to merge all tables, except the weather
table.
In the classical {dplyr} notation, you
would need three left_join()
calls to merge the flights
table
gradually to the airlines
, planes
and airports
tables to create
one wide data frame:
library(tidyverse)
library(nycflights13)
flights %>%
left_join(airlines, by = "carrier") %>%
left_join(planes, by = "tailnum") %>%
left_join(airports, by = c("origin" = "faa"))
#> # A tibble: 336,776 x 35
#> year.x month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
#> 7 2013 1 1 555 600 -5 913
#> 8 2013 1 1 557 600 -3 709
#> 9 2013 1 1 557 600 -3 838
#> 10 2013 1 1 558 600 -2 753
#> # … with 336,766 more rows, and 28 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, name.x <chr>, year.y <int>,
#> # type <chr>, manufacturer <chr>, model <chr>, engines <int>,
#> # seats <int>, speed <int>, engine <chr>, name.y <chr>, lat <dbl>,
#> # lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
With the {dm} package, you would create a dm
object.
After that you would be able to use the links between the tables as
often as you wish, without explicitly referring to the relations ever
again.
For the example data set, you can use dm_nycflights13()
to create the
dm
object, and a single command for merging the tables. The task of
joining the four flights
, airlines
, planes
and airports
tables
then boils down to:
library(dm)
dm_nycflights13() %>%
dm_flatten_to_tbl(start = flights)
#> Renamed columns:
#> * year -> flights$flights.year, planes$planes.year
#> * name -> airlines$airlines.name, airports$airports.name
#> # A tibble: 11,227 x 35
#> flights.year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # … with 11,217 more rows, and 28 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, airlines.name <chr>,
#> # airports.name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>,
#> # dst <chr>, tzone <chr>, planes.year <int>, type <chr>,
#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> # speed <int>, engine <chr>
The example data model for {nycflights13} is integrated in {dm} and
defines primary and foreign keys to identify the common points between
the tables. For data other than the example data, the dm
object would
need to be created by using the dm()
constructor and by adding keys
using dm_add_pk()
and dm_add_fk()
.
The {dm} package helps with the challenges that arise with working with relational data models.
The dm
class manages several related tables. It stores both the
data and the metadata in a compound object.
- data: a table source storing all tables
- metadata: table names, column names, primary and foreign keys
This concept helps separating the join logic from the code: declare your relationships once, as part of your data, then use them in your code without repeating yourself.
Various operations on dm
objects are implemented. They either affect
the data (e.g., a filter), or the metadata (e.g., definition of keys),
or both (e.g., creation of a new table).
The {dm} package augments {dplyr}/{dbplyr} workflows. Generally, if you can use {dplyr} on your data, it’s likely that you can use {dm} too. This includes local data frames, relational database systems, and many more.
A battery of utilities helps with creating a tidy relational data model.
- Splitting and rejoining tables
- Determining key candidates
- Checking keys and cardinalities
Similarly to dplyr::filter()
, a filtering function dm_filter()
is
available for dm
objects. You need to provide the dm
object, the
table whose rows you want to filter, and the filter expression. The
actual effect of the filtering will only be realized once you use
dm_apply_filters
. Before that, the filter conditions are merely stored
within the dm
. After using dm_apply_filters()
a dm
object is
returned whose tables only contain rows that are related to the reduced
rows in the filtered table. This currently only works for cycle-free
relationships between the tables.
dm_nycflights13(cycle = FALSE) %>%
dm_get_tables() %>%
map_int(nrow)
#> airlines airports flights planes weather
#> 16 1458 11227 3322 26115
dm_nycflights13(cycle = FALSE) %>%
dm_filter(planes, year == 2000, manufacturer == "BOEING") %>%
dm_apply_filters() %>%
dm_get_tables() %>%
map_int(nrow)
#> airlines airports flights planes weather
#> 4 3 220 134 26115
For joining two tables using their relationship defined in the dm
, you
can use dm_join_to_tbl()
:
dm_nycflights13(cycle = FALSE) %>%
dm_join_to_tbl(airports, flights, join = semi_join)
#> # A tibble: 11,227 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # … with 11,217 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
In our dm
, the origin
column of the flights
table points to the
airports
table. Since all nycflights13
-flights depart from New York,
only these airports are included in the semi-join.
In order to transfer an existing dm
object to a DB, you can call
dm_copy_to()
with the target DB and the dm
object:
src_sqlite <- src_sqlite(":memory:", create = TRUE)
src_sqlite
#> src: sqlite 3.30.1 [:memory:]
#> tbls:
nycflights13_remote <- copy_dm_to(src_sqlite, dm_nycflights13(cycle = TRUE))
nycflights13_remote
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.30.1 [:memory:]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 3
#> Foreign keys: 4
The key constraints from the original object are also copied to the
newly created object. With the default setting set_key_constraints = TRUE
for dm_copy_to()
, key constraints are also established on the
target DB. Currently this feature is only supported for MSSQL and
Postgres database management systems (DBMS).
It is also possible to automatically create a dm
object from the
permanent tables of a DB. Again, for now just MSSQL and Postgres are
supported for this feature, so the next chunk is not evaluated. The
support for other DBMS will be implemented in a future update.
src_postgres <- src_postgres()
nycflights13_from_remote <- dm_learn_from_db(src_postgres)
The latest (development) version of {dm} can be installed from GitHub.
# install.packages("devtools")
devtools::install_github("krlmlr/dm")
The {dm} package will also be made available on CRAN, from where it can be installed with the command
install.packages("dm")
For an introduction into relational data models and to jump-start working with {dm}, please see the article “Introduction to Relational Data Models”.
If you would like to learn more about {dm}, the Intro article is a good place to start.
Further resources:
- Function reference
- Introduction to Relational Data Models article
- Joining article
- Filtering article
- Class ‘dm’ and basic operations article
- Visualizing ‘dm’ objects article
- Low-level operations article
The {dm} package follows the tidyverse principles:
dm
objects are immutable (your data will never be overwritten in place)- many functions used on
dm
objects are pipeable (i.e., return newdm
objects) - tidy evaluation is used (unquoted function parameters are supported)
The {dm} package builds heavily upon the {datamodelr} package, and upon the tidyverse. We’re looking forward to a good collaboration!
The {polyply} package has a similar intent with a slightly different interface.
The {data.cube} package has
quite the same intent using array
-like interface.
Articles in the {rquery} package discuss join controllers and join dependency sorting, with the intent to move the declaration of table relationships from code to data.
The {tidygraph} package stores
a network as two related tables of nodes
and edges
, compatible with
{dplyr} workflows.
In object-oriented programming languages, object-relational mapping is a similar concept that attempts to map a set of related tables to a class hierarchy.
License: MIT © cynkra GmbH.
Funded by:
Please note that the ‘dm’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.