Building Summary Statistics Tables with `modelsummary` and `gtsummary`

modelsummary and gtsummary are two excellent R packages for building summary statistics tables. However, their syntax might not be fully intuitive if you are coming from Stata. Here are a few examples using these two packages.

First, let’s load the following packages and load our data:

library(tidyverse)
library(gtsummary)
library(modelsummary)
library(haven)

census <- read_dta("http://www.stata-press.com/data/r9/census.dta") %>%
  # Create dummy treatment
  mutate(
    rand = runif(n()),
    treatment = as.numeric(rand > 0.5)
  )

I’m using the census .dta file, a dataset that Stata users are likely to be familiar with.
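Because runif() draws new numbers on every run, the dummy treatment assignment changes each time the script is executed. Here is a minimal sketch of one way to make it reproducible, assuming an arbitrary seed value (1234) and a hypothetical toy data frame standing in for census:

```r
# Fix the random number generator state; 1234 is an arbitrary choice
set.seed(1234)

# Hypothetical stand-in for the census data
toy <- data.frame(state = c("A", "B", "C", "D", "E"))

# Same logic as the treatment dummy, written in base R
toy$rand <- runif(nrow(toy))
toy$treatment <- as.numeric(toy$rand > 0.5)

toy
```

Running the block again with the same seed should give the same rand and treatment columns.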

Model Summary

With modelsummary, there are two approaches: (1) a quick data summary, and (2) a more elaborate, customized one. For the former, we use the datasummary_skim() function as follows:

datasummary_skim(census)

	Unique (#)	Mean	SD	Min	Median	Max
region	4	2.7	1.1	1.0	3.0	4.0
pop	50	4518149.4	4715037.8	401851.0	3066433.0	23667902.0
poplt5	50	326277.8	331585.1	35998.0	227467.5	1708400.0
pop5_17	50	945951.6	959372.8	91796.0	629654.0	4680558.0
pop18p	50	3245920.1	3430531.3	271106.0	2175130.0	17278944.0
pop65p	50	509502.8	538932.4	11547.0	370495.0	2414250.0
popurban	50	3328253.2	4090177.9	172735.0	2156905.0	21607606.0
medage	37	29.5	1.7	24.2	29.8	34.7
death	50	39474.3	41742.3	1604.0	26176.5	186428.0
marriage	50	47701.4	45130.4	4437.0	36279.0	210864.0
divorce	50	23679.4	25094.0	2142.0	17112.5	133541.0
rand	50	0.5	0.3	0.0	0.5	1.0
treatment	2	0.5	0.5	0.0	0.0	1.0

If we want to summarize only a few variables, we can pass them to select() before calling datasummary_skim(), or create a new object with only the variables we need.

census %>%
  select(pop, death, marriage, divorce) %>%
  datasummary_skim()

	Unique (#)	Mean	SD	Min	Median	Max
pop	50	4518149.4	4715037.8	401851.0	3066433.0	23667902.0
death	50	39474.3	41742.3	1604.0	26176.5	186428.0
marriage	50	47701.4	45130.4	4437.0	36279.0	210864.0
divorce	50	23679.4	25094.0	2142.0	17112.5	133541.0

In addition, we can tell the function to report summary statistics only for numeric or only for categorical variables, for example:

datasummary_skim(census, type = "numeric")
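The categorical counterpart can be sketched in the same way. The example below uses a hypothetical toy data frame (not the census data) in which region is stored as a factor, since categorical summaries need factor, character, or logical columns. In the census data, region is imported by haven as a labelled numeric, so it would probably need a conversion such as haven::as_factor() first; that step is an assumption and is not shown here.

```r
library(modelsummary)

# Hypothetical toy data; `region` is a factor so it counts as categorical
toy <- data.frame(
  region = factor(c("NE", "South", "West", "South", "West", "West")),
  pop    = c(10, 20, 30, 40, 50, 60)
)

# Summarize only the categorical variables (here: region)
datasummary_skim(toy, type = "categorical")
```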

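If we would like to choose the statistics ourselves (here N, mean, SD, median, min, and max) instead of taking everything that datasummary_skim() reports, we can use datasummary() with a two-sided formula: variables go on the left-hand side and statistics on the right-hand side.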

build <- pop + death + marriage + divorce ~ N + Mean + SD + Median + Min + Max 

## Without labels

datasummary(
  build,
  data = census
)

	N	Mean	SD	Median	Min	Max
pop	50	4518149.44	4715037.75	3066433.00	401851.00	23667902.00
death	50	39474.26	41742.35	26176.50	1604.00	186428.00
marriage	50	47701.40	45130.42	36279.00	4437.00	210864.00
divorce	50	23679.44	25094.01	17112.50	2142.00	133541.00
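The same formula interface can also nest statistics within groups, which is handy for a treatment/control comparison. The sketch below uses a hypothetical toy data frame rather than census, and it assumes the grouping variable is already stored as a factor:

```r
library(modelsummary)

# Hypothetical toy data standing in for `census`
toy <- data.frame(
  pop       = c(100, 250, 300, 120, 410, 90),
  death     = c(10, 20, 35, 12, 40, 8),
  treatment = factor(c(0, 1, 0, 1, 0, 1), labels = c("Control", "Treated"))
)

# Rows: variables. Columns: Mean and SD nested within each treatment group.
datasummary(pop + death ~ treatment * (Mean + SD), data = toy)
```

With the census data, the numeric treatment dummy would first need to be converted, for example with mutate(treatment = factor(treatment)).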

To display variable labels instead of variable names, we need to rename the variables:

## With labels

build <- `Population` + `Number of deaths` + `Number of marriages` + `Number of divorces` ~ N + Mean + SD + Median + Min + Max 

datasummary(
  build,
  data = census %>% 
    rename(`Population` = pop, `Number of deaths` = death, `Number of marriages` = marriage, `Number of divorces` = divorce)
)

	N	Mean	SD	Median	Min	Max
Population	50	4518149.44	4715037.75	3066433.00	401851.00	23667902.00
Number of deaths	50	39474.26	41742.35	26176.50	1604.00	186428.00
Number of marriages	50	47701.40	45130.42	36279.00	4437.00	210864.00
Number of divorces	50	23679.44	25094.01	17112.50	2142.00	133541.00
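As an alternative to renaming the columns in the data, the datasummary() formula accepts a (`New name` = variable) form that relabels a row inside the formula itself. The sketch below uses a hypothetical toy data frame in place of census:

```r
library(modelsummary)

# Hypothetical toy data standing in for `census`
toy <- data.frame(
  pop   = c(100, 250, 300, 120, 410, 90),
  death = c(10, 20, 35, 12, 40, 8)
)

# Relabel each variable directly in the formula; the data stay untouched
datasummary(
  (`Population` = pop) + (`Number of deaths` = death) ~ N + Mean + SD,
  data = toy
)
```

This keeps the original column names available for any later code that relies on them.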

Finally, we can use the output argument to export our table to several file formats.

build <- pop + death + marriage + divorce ~ N + Mean + SD + Median + Min + Max 

datasummary(
  build,
  data = census,
  output = "latex"
)

In the case of LaTeX, your output will look like this, and you can use the \input command in your LaTeX document to add the table to your reports or working papers:

\begin{table}
\centering
\begin{tabular}[t]{lrrrrrr}
\toprule
  & N & Mean & SD & Median & Min & Max\\
\midrule
pop & 50 & \num{4518149.44} & \num{4715037.75} & \num{3066433.00} & \num{401851.00} & \num{23667902.00}\\
death & 50 & \num{39474.26} & \num{41742.35} & \num{26176.50} & \num{1604.00} & \num{186428.00}\\
marriage & 50 & \num{47701.40} & \num{45130.42} & \num{36279.00} & \num{4437.00} & \num{210864.00}\\
divorce & 50 & \num{23679.44} & \num{25094.01} & \num{17112.50} & \num{2142.00} & \num{133541.00}\\
\bottomrule
\end{tabular}
\end{table}
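To use \input, the table has to live in a file. One way to do that, sketched here with a hypothetical toy data frame and a temporary file path, is to pass a path ending in .tex to the output argument, in which case modelsummary infers the format from the extension:

```r
library(modelsummary)

# Hypothetical toy data standing in for `census`
toy <- data.frame(
  pop   = c(100, 250, 300, 120, 410, 90),
  death = c(10, 20, 35, 12, 40, 8)
)

# Hypothetical file location; in a real project this would sit next to the paper
tex_file <- file.path(tempdir(), "sumstats.tex")

# The .tex extension tells datasummary() to write LaTeX to that file
datasummary(pop + death ~ N + Mean + SD, data = toy, output = tex_file)
```

The LaTeX shown above relies on \toprule and \midrule from the booktabs package and on \num{} from siunitx, so a document that includes it with \input{sumstats.tex} needs both packages in its preamble. Other modelsummary versions may emit different markup, so it is worth checking the generated file.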

Check the official vignette for more examples.

GT Summary

gtsummary is another package that can be used for basic and complex summary statistics. Its syntax follows the gt family. For a basic summary statistics table, we can use the tbl_summary() function as follows:

vars <- c("pop", "death", "marriage", "divorce")

tab1 <- census %>% 
  select(all_of(vars)) %>% 
  tbl_summary()
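tbl_summary() picks up variable labels stored in the data (as is the case for a Stata file read with haven). When the data have no labels, they can be supplied through the label argument, and the reported statistic can be changed with statistic. Here is a minimal sketch on a hypothetical toy data frame:

```r
library(gtsummary)

# Hypothetical toy data. Each column has 12 distinct values because
# tbl_summary() treats numeric variables with fewer than 10 unique
# values as categorical by default.
toy <- data.frame(
  pop   = seq(100, 1200, by = 100),
  death = seq(5, 60, by = 5)
)

tab <- tbl_summary(
  toy,
  # Row labels shown in the table instead of the column names
  label = list(pop ~ "Population", death ~ "Number of deaths"),
  # Report mean (SD) instead of the default median (IQR)
  statistic = all_continuous() ~ "{mean} ({sd})"
)

tab
```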

To split the table by the treatment variable and add p-values for the difference between groups:

tab2 <- census %>%
  select(all_of(vars), treatment) %>%
  tbl_summary(by = treatment) %>%
  add_p()
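add_p() chooses a default test based on the variable type; for continuous variables compared across two groups this is a rank-based test. An economics-style balance table often reports a t-test instead, which can be requested through the test argument. The sketch below uses a hypothetical toy data frame and may require helper packages (such as broom) to be installed:

```r
library(gtsummary)

# Hypothetical toy data: 12 distinct values so `pop` is treated as continuous
toy <- data.frame(
  pop       = seq(100, 1200, by = 100),
  treatment = rep(c(0, 1), times = 6)
)

# Summary by treatment group, reporting mean (SD)
tab_by <- tbl_summary(
  toy,
  by = treatment,
  statistic = all_continuous() ~ "{mean} ({sd})"
)

# Request a t-test for continuous variables instead of the default test
tab <- add_p(tab_by, test = all_continuous() ~ "t.test")

tab
```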

For a table closer to the descriptive statistics found in economics papers, with one column per statistic, we can build one tbl_summary() per column from a named vector of statistics and then merge the results.

cols <- c(N = "{N_nonmiss}", Mean = "{mean} ({sd})", Median = "{median}", Min = "{min}", Max = "{max}")
tab3 <- cols %>%
  # Loop over the columns: .x is the statistic pattern, .y is the column name
  imap(
    ~ census %>%
      # Select the variables we need in our table
      select(all_of(vars)) %>%
      tbl_summary(
        statistic = all_continuous() ~ .x
      ) %>%
      # Set the column headers
      modify_header(stat_0 ~ str_glue("{.y}"), label ~ "Variables")
  ) %>%
  # Merge the single-column tables into one table
  tbl_merge() %>%
  # Remove spanning headers and footnotes
  modify_spanning_header(everything() ~ NA) %>%
  modify_footnote(everything() ~ NA)
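The key to the code above is purrr's imap(), which calls a function once per element of a named vector, passing the element as .x and its name as .y. This small sketch isolates that behavior with a shortened version of the cols vector and a plain string in place of a table:

```r
library(purrr)

# Shortened version of the `cols` vector
cols <- c(N = "{N_nonmiss}", Mean = "{mean} ({sd})")

# .x is the element (the statistic pattern), .y is its name (the column header)
res <- imap(cols, ~ paste0("header '", .y, "' uses statistic '", .x, "'"))

res
```

The result is a named list with one entry per column, which is the shape tbl_merge() expects when each entry is a gtsummary table.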

Finally, we can use the as_kable_extra() function to export our table to LaTeX:

tab3 %>%
  as_kable_extra(
    format = "latex", 
    booktabs = TRUE, 
    linesep = ""
  )

That gives you the following LaTeX code:

\begin{tabular}{lccccc}
\toprule
Variables & N & Mean & Median & Min & Max\\
\midrule
Population & 50 & 4,518,149 (4,715,038) & 3,066,433 & 401,851 & 23,667,902\\
Number of deaths & 50 & 39,474 (41,742) & 26,176 & 1,604 & 186,428\\
Number of marriages & 50 & 47,701 (45,130) & 36,279 & 4,437 & 210,864\\
Number of divorces & 50 & 23,679 (25,094) & 17,112 & 2,142 & 133,541\\
\bottomrule
\end{tabular}

Check the official vignette for more examples.

PDF Example

A compiled PDF example of some of the tables created here can be found in the following repository:

RRMaximiliano/r-latex-tables-sum-stats
