Multiple Systems Estimation Using Decomposable Graphical Models. This is an efficient re-implementation and extension of the dga R package.

🎯 dgaFast: Multiple Systems Estimation Using Decomposable Graphical Models

Efficient re-implementation and extension of the dga package of James Johndrow, Kristian Lum and Patrick Ball (2015): "Performs capture-recapture estimation by averaging over decomposable graphical models. This approach builds on Madigan and York (1997)."

  • The speed-up matters for simulation studies and for linkage-averaging (Sadinle, 2018), which accounts for record linkage errors by repeating the estimation over many plausible linkages.

  • Further plotting and posterior summarization functions have been added (bayesEstimator, posteriorMode, posteriorQuantiles, posteriorSummaryTable, adjMatrix, plotGraph, htmlSummary, latexSummary).

Note: the stratification functions and Venn diagram plotting functions from the dga package have not been reproduced in dgaFast. They can be accessed through install.packages("dga"); library(dga).

Example usage

Five lists example from Madigan and York (1997) as implemented in the dga package:

library(dgaFast) # Re-implements library(dga)

# Number of lists and prior hyperparameter
p <- 5
data(graphs5) # Decomposable graphical models on 5 lists.
delta <- 0.5
Nmissing <- 1:300 # Reasonable range for the number of unobserved individuals.

# Counts corresponding to list inclusion patterns.
Y <- c(0,27,37,19,4,4,1,1,97,22,37,25,2,1,3,5,83,36,34,18,3,5,0,2,30,5,23,8,0,3,0,2)
Y <- array(Y, dim=c(2,2,2,2,2))
N <- sum(Y) + Nmissing

# Model-wise posterior probabilities on the total population size.
# weights[i, j] is the posterior probability of Nmissing[j] missing individuals under model graphs5[[i]].
weights <- bma.cr(Y, Nmissing, delta, graphs5)

# Plot of the posterior distribution.
plotPosteriorN(weights, N)
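
The weights matrix can be collapsed into a single model-averaged posterior for the population size by summing over models. The sketch below uses a small made-up matrix (3 models, 4 values of Nmissing) in place of real bma.cr output. It assumes that rows index models, columns index Nmissing, and all entries sum to one. The names toyWeights, nObserved and posteriorModeN are illustrative and not part of dgaFast.

```r
# Toy stand-in for the output of bma.cr(): rows are models, columns are
# candidate numbers of missing individuals. The values are made up.
Nmissing <- 1:4
toyWeights <- matrix(c(0.10, 0.20, 0.10, 0.05,
                       0.05, 0.15, 0.10, 0.05,
                       0.02, 0.08, 0.06, 0.04),
                     nrow = 3, byrow = TRUE)
stopifnot(isTRUE(all.equal(sum(toyWeights), 1)))  # assumed normalization

nObserved <- 537                 # sum(Y) for the five-list counts above
N <- nObserved + Nmissing

# Model-averaged posterior of N: sum the joint weights over models.
posteriorN <- colSums(toyWeights)

# Posterior model probabilities: sum the joint weights over Nmissing.
modelProb <- rowSums(toyWeights)

# Point summaries of the model-averaged posterior.
posteriorMean <- sum(N * posteriorN)
posteriorModeN <- N[which.max(posteriorN)]
```

With real output, replacing toyWeights by weights and nObserved by sum(Y) gives the same quantities for the five-list example, provided the normalization assumption holds.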

Table of top model estimates (see also dgaFast::latexSummary).

htmlSummary("./figures/posteriorSummary/summaryTable", weights, N, nrows=5, graphs=graphs5)
Model    Posterior Prob.  Bayes est.  Mode  0.025  0.975
[graph]  0.217            627         624   598    663
[graph]  0.160            615         614   591    645
[graph]  0.082            613         610   586    648
[graph]  0.065            610         608   585    640
[graph]  0.052            616         614   591    647
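
Each row of the table summarizes the posterior of N under one model. A rough base-R sketch of how such a row could be computed from one row of the weights matrix follows. It is not the dgaFast implementation: summarizeModel is a hypothetical helper, the toy inputs are made up, and "Bayes est." is assumed here to mean the posterior mean.

```r
# Hypothetical helper: summarize one model from its row of joint weights.
# w: joint posterior weights of (model, N) for a single model
# N: population sizes matching the entries of w
summarizeModel <- function(w, N, probs = c(0.025, 0.975)) {
  modelProb <- sum(w)        # posterior probability of the model
  cond <- w / modelProb      # posterior of N given the model
  cdf <- cumsum(cond)
  # Smallest N whose cumulative probability reaches each level.
  quantiles <- sapply(probs, function(p) N[which(cdf >= p)[1]])
  c(prob = modelProb,
    mean = sum(N * cond),    # assumed meaning of "Bayes est."
    mode = N[which.max(cond)],
    lower = quantiles[1],
    upper = quantiles[2])
}

# Made-up weights for one model over five candidate population sizes.
toyRow <- c(0.02, 0.06, 0.08, 0.03, 0.01)
toyN <- 601:605
modelSummary <- summarizeModel(toyRow, toyN)
```

Applying the helper to every row, for example with apply(weights, 1, summarizeModel, N = N), and sorting by the prob entry would give a table of the same shape as the one above.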

Performance gain

On a 2013 MacBook Pro 2.6 GHz Intel Core i5, the main routine of dgaFast is about 75 times faster than dga.

if (!require(pacman)) install.packages("pacman")
pacman::p_load(bench, dga)

bench::mark(
     dga::bma.cr(Y, Nmissing, delta, graphs5),
     dgaFast::bma.cr(Y, Nmissing, delta, graphs5), 
     min_iterations=10, check=FALSE)
expression      min   median   itr/sec mem_alloc   gc/sec
dga         866.8ms  919.6ms  0.994153   55.32MB 9.444453
dgaFast      11.3ms   12.5ms 76.286860    2.17MB 1.956073
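
The "about 75 times" figure can be checked against the benchmark output above. The short calculation below copies the reported values by hand; the variable names are illustrative.

```r
# Values copied from the bench::mark output above.
medianMs <- c(dga = 919.6, dgaFast = 12.5)
itrPerSec <- c(dga = 0.994153, dgaFast = 76.286860)
memAllocMB <- c(dga = 55.32, dgaFast = 2.17)

# Speed-up measured on median run time (about 73.6).
speedupMedian <- unname(medianMs["dga"] / medianMs["dgaFast"])

# Speed-up measured on iterations per second (about 76.7).
speedupRate <- unname(itrPerSec["dgaFast"] / itrPerSec["dga"])

# Ratio of allocated memory (about 25.5).
memoryRatio <- unname(memAllocMB["dga"] / memAllocMB["dgaFast"])
```

Both speed measures land near 75, and the same run allocates roughly 25 times less memory.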

Installation

From GitHub:

if (!require(devtools)) install.packages("devtools")
devtools::install_github("OlivierBinette/dgaFast")

References

  • James Johndrow, Kristian Lum and Patrick Ball (2015). dga: Capture-Recapture Estimation using Bayesian Model Averaging. R package version 1.2. https://CRAN.R-project.org/package=dga
  • David Madigan and Jeremy C. York (1997). Bayesian methods for estimation of the size of a closed population. Biometrika, Vol. 84, No. 1, pp. 19-31.
  • Mauricio Sadinle (2018). Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations. Annals of Applied Statistics, Vol. 12, No. 2, pp. 1013-1038.