Implementation of algorithms that extend Iterative Proportional Fitting (IPF) to nested structures.
The IPF algorithm operates on count data. This package implements several algorithms that extend IPF to nested structures: “parent” and “child” items, with constraints that can be provided at both levels.
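For readers new to IPF, here is a minimal base R sketch of the classic, non-nested algorithm on a two-way table of counts. The function name `ipf_2d` and the example numbers are made up for illustration and are not part of mlfit.

```r
# Classic two-dimensional IPF: rescale a seed table until its row and
# column sums match the target margins. Illustrative only, not mlfit code.
ipf_2d <- function(seed, row_targets, col_targets, tol = 1e-10, maxiter = 1000) {
  fitted <- seed
  for (i in seq_len(maxiter)) {
    # Scale every row so that the row sums match the row targets
    fitted <- fitted * (row_targets / rowSums(fitted))
    # Scale every column so that the column sums match the column targets
    fitted <- sweep(fitted, 2, col_targets / colSums(fitted), "*")
    # Columns are exact after the last step, so only the rows need checking
    if (max(abs(rowSums(fitted) - row_targets)) < tol) break
  }
  fitted
}

seed <- matrix(c(1, 2, 3, 4), nrow = 2)
fitted <- ipf_2d(seed, row_targets = c(40, 60), col_targets = c(30, 70))
```

Both sets of targets must add up to the same grand total (100 here). The nested case handled by mlfit differs in that weights are attached to groups while constraints exist for both groups and their members.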
Install from CRAN with:
install.packages("mlfit")
Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("mlfit/mlfit")
Here is a multi-level fitting example with a reference sample (reference_sample) and two control tables (individual_control and group_control). Each row of reference_sample represents an individual in a sample of a population, where HHNR is their group ID and PNR is their individual ID. APER and WKSTAT are individual-level characteristics, and CAR is the only household characteristic of the sample population. The N columns in both control tables give the number of individuals or groups that belong to each category.
library(mlfit)
library(tibble)
reference_sample <- tibble::tribble(
~HHNR, ~PNR, ~APER, ~CAR, ~WKSTAT,
1L, 1L, 3L, "0", "1",
1L, 2L, 3L, "0", "2",
1L, 3L, 3L, "0", "3",
2L, 4L, 2L, "0", "1",
2L, 5L, 2L, "0", "3",
3L, 6L, 3L, "0", "1",
3L, 7L, 3L, "0", "1",
3L, 8L, 3L, "0", "2",
4L, 9L, 3L, "1", "1",
4L, 10L, 3L, "1", "3",
4L, 11L, 3L, "1", "3",
5L, 12L, 3L, "1", "2",
5L, 13L, 3L, "1", "2",
5L, 14L, 3L, "1", "3",
6L, 15L, 2L, "1", "1",
6L, 16L, 2L, "1", "2",
7L, 17L, 5L, "1", "1",
7L, 18L, 5L, "1", "1",
7L, 19L, 5L, "1", "2",
7L, 20L, 5L, "1", "3",
7L, 21L, 5L, "1", "3",
8L, 22L, 2L, "1", "1",
8L, 23L, 2L, "1", "2"
)
individual_control <- tibble::tribble(
~WKSTAT, ~N,
"1", 91L,
"2", 65L,
"3", 104L
)
group_control <- tibble::tribble(
~CAR, ~N,
"0", 35L,
"1", 65L
)
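Before fitting, it can help to check that the two control tables are compatible with the sample. The base R sketch below rebuilds the control tables as plain data frames and tests one necessary (not sufficient) condition: the mean group size implied by the controls must lie within the range of group sizes present in the sample. The variable names are chosen for this sketch only.

```r
# Control tables from the example above, rebuilt as base data frames
individual_control <- data.frame(WKSTAT = c("1", "2", "3"), N = c(91L, 65L, 104L))
group_control <- data.frame(CAR = c("0", "1"), N = c(35L, 65L))

# Number of individuals in each of the eight sample households (HHNR 1 to 8)
sample_group_sizes <- c(3L, 2L, 3L, 3L, 3L, 2L, 5L, 2L)

n_individuals <- sum(individual_control$N) # target number of individuals
n_groups <- sum(group_control$N)           # target number of groups

# Any positive weighting of the sample households yields a mean group size
# between the smallest and the largest household in the sample.
target_mean_size <- n_individuals / n_groups
size_range <- range(sample_group_sizes)
feasible_size <- target_mean_size >= size_range[1] &&
  target_mean_size <= size_range[2]
```

Here the controls ask for 260 individuals in 100 groups, a mean size of 2.6, which lies within the sample's range of 2 to 5.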
First we need to create an ml_problem object, which defines our multi-level fitting problem. special_field_names() is useful for the field_names argument to ml_problem(): this is where we specify the names of the ID columns in our reference sample and the count column in the control tables.
fitting_problem <- ml_problem(
  ref_sample = reference_sample,
  controls = list(
    individual = list(individual_control),
    group = list(group_control)
  ),
  field_names = special_field_names(
    groupId = "HHNR",
    individualId = "PNR",
    count = "N"
  )
)
You can use one of the ml_fit_*() functions to calibrate your fitting problem, or you can use ml_fit(ml_problem, algorithm = "<your-selected-algorithm>").
fit <- ml_fit(ml_problem = fitting_problem, algorithm = "ipu")
fit
#> An object of class ml_fit
#> Algorithm: ipu
#> Success: TRUE
#> Residuals (absolute): min = -6.41906e-05, max = 0
#> Flat problem:
#> An object of class flat_ml_fit_problem
#> Dimensions: 5 groups, 8 target values
#> Model matrix type: separate
#> Original fitting problem:
#> An object of class ml_problem
#> Reference sample: 23 observations
#> Control totals: 1 at individual, and 1 at group level
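To give an intuition for what the "ipu" algorithm does, the following base R sketch applies iterative proportional updating to the example data: one weight per household, updated in turn for each control category. It is a simplified illustration under stated assumptions, not mlfit's implementation. The function `ipu()` and the matrix `hh` are hypothetical names, and `hh` was tabulated by hand from the reference sample above.

```r
# Per-household counts for each control category (rows: HHNR 1 to 8).
# CAR_* flag the household itself; WKSTAT_* count its members per category.
hh <- cbind(
  CAR_0    = c(1, 1, 1, 0, 0, 0, 0, 0),
  CAR_1    = c(0, 0, 0, 1, 1, 1, 1, 1),
  WKSTAT_1 = c(1, 1, 2, 1, 0, 1, 2, 1),
  WKSTAT_2 = c(1, 0, 1, 0, 2, 1, 1, 1),
  WKSTAT_3 = c(1, 1, 0, 2, 1, 0, 2, 0)
)
targets <- c(CAR_0 = 35, CAR_1 = 65, WKSTAT_1 = 91, WKSTAT_2 = 65, WKSTAT_3 = 104)

ipu <- function(counts, targets, maxiter = 2500, tol = 1e-6) {
  w <- rep(1, nrow(counts)) # start from equal household weights
  for (iter in seq_len(maxiter)) {
    for (j in seq_len(ncol(counts))) {
      # Rescale the households that contribute to category j so that the
      # weighted count for this category matches its target exactly
      affected <- counts[, j] > 0
      current <- sum(w * counts[, j])
      w[affected] <- w[affected] * targets[[j]] / current
    }
    # Stop once all categories are matched simultaneously
    rel_error <- abs(drop(w %*% counts) - targets) / targets
    if (max(rel_error) < tol) break
  }
  w
}

weights <- ipu(hh, targets)
fitted_totals <- drop(weights %*% hh)
```

Each update fixes one category and slightly disturbs the others, which is why the procedure iterates until all residuals fall below the tolerance.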
mlfit also provides a function that helps to replicate the reference sample based on the fitted/calibrated weights. See ?ml_replicate to find out which integerisation algorithms are available.
syn_pop <- ml_replicate(fit, algorithm = "trs")
syn_pop
#> # A tibble: 259 x 5
#> HHNR PNR APER CAR WKSTAT
#> <int> <int> <int> <chr> <chr>
#> 1 1 1 3 0 1
#> 2 1 2 3 0 2
#> 3 1 3 3 0 3
#> 4 2 4 3 0 1
#> 5 2 5 3 0 2
#> 6 2 6 3 0 3
#> 7 3 7 2 0 1
#> 8 3 8 2 0 3
#> 9 4 9 2 0 1
#> 10 4 10 2 0 3
#> # ... with 249 more rows
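Replication needs integer weights, so the fitted weights have to be integerised first. The base R sketch below illustrates one common approach, assuming that "trs" refers to truncate, replicate, sample: keep the integer part of each weight, then distribute the remaining units by sampling without replacement, with probabilities given by the fractional parts. The function `trs_integerise()` and the example weights are hypothetical, and mlfit's own implementation may differ in detail.

```r
# Truncate-replicate-sample integerisation (illustrative sketch)
trs_integerise <- function(w) {
  int_w <- floor(w)                         # truncate
  frac <- w - int_w                         # fractional remainders
  deficit <- round(sum(w)) - sum(int_w)     # units still to be allocated
  if (deficit > 0) {
    # Sample without replacement, favouring large fractional parts
    top_up <- sample.int(length(w), size = deficit, prob = frac)
    int_w[top_up] <- int_w[top_up] + 1
  }
  int_w
}

set.seed(1)
w <- c(2.4, 0.7, 3.9, 1.0)
int_w <- trs_integerise(w)

# Row indices of the replicated sample: unit i appears int_w[i] times
replicated_rows <- rep(seq_along(int_w), times = int_w)
```

The integer weights add up to the rounded total of the fitted weights, and each one differs from its fitted value by less than one.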
This example is almost identical to the multi-level fitting example above, except that we create sub-fitting problems based on zones. The geo_hierarchy argument of ml_problem() lets you specify a geographical hierarchy: a data.frame with two columns, region and zone. Put simply, a zone can belong to only one region. The image below shows an example, where the orange patch is a zone within the green region.
When geo_hierarchy is validly specified, ml_problem() returns a list of fitting problems, one per zone. Each fitting problem contains only the subsets of the reference sample and control totals that are relevant to its zone. In other words, the reference sample is a population survey sample taken at the regional level, and the control totals should be at the zonal level.
ref_sample <- tibble::tribble(
~HHNR, ~PNR, ~APER, ~HH_VAR, ~P_VAR, ~REGION,
1, 1, 3, 1, 1, 1,
1, 2, 3, 1, 2, 1,
1, 3, 3, 1, 3, 1,
2, 4, 2, 1, 1, 1,
2, 5, 2, 1, 3, 1,
3, 6, 3, 1, 1, 1,
3, 7, 3, 1, 1, 1,
3, 8, 3, 1, 2, 1,
4, 9, 3, 2, 1, 1,
4, 10, 3, 2, 3, 1,
4, 11, 3, 2, 3, 1,
5, 12, 3, 2, 2, 1,
5, 13, 3, 2, 2, 1,
5, 14, 3, 2, 3, 1,
6, 15, 2, 2, 1, 1,
6, 16, 2, 2, 2, 1,
7, 17, 5, 2, 1, 1,
7, 18, 5, 2, 1, 1,
7, 19, 5, 2, 2, 1,
7, 20, 5, 2, 3, 1,
7, 21, 5, 2, 3, 1,
8, 22, 2, 2, 1, 1,
8, 23, 2, 2, 2, 1,
9, 24, 3, 1, 1, 2,
9, 25, 3, 1, 2, 2,
9, 26, 3, 1, 3, 2,
10, 27, 2, 1, 1, 2,
10, 28, 2, 1, 3, 2,
11, 29, 3, 1, 1, 2,
11, 30, 3, 1, 1, 2,
11, 31, 3, 1, 2, 2,
12, 32, 3, 2, 1, 2,
12, 33, 3, 2, 3, 2,
12, 34, 3, 2, 3, 2,
13, 35, 3, 2, 2, 2,
13, 36, 3, 2, 2, 2,
13, 37, 3, 2, 3, 2,
14, 38, 2, 2, 1, 2,
14, 39, 2, 2, 2, 2,
15, 40, 5, 2, 1, 2,
15, 41, 5, 2, 1, 2,
15, 42, 5, 2, 2, 2,
15, 43, 5, 2, 3, 2,
15, 44, 5, 2, 3, 2,
16, 45, 2, 2, 1, 2,
16, 46, 2, 2, 2, 2
)
hh_ctrl <- tibble::tribble(
~ZONE, ~HH_VAR, ~N,
1, 1, 35,
1, 2, 65,
2, 1, 35,
2, 2, 65,
3, 1, 35,
3, 2, 65,
4, 1, 35,
4, 2, 65
)
ind_ctrl <- tibble::tribble(
~ZONE, ~P_VAR, ~N,
1, 1, 91,
1, 2, 65,
1, 3, 104,
2, 1, 91,
2, 2, 65,
2, 3, 104,
3, 1, 91,
3, 2, 65,
3, 3, 104,
4, 1, 91,
4, 2, 65,
4, 3, 104
)
geo_hierarchy <- tibble::tribble(
~REGION, ~ZONE,
1, 1,
1, 2,
2, 3,
2, 4
)
fitting_problems <- ml_problem(
  ref_sample = ref_sample,
  field_names = special_field_names(
    groupId = "HHNR", individualId = "PNR", count = "N",
    zone = "ZONE", region = "REGION"
  ),
  group_controls = list(hh_ctrl),
  individual_controls = list(ind_ctrl),
  geo_hierarchy = geo_hierarchy
)
#> Creating a list of fitting problems by zone
# Fit each zone's problem, then replicate each fit into a synthetic population
fits <- lapply(fitting_problems, ml_fit, algorithm = "ipu")
syn_pops <- lapply(fits, ml_replicate, algorithm = "trs")
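The zone-by-zone split described above can be pictured with a small base R sketch. For each zone, it looks up the region the zone belongs to, keeps the sample rows from that region, and keeps the control rows for that zone. The function `split_by_zone()` and the toy data are hypothetical and much smaller than the example; this is not how ml_problem() is implemented internally.

```r
# Toy inputs: two regions, three zones (illustrative values only)
toy_sample <- data.frame(HHNR = c(1, 1, 2, 3), PNR = 1:4, REGION = c(1, 1, 1, 2))
toy_ctrl <- data.frame(ZONE = c(1, 2, 3), N = c(10, 20, 30))
toy_hierarchy <- data.frame(REGION = c(1, 1, 2), ZONE = c(1, 2, 3))

split_by_zone <- function(ref_sample, controls, geo_hierarchy) {
  zones <- geo_hierarchy$ZONE
  problems <- lapply(zones, function(zone) {
    # Each zone belongs to exactly one region
    region <- geo_hierarchy$REGION[geo_hierarchy$ZONE == zone]
    list(
      # Sample rows come from the zone's region
      ref_sample = ref_sample[ref_sample$REGION == region, , drop = FALSE],
      # Control totals are specific to the zone
      controls = controls[controls$ZONE == zone, , drop = FALSE]
    )
  })
  names(problems) <- zones
  problems
}

zone_problems <- split_by_zone(toy_sample, toy_ctrl, toy_hierarchy)
```

Zones 1 and 2 share the same regional sample but carry different control totals, which is what allows one survey to be fitted to several zones.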
- grake: A reimplementation of generalized raking (Deville and Särndal, 1992; Deville, Särndal and Sautory, 1993)
- wrswoR: An implementation of fast weighted random sampling without replacement (Efraimidis and Spirakis, 2006)
- mangow: Embed the Gower distance metric in L1
- RANN.L1: k-nearest neighbors using the L1 metric
From version 0.4.0 onwards, the package is named mlfit. To install a version older than 0.4.0, use:
# See https://github.com/mlfit/mlfit/releases for the releases that are available
# To install a certain branch or commit or tag, append it to the repo name, after an @:
devtools::install_github("mlfit/mlfit@v0.3-7")
Note that all versions prior to 0.4.0 must be loaded as MultiLevelIPF, not mlfit.
To cite package ‘mlfit’ in publications use:
Kirill Müller and Amarin Siripanich (2021). mlfit: Iterative Proportional Fitting Algorithms for Nested Structures. https://mlfit.github.io/mlfit/, https://github.com/mlfit/mlfit.
A BibTeX entry for LaTeX users is
@Manual{,
title = {mlfit: Iterative Proportional Fitting Algorithms for Nested Structures},
author = {Kirill Müller and Amarin Siripanich},
year = {2021},
note = {https://mlfit.github.io/mlfit/, https://github.com/mlfit/mlfit},
}
- Casati, D., Müller, K., Fourie, P. J., Erath, A., & Axhausen, K. W. (2015). Synthetic population generation by combining a hierarchical, simulation-based approach with reweighting by generalized raking. Transportation Research Record, 2493(1), 107-116.
- Bösch, P. M., Müller, K., & Ciari, F. (2016). The IVT 2015 baseline scenario. In 16th Swiss Transport Research Conference (STRC 2016).
- Müller, K. (2017). A generalized approach to population synthesis (Doctoral dissertation, ETH Zurich).
- Ilahi, A., & Axhausen, K. W. (2018). Implementing Bayesian network and generalized raking multilevel IPF for constructing population synthesis in megacities. In 18th Swiss Transport Research Conference (STRC 2018). STRC.
- Ilahi, A., & Axhausen, K. W. (2019). Integrating Bayesian network and generalized raking for population synthesis in Greater Jakarta. Regional Studies, Regional Science, 6(1), 623-636.
- Yameogo, B. F., Vandanjon, P. O., Gastineau, P., & Hankach, P. (2021). Generating a two-layered synthetic population for French municipalities: Results and evaluation of four synthetic reconstruction methods. JASSS-Journal of Artificial Societies and Social Simulation, 24, 27p.
- Zhou, M., Li, J., Basu, R., & Ferreira, J. (2022). Creating spatially-detailed heterogeneous synthetic populations for agent-based microsimulation. Computers, Environment and Urban Systems, 91, 101717.