ncn-foreigners/nonprobsvy

IPW and DR add information about totals

Closed this issue · 5 comments

For the IPW estimators (MLE, GEE) as well as DR we should also add information on the estimated totals based on the IPW weights, pop_totals (estimated or provided) and maybe diff between? Something that may give information on the quality of the weights.

For instance:

library(survey)
library(nonprobsvy)

set.seed(1234567890)
N <- 1e6 ## 1000000
n <- 1000
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs)
base_w_bd <- N/sum(population$flag_bd1)

sample_prob <- svydesign(ids= ~1, weights = ~ base_w_srs, 
                         data = subset(population, flag_srs == 1))
result_ipw <- nonprob(
  selection = ~ x1+x2,
  target = ~y1,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob
)

## estimated totals from the IPW
ipw_totals <- colSums(result_ipw$X[(n+1):nrow(result_ipw$X),]*result_ipw$weights)
## estimated totals from the population based on sample
prob_totals <- svytotal(~x1+x2, sample_prob)

##diff 
ipw_totals-c(sum(weights(sample_prob)), prob_totals)

which gives

> ipw_totals-c(sum(weights(sample_prob)), prob_totals)
(Intercept)          x1          x2 
 25687.1268  10626.6271   -397.9258 

and calibrated IPW

## calibrated
result_ipw_gee <- nonprob(
  selection = ~ x1+x2,
  target = ~y1,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob,
  control_selection = controlSel(est_method_sel = "gee", h = 1)
)

## estimated totals from the IPW
ipw_gee_totals <- colSums(result_ipw_gee$X[(n+1):nrow(result_ipw_gee$X),]*result_ipw_gee$weights)
## estimated totals from the population based on sample
prob_totals <- svytotal(~x1+x2, sample_prob)

##diff 
ipw_gee_totals-c(sum(weights(sample_prob)), prob_totals)

gives

> ipw_gee_totals-c(sum(weights(sample_prob)), prob_totals)
  (Intercept)            x1            x2 
 2.211891e-09 -6.845221e-08  1.979060e-09 

diff added to summary:

Estimation Quality:
(Intercept)          x1          x2 
   25687.15    26659.95    72281.64 

Can you provide the full output? Maybe a separate function will be better? Say check_balance? Something like that?

The whole output for now, I will work for switch it into an additional function

Call:
nonprob(data = subset(population, flag_bd1 == 1), selection = ~x1 + 
    x2, target = ~y1, svydesign = sample_prob)

-------------------------
Estimated population mean: 2.911 with overall std.err of: 0.07135
And std.err for nonprobability and probability samples being respectively:
0.002204 and 0.07132

95% Confidence inverval for popualtion mean:
   lower_bound upper_bound
y1    2.771215    3.050907


Based on: Inverse probability weighted method
For a population of estimate size: 1025687
Obtained on a nonprobability sample of size: 693011
With an auxiliary probability sample of size: 1000
-------------------------

Regression coefficients:
-----------------------
For glm regression on selection variable:
             Estimate Std. Error z value P(>|z|)    
(Intercept) -0.541689   0.004512  -120.1  <2e-16 ***
x1           0.040682   0.002450    16.6  <2e-16 ***
x2           1.888268   0.005307   355.8  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-------------------------

Weights:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.071   1.313   1.480   1.799   2.887 
-------------------------

Estimation balance:
(Intercept)          x1          x2 
   25687.15    26659.95    72281.64 
-------------------------

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.99999  0.06587  0.23759  0.26056  0.44387  0.65366 

AIC: 1010348
BIC: 1010382
Log-Likelihood: -505170.9 on 694008 Degrees of freedom

Can you provide an example for calibrated IPW (GEE)?

Call:
nonprob(data = subset(population, flag_bd1 == 1), selection = ~x1 + 
    x2, target = ~y1, svydesign = sample_prob, control_selection = controlSel(est_method_sel = "gee", 
    h = 1))

-------------------------
Estimated population mean: 2.95 with overall std.err of: 0.04205
And std.err for nonprobability and probability samples being respectively:
0.001869 and 0.04201

95% Confidence inverval for popualtion mean:
   lower_bound upper_bound
y1    2.867568    3.032399


Based on: Inverse probability weighted method
For a population of estimate size: 1e+06
Obtained on a nonprobability sample of size: 693011
With an auxiliary probability sample of size: 1000
-------------------------

Regression coefficients:
-----------------------
For glm regression on selection variable:
             Estimate Std. Error  z value P(>|z|)    
(Intercept) -0.328317   0.003162 -103.845  <2e-16 ***
x1          -0.004752   0.001803   -2.636  0.0084 ** 
x2           1.671378   0.004376  381.948  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-------------------------

Weights:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.086   1.320   1.443   1.734   2.418 
-------------------------

Estimation balance:
  (Intercept)            x1            x2 
 1.979060e-09 -6.821938e-08  1.746230e-09 
-------------------------

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.99998  0.07888  0.24199  0.25508  0.42311  0.58636 

AIC: NA
BIC: NA
Log-Likelihood: NA on 694008 Degrees of freedom