# Comparison of different methods for confidence interval calculation
Quantifying uncertainty is a vital part of every statistical study. Many methods exist, but in the hands of an inexperienced user most of them can lead to serious misinterpretation. The bootstrap is an appealing choice for this task because of its robustness, versatility, ease of understanding, and lack of stringent distributional assumptions. Yet even after 40 years of existence, it remains unclear whether this general method is accurate enough to replace the traditional methods specialized for specific parameters of interest. To answer this, we designed an extensive simulation study that assesses the methods' confidence interval estimates for six different parameters on samples of multiple sizes, generated from seven diverse distributions. We selected the double bootstrap as the best general bootstrap method, additionally recommending the standard bootstrap for confidence intervals for extreme percentiles. We compared the best bootstrap methods to the traditional methods and found that, for almost all parameters, no traditional method is practically better. Moreover, the bootstrap gives good estimates even on distributions where traditional methods fail because their assumptions are violated. Our work thus suggests that estimates produced by the proposed bootstrap methods are comparable to or even better than those produced by the traditional methods.
We compared the most commonly used bootstrap methods to traditional methods for confidence interval calculation, evaluating their accuracy and correctness to see where choosing the bootstrap over a traditional method would be a mistake.
For a more detailed description of the methods for confidence interval estimation and of the experiment setup, we refer the reader to the thesis available in the `thesis` folder.
We used the bootstrap methods implemented in the `bootstrap-ci` library: percentile, basic, standard, BC, BC_a, smoothed, studentized, and double. Their descriptions can be found in the library's repository.
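For intuition, here is a minimal NumPy sketch of the simplest of these, the percentile method. This is a textbook illustration, not the `bootstrap-ci` implementation:

```python
import numpy as np

def percentile_ci(data, statistic, alpha=0.05, n_boot=9999, seed=None):
    """Two-sided percentile bootstrap CI for `statistic` on `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample with replacement and evaluate the statistic on each resample.
    boot_stats = np.array([
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    # Interval endpoints are the empirical alpha/2 and 1 - alpha/2 quantiles.
    return np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])

# Example: 95% CI for the median of a log-normal sample.
sample = np.random.default_rng(0).lognormal(mean=0, sigma=1, size=100)
print(percentile_ci(sample, np.median, seed=1))
```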
To get the most general results possible, we compared all methods over many combinations of DGPs (data-generating processes), statistics, dataset sizes, and coverage levels. We used the following distributions:
- standard normal,
- uniform from $0$ to $1$,
- Laplace with $\mu = 0, b = 1$,
- beta with $\alpha = 2, \beta = 10$,
- exponential with $\lambda = 1$,
- log-normal with $\mu = 0, \sigma = 1$,
- bi-normal with $\mu = [1, 1]^T$ and $\Sigma = [2, 0.5; 0.5, 1]$.
We used samples of sizes
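For concreteness, here is a short NumPy sketch of sampling from these DGPs; the sample size below is a placeholder (the sizes actually used are configured in `ci_comparison.py`'s main function):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # placeholder sample size

samples = {
    "normal": rng.standard_normal(n),
    "uniform": rng.uniform(0, 1, n),
    "laplace": rng.laplace(loc=0, scale=1, size=n),
    "beta": rng.beta(2, 10, n),
    "exponential": rng.exponential(scale=1, size=n),  # scale = 1 / lambda
    "log-normal": rng.lognormal(mean=0, sigma=1, size=n),
    "bi-normal": rng.multivariate_normal(
        mean=[1, 1], cov=[[2, 0.5], [0.5, 1]], size=n
    ),
}
```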
To compare the methods, we used two criteria: accuracy and correctness. Accuracy is the more important one: it tells us how close the method's achieved coverage is to the desired one. When two methods achieve the same accuracy, we compared their correctness, measured by the distance of each method's predicted confidence intervals from the exact intervals.
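To make the accuracy criterion concrete, here is a minimal sketch of how achieved coverage can be estimated by simulation; the helper names and repetition count are illustrative, not the study's actual setup:

```python
import numpy as np

def achieved_coverage(draw_sample, true_param, ci_method, n_rep=1000):
    """Fraction of repetitions in which the CI covers the true parameter."""
    hits = 0
    for _ in range(n_rep):
        data = draw_sample()      # fresh sample from the DGP
        lo, hi = ci_method(data)  # confidence interval under evaluation
        hits += lo <= true_param <= hi
    return hits / n_rep

# Example: coverage of the textbook normal CI for the mean, true mean 0.
rng = np.random.default_rng(0)
draw = lambda: rng.standard_normal(30)
ci = lambda x: (x.mean() - 1.96 * x.std(ddof=1) / np.sqrt(x.size),
                x.mean() + 1.96 * x.std(ddof=1) / np.sqrt(x.size))
print(abs(achieved_coverage(draw, 0.0, ci) - 0.95))  # accuracy error
```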
The study was done in three steps:
- Choosing the best bootstrap method.
- Comparing the best bootstrap method to all other methods (bootstrap and traditional), to see where another method gives better one-sided confidence interval estimates.
- Repeating step 2 for two-sided confidence intervals.
For the hierarchical case, we additionally compared different strategies of the cases bootstrap, based on their accuracy and their ability to mimic the DGP's variational properties.
More detailed results can again be found in the thesis, in chapter 4. In short, the answers to the above steps are:
- The best general bootstrap method is the double bootstrap. Additionally, we recommend using the standard bootstrap when estimating confidence intervals for extreme percentiles.
- There is no method (bootstrap or traditional) that has significantly better accuracy in most of the repetitions for experiments on any DGP. Only for the correlation is Fisher's method equally accurate but more correct.
- The conclusions for two-sided intervals are the same.
True coverages for separate experiments and comparisons over several dimensions can be observed interactively on this site.
We recommend using the strategy that samples with replacement on all levels, as it has the best accuracy and best mimics the DGP's variational properties.
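As an illustration of that strategy on two-level data, here is a minimal sketch that resamples groups and then observations within each resampled group; the function name and data layout are our own, not taken from the repository:

```python
import numpy as np

def cases_bootstrap_all_levels(groups, seed=None):
    """One cases-bootstrap resample of two-level data, with replacement
    on both levels. `groups` is a list of 1-D arrays, one per group."""
    rng = np.random.default_rng(seed)
    # Level 1: resample whole groups with replacement.
    chosen = rng.integers(0, len(groups), size=len(groups))
    # Level 2: resample observations within each chosen group.
    return [rng.choice(groups[g], size=len(groups[g]), replace=True)
            for g in chosen]

# Example: three groups of different sizes.
rng = np.random.default_rng(0)
data = [rng.standard_normal(n) for n in (5, 8, 6)]
resample = cases_bootstrap_all_levels(data, seed=1)
```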
The data-generating processes used are implemented in the file `generators.py`, where you can also add your own custom DGP by extending the `DGP` class.
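A sketch of what such an extension could look like; the `sample` method assumed here is hypothetical, so check the actual `DGP` interface in `generators.py`:

```python
import numpy as np
from generators import DGP  # base class provided by this repository

class DGPTriangular(DGP):
    """Hypothetical custom DGP: a triangular distribution on [0, 1]."""

    def sample(self, size):
        # Assumed interface: return a sample of the requested size.
        rng = np.random.default_rng()
        return rng.triangular(left=0, mode=0.5, right=1, size=size)
```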
To get the results of all experiments, both for the non-parametric and the hierarchical case, run the file `ci_comparison.py`. You can change the desired DGPs, sample sizes, parameters, methods, and confidence levels in the file's main function to serve your interests.
You can visually compare all methods' accuracy and correctness over different experiments by running the function `main_plot_comparison` from the `results_visualizations.py` file.
If you want to plot the accuracy and correctness for each experiment together, run the function `separate_experiment_plots` from the same file.
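Assuming both functions can be called without required arguments (check their signatures in `results_visualizations.py`), a session could look like:

```python
from results_visualizations import main_plot_comparison, separate_experiment_plots

main_plot_comparison()        # compare accuracy and correctness across experiments
separate_experiment_plots()   # accuracy and correctness per experiment
```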
The quantitative result analysis is done in the file `result_analysis.py`. Results for the first step of the analysis, the selection of the best bootstrap method, can be obtained with the function `aggregate_results`. Second-step results are obtained with the function `analyze_experiments`. For the third step we used the same function, only changing the value of the parameter `sided` to `'twosided'`.
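Assuming these functions can be called without further required arguments (check `result_analysis.py` for their actual signatures), the three steps could be reproduced like this; the `sided='twosided'` argument is the one described above:

```python
from result_analysis import aggregate_results, analyze_experiments

aggregate_results()                    # step 1: select the best bootstrap method
analyze_experiments()                  # step 2: one-sided interval comparison
analyze_experiments(sided='twosided')  # step 3: two-sided interval comparison
```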