# Comparison of different methods for confidence interval calculation
Quantifying uncertainty is a vital part of every statistical study. Many methods exist, but in the hands of an inexperienced user most of them can lead to serious misinterpretation. The bootstrap is an appealing choice for this task because of its robustness, versatility, ease of understanding, and lack of stringent distributional assumptions. Yet even after 40 years of existence, it remains unclear whether this general method is accurate enough to replace the traditional methods specialized for specific parameters of interest. To answer this, we designed an extensive simulation study that assesses the methods' confidence interval estimates for six different parameters on samples of multiple sizes, generated from seven diverse distributions. We selected the double bootstrap as the best general bootstrap method, additionally recommending the standard bootstrap for confidence intervals for extreme percentiles. We compared the best bootstrap methods to the traditional methods and found that, for almost all parameters, no traditional method is practically better. Moreover, the bootstrap gives good estimates even on distributions where traditional methods fail because their assumptions are violated. Our work thus suggests that estimates produced by the proposed bootstrap methods are comparable to or even better than those produced by the traditional methods.
We compared the most commonly used bootstrap methods to traditional methods for confidence interval calculation, evaluating their accuracy and correctness to see where choosing the bootstrap over a traditional method would be a mistake.
For a more detailed description of the methods for confidence interval estimation and of the experiment setup, we refer the reader to the thesis available in the `thesis` folder.
We used the bootstrap methods implemented in the `bootstrap-ci` library: percentile, basic, standard, BC, BC_a, smoothed, studentized, and double. Their descriptions can be found in the library's repository.
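For intuition, here is a minimal NumPy sketch of the simplest of these, the percentile method. This is a textbook illustration, not the `bootstrap-ci` implementation:

```python
import numpy as np

def percentile_ci(data, statistic, alpha=0.05, n_boot=9999, seed=None):
    """Two-sided percentile bootstrap CI for `statistic` on `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample with replacement and evaluate the statistic on each resample.
    boot_stats = np.array([
        statistic(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    # Interval endpoints are the empirical alpha/2 and 1 - alpha/2 quantiles.
    return np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])

# Example: 95% CI for the median of a log-normal sample.
sample = np.random.default_rng(0).lognormal(mean=0, sigma=1, size=100)
print(percentile_ci(sample, np.median, seed=1))
```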
To get the most general results possible, we compared all methods over many combinations of DGPs (data-generating processes), statistics, dataset sizes, and coverage levels. We used the following distributions:
- standard normal,
- uniform from $0$ to $1$,
- Laplace with $\mu = 0, b = 1$,
- beta with $\alpha = 2, \beta = 10$,
- exponential with $\lambda = 1$,
- log-normal with $\mu = 0, \sigma = 1$,
- bi-normal with $\mu = [1, 1]^T$ and $\Sigma = [2, 0.5; 0.5, 1]$.
We used samples of sizes
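For concreteness, here is a short NumPy sketch of sampling from these DGPs; the sample size below is a placeholder (the sizes actually used are configured in `ci_comparison.py`'s main function):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # placeholder sample size

samples = {
    "normal": rng.standard_normal(n),
    "uniform": rng.uniform(0, 1, n),
    "laplace": rng.laplace(loc=0, scale=1, size=n),
    "beta": rng.beta(2, 10, n),
    "exponential": rng.exponential(scale=1, size=n),  # scale = 1 / lambda
    "log-normal": rng.lognormal(mean=0, sigma=1, size=n),
    "bi-normal": rng.multivariate_normal(
        mean=[1, 1], cov=[[2, 0.5], [0.5, 1]], size=n
    ),
}
```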
To compare the methods, we used two criteria: accuracy and correctness. Accuracy is the more important one: it tells us how close the method's achieved coverage is to the desired one. When two methods achieve the same accuracy, we compared their correctness, measured by the distance of each method's predicted confidence intervals from the exact intervals.
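To make the accuracy criterion concrete, here is a minimal sketch of how achieved coverage can be estimated by simulation; the helper names and repetition count are illustrative, not the study's actual setup:

```python
import numpy as np

def achieved_coverage(draw_sample, true_param, ci_method, n_rep=1000):
    """Fraction of repetitions in which the CI covers the true parameter."""
    hits = 0
    for _ in range(n_rep):
        data = draw_sample()      # fresh sample from the DGP
        lo, hi = ci_method(data)  # confidence interval under evaluation
        hits += lo <= true_param <= hi
    return hits / n_rep

# Example: coverage of the textbook normal CI for the mean, true mean 0.
rng = np.random.default_rng(0)
draw = lambda: rng.standard_normal(30)
ci = lambda x: (x.mean() - 1.96 * x.std(ddof=1) / np.sqrt(x.size),
                x.mean() + 1.96 * x.std(ddof=1) / np.sqrt(x.size))
print(abs(achieved_coverage(draw, 0.0, ci) - 0.95))  # accuracy error
```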
The study was done in three steps:
- Choosing the best bootstrap method.
- Comparing the best bootstrap method to all other methods (bootstrap and traditional), to see where another method gives better one-sided confidence interval estimates.
- Repeating step 2 for two-sided confidence intervals.
For the hierarchical case, we additionally compared different strategies of the cases bootstrap, based on their accuracy and their ability to mimic the DGP's variational properties.
More detailed results can again be found in the thesis, in chapter 4. In short, the answers to the above steps are:
- The best general bootstrap method is the double bootstrap. Additionally, we recommend using the standard bootstrap when estimating confidence intervals for extreme percentiles.
- There is no method (bootstrap or traditional) that has significantly better accuracy in most of the repetitions for experiments on any DGP. Only for the correlation is Fisher's method equally accurate but more correct.
- The conclusions for two-sided intervals are the same.
True coverages for separate experiments and comparisons over several dimensions can be observed interactively on this site.
We recommend using the strategy that samples with replacement on all levels, as it has the best accuracy and best mimics the DGP's variational properties.
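As an illustration of that strategy on two-level data, here is a minimal sketch that resamples groups and then observations within each resampled group; the function name and data layout are our own, not taken from the repository:

```python
import numpy as np

def cases_bootstrap_all_levels(groups, seed=None):
    """One cases-bootstrap resample of two-level data, with replacement
    on both levels. `groups` is a list of 1-D arrays, one per group."""
    rng = np.random.default_rng(seed)
    # Level 1: resample whole groups with replacement.
    chosen = rng.integers(0, len(groups), size=len(groups))
    # Level 2: resample observations within each chosen group.
    return [rng.choice(groups[g], size=len(groups[g]), replace=True)
            for g in chosen]

# Example: three groups of different sizes.
rng = np.random.default_rng(0)
data = [rng.standard_normal(n) for n in (5, 8, 6)]
resample = cases_bootstrap_all_levels(data, seed=1)
```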
The data-generating processes used are implemented in the file `generators.py`, where you can also add your own custom DGP by extending the `DGP` class.
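A sketch of what such an extension could look like; the `sample` method assumed here is hypothetical, so check the actual `DGP` interface in `generators.py`:

```python
import numpy as np
from generators import DGP  # base class provided by this repository

class DGPTriangular(DGP):
    """Hypothetical custom DGP: a triangular distribution on [0, 1]."""

    def sample(self, size):
        # Assumed interface: return a sample of the requested size.
        rng = np.random.default_rng()
        return rng.triangular(left=0, mode=0.5, right=1, size=size)
```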
To get the results of all experiments, both for the non-parametric and the hierarchical case, run the file `ci_comparison.py`. You can change the desired DGPs, sample sizes, parameters, methods, and confidence levels in the file's main function to serve your interests.
You can visually compare all methods' accuracy and correctness over different experiments by running the function `main_plot_comparison` from the `results_visualizations.py` file.
If you want to plot the accuracy and correctness for each experiment together, run the function `separate_experiment_plots` from the same file.
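Assuming both functions can be called without required arguments (check their signatures in `results_visualizations.py`), a session could look like:

```python
from results_visualizations import main_plot_comparison, separate_experiment_plots

main_plot_comparison()        # compare accuracy and correctness across experiments
separate_experiment_plots()   # accuracy and correctness per experiment
```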
The quantitative result analysis is done in the file `result_analysis.py`. Results for the first step of the analysis, the selection of the best bootstrap method, can be obtained with the function `aggregate_results`. Second-step results are obtained with the function `analyze_experiments`. For the third step we used the same function, only changing the value of the parameter `sided` to `'twosided'`.
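Assuming these functions can be called without further required arguments (check `result_analysis.py` for their actual signatures), the three steps could be reproduced like this; the `sided='twosided'` argument is the one described above:

```python
from result_analysis import aggregate_results, analyze_experiments

aggregate_results()                    # step 1: select the best bootstrap method
analyze_experiments()                  # step 2: one-sided interval comparison
analyze_experiments(sided='twosided')  # step 3: two-sided interval comparison
```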