bootstrap-CIHT

A Python module to calculate confidence intervals and perform hypothesis tests on the mean or proportion of one or two groups using bootstrap sampling.
Selected (hopefully useful) visualizations of the data and null distribution are provided via matplotlib.

Index

Description
Parameters
Attributes
Methods
Usage
Dependencies
Installation
Example
Additional Tests
History
License

Description

class bootstrap_CIHT.Bootstrap_CIHT(data, data_col, num_vars=1, null_mean=0.0, group_col=None, group1=None, group2=None, samples=10000, alpha=0.05, h_sides=2, h1_dir=None)

Parameters

data (Pandas dataframe):
Unprocessed user data. Should contain at least a column with the data to analyze. For proportions, data should be coded as 0 or 1. If a two-group comparison, should also have a column containing group membership labels.

data_col (string):
Name of column containing the data to be analyzed

num_vars (int):
Number of groups (1 or 2); default 1

null_mean (float):
For one-group test, comparison mean or proportion; default 0.0
Should be left at 0.0 for two-group comparisons.

group_col (string):
For two-group comparison, name of column with group labels

group1 (int, bool, string):
For two-group comparison, label for first (control) group

group2 (int, bool, string):
For two-group comparison, label for second (experiment) group

samples (int):
Number of bootstrap samples; default 10000

alpha (float):
Sets size of confidence interval (100*(1-alpha)); default 0.05

h_sides (int):
One- or two-sided confidence interval & hypothesis test; default 2

h1_dir (string):
Inequality direction for 1-sided alternative hypothesis:
'greater' -> experiment parameter > control parameter
'less' -> experiment parameter < control parameter
None (default) -> 2-sided comparison
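The data parameter above asks for proportions to be coded as 0 or 1. As a minimal sketch of why that coding works (the label names here are hypothetical and not part of the module), the mean of 0/1 data is simply the proportion of 1s, so a proportion can be bootstrapped like any other mean:

```python
from statistics import mean

# Hypothetical raw labels; the names are illustrative only.
labels = ["converted", "not converted", "converted", "converted", "not converted"]

# Code the outcome of interest as 1 and everything else as 0.
coded = [1 if label == "converted" else 0 for label in labels]

# The mean of 0/1 data equals the proportion of 1s.
proportion = mean(coded)

print(coded)       # [1, 0, 1, 1, 0]
print(proportion)  # 0.6
```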

Attributes:

df (Pandas dataframe):
Dataframe derived from user-supplied data for use in analysis

Methods:

fill_data()
Extract data for analysis from user-supplied dataframe. Performed automatically on instantiation.
Args:
None
Returns:
df (Pandas dataframe):
Dataframe containing data to be analyzed ("data" column) and group membership ("group" column) for two-group comparison

get_bootstrap_sample()
Create sampling distributions using bootstrapping.
Args:
None
Returns:
For one-group analysis:
experiment_mean (numpy array):
Bootstrapped means of the group of interest
For two-group comparison:
experiment_mean (numpy array):
Bootstrapped means for experiment group
control_mean (numpy array):
Bootstrapped means for control group
diffs (numpy array):
Differences in means of bootstrapped samples
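The idea behind get_bootstrap_sample() can be sketched with the standard library alone. This is an illustration of bootstrap resampling in general, not the module's own code: the function name bootstrap_means is hypothetical, the data are synthetic, and resampling each group independently is an assumption (the module may instead resample the whole dataframe and then split by group).

```python
import random
from statistics import fmean

def bootstrap_means(values, samples=10000, rng=None):
    """Return `samples` means, each from a resample of `values` with replacement."""
    rng = rng or random.Random()
    n = len(values)
    return [fmean(rng.choices(values, k=n)) for _ in range(samples)]

# Synthetic stand-ins for the control and experiment groups.
rng = random.Random(42)
control = [rng.gauss(1.0, 0.5) for _ in range(200)]
experiment = [rng.gauss(0.0, 0.5) for _ in range(200)]

control_mean = bootstrap_means(control, samples=1000, rng=rng)
experiment_mean = bootstrap_means(experiment, samples=1000, rng=rng)

# Difference taken as experiment minus control, matching the sign in the Example output.
diffs = [e - c for e, c in zip(experiment_mean, control_mean)]
print(len(diffs), round(fmean(diffs), 2))
```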

calculate_CI(means)
Compute and print 100*(1-alpha) confidence intervals for the relevant sampling distribution. For a one-group case, this is for the mean of the data. For a two-group case, it is for the difference in means.
Args:
means (numpy array):
Contains either the bootstrapped mean (for a single group) or difference in means (for two groups)
Returns:
CI (tuple):
Lower and upper confidence limits
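The Example output below reports limits at the 2.5 and 97.5 percentiles, which suggests a percentile interval. A rough standard-library sketch of a two-sided percentile interval follows; the helper names percentile and percentile_ci are hypothetical, and the module itself presumably relies on numpy rather than this hand-rolled interpolation.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated q-th percentile (0-100) of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def percentile_ci(means, alpha=0.05):
    """Two-sided 100*(1-alpha)% percentile interval as (lower, upper)."""
    ordered = sorted(means)
    return (percentile(ordered, 100 * alpha / 2),
            percentile(ordered, 100 * (1 - alpha / 2)))

# Evenly spaced stand-in for bootstrapped means: 0.00, 0.01, ..., 1.00
means = [i / 100 for i in range(101)]
lower, upper = percentile_ci(means, alpha=0.05)
print(lower, upper)  # approximately 0.025 and 0.975
```

A one-sided interval would place all of alpha in one tail; how the module chooses the limits in that case is not shown here.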

calculate_hypothesis_test(means)
Calculate p-value for a 1- or 2-sided hypothesis test comparing the sample mean to the null mean. For two-group comparisons, the difference in means is compared to a null mean of zero.
Plots the observed statistic on a histogram of the null distribution.
Args:
means (numpy array):
Contains either the bootstrapped mean (for a single group) or difference in means (for two groups)
Returns:
p_value (float):
Probability, under the null hypothesis, of a statistic at least as extreme as the one observed
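One common way to get such a p-value is to simulate a null distribution centred on the null mean with the spread of the bootstrapped sampling distribution, then count how often it is at least as extreme as the observed statistic. The sketch below assumes that approach; the source does not state how the module builds its null distribution, and the name null_p_value and the synthetic inputs are hypothetical.

```python
import random
from statistics import fmean, stdev

def null_p_value(means, observed, null_mean=0.0, h1_dir=None, samples=10000, rng=None):
    """Fraction of simulated null values at least as extreme as `observed`."""
    rng = rng or random.Random()
    sd = stdev(means)  # spread of the bootstrapped sampling distribution
    null_vals = [rng.gauss(null_mean, sd) for _ in range(samples)]
    if h1_dir == "greater":
        extreme = sum(v >= observed for v in null_vals)
    elif h1_dir == "less":
        extreme = sum(v <= observed for v in null_vals)
    else:  # two-sided: compare distances from the null mean
        dist = abs(observed - null_mean)
        extreme = sum(abs(v - null_mean) >= dist for v in null_vals)
    return extreme / samples

# Synthetic stand-in for bootstrapped differences in means.
rng = random.Random(0)
means = [rng.gauss(-1.0, 0.02) for _ in range(2000)]

p_two_sided = null_p_value(means, observed=fmean(means), rng=rng)
p_at_null = null_p_value(means, observed=0.0, rng=rng)
print(p_two_sided)  # 0.0, the observed difference is far out in the tail
print(p_at_null)    # 1.0, an observed value equal to the null mean is never extreme
```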

plot_hist_CI(means, CI, bins=10)
Plot the sampling distribution with confidence intervals as vertical red lines. For the single-group case, also plots the null mean as a vertical green line.
Args:
means (numpy array):
Contains either the bootstrapped mean (for a single group) or difference in means (for two groups)
CI (tuple):
Confidence limits.
bins (int):
Number of bins in the histogram; default 10
Returns:
None

plot_distribs(experiment_mean, control_mean, bins=10)
For a two-group comparison, plot histograms of the control and experimental sampling distributions on a single graph.
Args:
experiment_mean (numpy array):
Bootstrapped samples of mean for experiment group
control_mean (numpy array):
Bootstrapped samples of mean for control group
bins (int):
Number of bins in the histogram; default 10
Returns:
None

Usage

Dependencies

numpy : 1.19.1
pandas : 1.1.3
matplotlib : 3.3.1
Additionally, for unit tests:
scipy : 1.5.2

Developed & tested in Python 3.8.5

Installation

Assuming working numpy, pandas, and matplotlib:

> pip install bootstrap_CIHT

Example

This uses a 'toy' dataset (available in the tests folder) with two groups: "success" and "failure". The success group is 1000 samples from a normal distribution with a mean of 0.0 and a standard deviation of 0.5. The failure group is 1000 samples from a normal distribution with a mean of 1.0 and a standard deviation of 0.5.
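For readers who want a similar dataset without the tests folder, something like the following standard-library sketch would generate one. It is only an approximation of the description above: the values will not match nml2.csv, the write_csv helper is hypothetical, and the column names "numbers" and "outcome" are taken from the session below.

```python
import csv
import io
import random

rng = random.Random(42)  # stdlib generator, so values differ from nml2.csv

# 1000 "success" rows from N(0.0, 0.5), then 1000 "failure" rows from N(1.0, 0.5).
rows = [(rng.gauss(0.0, 0.5), "success") for _ in range(1000)]
rows += [(rng.gauss(1.0, 0.5), "failure") for _ in range(1000)]

def write_csv(handle, rows):
    """Write a header line plus one (numbers, outcome) line per row."""
    writer = csv.writer(handle)
    writer.writerow(["numbers", "outcome"])
    writer.writerows(rows)

# An in-memory buffer keeps the sketch side-effect free; pass
# open("some_file.csv", "w", newline="") instead to write to disk.
buffer = io.StringIO()
write_csv(buffer, rows)
print(buffer.getvalue().splitlines()[0])  # numbers,outcome
```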

Note that each pop-up plot window will need to be closed for the script to continue.

>>> import pandas as pd
>>> import numpy as np
>>> from bootstrap_CIHT import Bootstrap_CIHT as BS
>>> np.random.seed(42) #for reproducibility
>>> df = pd.read_csv('nml2.csv')
>>> bs1 = BS(df, 'numbers', num_vars=2, group_col='outcome', group1='failure', group2='success')
>>> bs1.df.head() #the processed data
       data    group
0  0.248357  success
1 -0.069132  success
2  0.323844  success
3  0.761515  success
4 -0.117077  success
>>> bs1.df.tail()
          data    group
1995  1.535075  failure
1996  0.986739  failure
1997  0.559063  failure
1998  0.918467  failure
1999  0.627549  failure
>>> exp_mean, ctrl_mean, diffs = bs1.get_bootstrap_sample()
>>> CIs = bs1.calculate_CI(diffs)
2-sided CI limits: 2.5 97.5
95% 2-sided CI for differences in means of numbers: (-1.0697323775784746, -0.9825440488702097)

>>> p_val = bs1.calculate_hypothesis_test(diffs)
experimental group: success; mean: 0.0097
control group: failure; mean: 1.0354
difference of means, sampling distribution SD: -1.0258, 0.0223
H0: Difference in means success vs failure = 0
H1: Difference in means success vs failure <> 0

p-value: 0.000000

>>> bs1.plot_hist_CI(diffs, CIs, bins=20)
>>> bs1.plot_distribs(exp_mean, ctrl_mean, bins=20)

The plot produced by calculate_hypothesis_test is:

The plot from plot_hist_CI is:

Finally, plot_distribs produces:

Additional Tests

The tests folder contains unit tests in test.py. These use the same toy data as above and are run via:
>python -m unittest test.py
There are also four other files (test0.py, test1.py, test2.py, test3.py) that reproduce exercises from a Udacity Introduction to Statistics course I previously took. The source .csv files are included, as are the Jupyter notebooks with the original exercises and results.
Note that the exercise in test0.py appears to produce results that vary a fair amount with each re-sampling, so you may get inconsistent CIs and p-values.
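Run-to-run variation of this kind is inherent to bootstrapping, and fixing the random seed (as the Example does with np.random.seed) is the usual way to make a result repeatable. The sketch below illustrates this with the standard library's random module rather than numpy; the function name bootstrap_ci and the synthetic data are hypothetical.

```python
import random
from statistics import fmean

def bootstrap_ci(values, samples, seed, alpha=0.05):
    """Approximate two-sided percentile interval from a seeded bootstrap."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(samples))
    lower_idx = int(samples * alpha / 2)
    upper_idx = min(samples - 1, int(samples * (1 - alpha / 2)))
    return means[lower_idx], means[upper_idx]

# Small, skewed, synthetic sample: the kind of data where intervals move around.
data_rng = random.Random(0)
values = [data_rng.expovariate(1.0) for _ in range(30)]

run_a = bootstrap_ci(values, samples=500, seed=1)
run_b = bootstrap_ci(values, samples=500, seed=2)
run_a_again = bootstrap_ci(values, samples=500, seed=1)

print(run_a)
print(run_b)                 # differs from run_a: a different resampling
print(run_a == run_a_again)  # True: the same seed gives the same interval
```

Increasing the number of bootstrap samples should also reduce the variation between runs, at the cost of run time.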

History

Created April 12, 2021

License

Licensed under the MIT License.

ensst6/bootstrap_CIHT