DSC381
DSC-381 Simulation Work
This contains compiled simulation functions in python for statistical analysis related to DSC381.
Prerequisites:
- Python 3
- The following packages:
- numpy
- pandas
- matplotlib
- scipy
- statsmodels
Functionality
Currently, there is one python script (statistics_functions_py) capable of doing the following tests:
Hypothesis Testing
- Hypothesis test for a single mean
hyptest_singlemean(data, null_hyp = 0, n = 10000, alt = 'two-sided')
Returns simulation p-value (float) for hypothesis test of a single mean.
Parameter(s):
data (array) is the dataset for analysis
null_hyp (float) is the null hypothesis
n (int) is the number of bootstrap simulations
alt (string) defines alt hypothesis ['two-sided, 'less', 'greater']
- Hypothesis test for a single in proportion
hyptest_singleprop(p_sample, p_null = 0.5, size = 30, n = 10000, alt = 'two-sided', disp = 'count')
Returns simulation p-value (float) for hypothesis test of a single proportion.
Parameter(s):
p_sample (float) is the sample proportion
p_null (float) is the null hypothesis proportion
size (int) is the sample size
n (int) is the number of bootstrap simulations
alt (string) defines alt hypothesis ['two-sided, 'less', 'greater']
disp (string) choose plot display ['count', 'prop']
- Hypothesis test for difference in two means
hyptest_diffmeans(data_1, data_2, n = 10000, alt = 'two-sided')
Returns simulation p-value (float) for hypothesis test of a difference in two means
Null hypothesis is defined as the two means being equal
Parameter(s):
data_1 (array) is the dataset for first mean
data_2 (array) is the dataset for second mean
n (int) is the number of bootstrap simulations
alt (string) defines alt hypothesis ['two-sided, 'less', 'greater']
- Hypothesis test for difference in two proportions
hyptest_diffprops(p_1, size_1, p_2, size_2, n = 10000, alt = 'two-sided')
Returns simulation p-value (float) for hypothesis test of a difference in two proportions
Null hypothesis is defined as the two proportions being equal
Parameter(s):
p_1 (float) is proportion for dataset 1
size_1 (int) is the size of dataset 1
p_2 (float) is the proportion for dataset 2
size_2 (int) is the size of dataset 2
n (int) is the number of bootstrap simulations
alt (string) defines alt hypothesis ['two-sided, 'less', 'greater']
- Hypothesis test for a slope
hyptest_slope(features, targets, n=10000, alt='two-sided')
Returns simulation p-value (float) for hypothesis test of a slope (correlation) of two datasets
Null hypothesis is that there is no correlation between the two datasets
Parameter(s):
features (array) is a single dimension array of features
targets (array) is a single dimension array of targets
n (int) is the number of bootstrap simulations
alt (string) defines alt hypothesis ['two-sided, 'less', 'greater']
Confidence Intervals
- Confidence Interval for a statistic
ci_statistic(data, n=10000, l_tail=5, r_tail=95, stat='mean')
Returns returns confidence interval (tuple) for a statistic (left tail, right tail, length)
Parameter(s):
data (array) is the dataset for analysis
n (int) is the number of bootstrap simulations
l_tail (float) is the cut off percentile for the left tail [0,100]
r_tail (float) is the cut off percentile for the right tail [0,100]
stat (string) is the statistic being analyzed ["mean","median","stdev"]
- Confidence Interval for a proportion
ci_prop(p, size, n=10000, l_tail=5, r_tail=95)
Returns returns confidence interval (tuple) a proportion (left tail, right tail, length)
Parameter(s):
p (float) is proportion for the dataset
size (int) is the size of the dataset
n (int) is the number of bootstrap simulations
l_tail (float) is the cut off percentile for the left tail [0,100]
r_tail (float) is the cut off percentile for the right tail [0,100]
- Confidence Interval for difference of two proportions
ci_diffprops(p_1, size_1, p_2, size_2, n=10000, l_tail=5, r_tail=95)
Returns returns confidence interval (tuple) for difference in proportions (left tail, right tail, length)
Parameter(s):
p_1 (float) is proportion for dataset 1
size_1 (int) is the size of dataset 1
p_2 (float) is the proportion for dataset 2
size_2 (int) is the size of dataset 2
n (int) is the number of bootstrap simulations
l_tail (float) is the cut off percentile for the left tail [0,100]
r_tail (float) is the cut off percentile for the right tail [0,100]
- Confidence Interval for difference of two proportions
ci_diffstatistics(data_1, data_2, n=10000, l_tail=5, r_tail=95, stat='mean')
Returns the confidence interval for the difference in two statistics
Parameter(s):
data_1 (array) is the dataset for first mean
data_2 (array) is the dataset for second mean
n (int) is the number of bootstrap simulations
l_tail (float) is the cut off percentile for the left tail [0,100]
r_tail (float) is the cut off percentile for the right tail [0,100]
stat (string) is the statistic being analyzed ["mean","median","stdev"]
- Confidence Interval for a slope
ci_slope(features, targets, n=10000, l_tail=5, r_tail=95)
Returns simulation p-value (float) for hypothesis test of a slope (correlation) of two datasets
Null hypothesis is that there is no correlation between the two datasets
Parameter(s):
features (array) is a single dimension array of features
targets (array) is a single dimension array of targets
n (int) is the number of bootstrap simulations
alt (string) defines alt hypothesis ['two-sided, 'less', 'greater']
Sample Size and Margin Estimation
samplesize_stat(ci, stdev, margin):
Returns the minimum required sample size (int) given a confidence interval, standard deviation, and sample size for a statistic
Parameter(s):
ci (float) is the desired confidence interval in an interval (0, 1)
stdev (float) is the estimated population standard deviation
margin (float) is the desired margin of error of the statistic
marginsize_stat(ci, stdev, size):
Returns the margin size (float) given a confidence interval, standard deviation, and sample size for a statistic
Parameter(s):
ci (float) is the desired confidence interval in an interval (0, 1)
stdev (float) is the estimated population standard deviation
size (int) is the size of the sample
marginsize_stat(ci, stdev, size):
Returns the margin size (float) given a confidence interval, standard deviation, and sample size for a statistic
Parameter(s):
ci (float) is the desired confidence interval in an interval (0, 1)
stdev (float) is the estimated population standard deviation
size (int) is the size of the sample
samplesize_prop(ci, p_est = 0.5, margin = 0.05):
Returns the minimum required sample size (int) given a confidence interval, est. proportion, and margin of error for a proportion
Parameter(s):
ci (float) is the desired confidence interval in an interval (0, 1)
stdev (float) is the estimated population standard deviation
margin (float) is the desired margin of error of the statistic
marginsize_prop(ci, p_est = 0.5, size = 100):
Returns the margin (float) given a confidence interval, est. proportion, and sample size for a proportion
Parameter(s):
ci (float) is the desired confidence interval in an interval (0, 1)
stdev (float) is the estimated population standard deviation
size (int) is the size of the sample