- add stratification to doubleBootstrap
This repository contains code, walkthroughs, examples, and applications of bootstrap methods. It utilizes the computing infrastructure of SAS Viya's CAS engine. This allows distributed computation of bootstrap iterations in parallel with very minimal code!
When our sample is limited (when isn't it?) and we want to understand parameter estimates, it would be desirable to draw more samples from the population. With bootstrapping, we can instead resample the sample many times to learn more from it and assess uncertainty. The benefit of this project is making the "many times" easy and fast for the user.
For an acknowledgement of the importance of the bootstrap, a great place to start is the writeup for its inventor being awarded the International Prize in Statistics. Thank you, Professor Efron, for putting a computer on the desk (now, the cloud) of every statistician.
All code is written in SAS CASL, which can be executed from a SAS interface with PROC CAS or from the various APIs (Python, R, REST, ...). As of SAS Viya version 3.4 there is no packaged bootstrap action. The included Sampling and Partitioning action set does not have options for sampling with replacement.
This repository has a user-defined action set and instructions for loading it in your environment. It also serves as an example of how to extend the capabilities of SAS Viya and share them with all users in your environment.
Have something to add? Just clone or branch it, commit changes, and create a pull request!
Review the section RepositoryLayout to understand dependencies in the repository structure.
Have comments, questions, or suggestions? Just use the issues feature in GitHub.
As updates are made to the repository, keep in mind the dependencies between files and folders. The primary file is `resample - defineActionSet.sas`, and any updates to it will require updates in `/walkthroughs` and `Readme.md`. Some folders and files are standalone, like `/tools` and `/applications`, but additions to them still need to be listed in `Readme.md`. Also, `/examples` may need to be updated if the action calls are updated with parameter changes.
- resample - defineActionSet.sas The resample actionset definition file.
- Folder: examples contains examples of using the actions
  - example 1 - loading and using bootstrap action from resample.sas
  - example 2 - regression bootstrap parameter estimates.sas
  - example 3 - regression double-bootstrap parameter estimates.sas
  - example 4 - regression jackknife parameter estimates.sas
  - example 5 - using bootstrap results to diagnose influence with model fit.sas
  - example 6 - using bootstrap results to diagnose influence with model accuracy.sas
  - example 7 - residual bootstraping.sas
- Folder: walkthroughs contains step-by-step commented versions of the code within the actions to help understand how they work. This is great for learning!
- Folder: applications will soon contain broader applications of the actions
- Folder: upcoming contains walkthroughs and examples in development (undocumented here)
- Folder: tools contains a set of stand-alone examples to diagnose and understand the computing environment
Run the code in resample - defineActionSet.sas. Some lines that may need changing:
- line 1: connects to a CAS session
- To save the actions for future sessions and use by other users:
  - line 174: creates an in-memory table of the action set
  - line 175: persists the in-memory table as a .sashdat file. Here it points to caslib="Public".
- If you need to remove the action set, uncomment and use:
  - line 177: removes the persisted in-memory table
To use the actions you will need to load the user defined actions with:
builtins.actionSetFromTable / table={caslib="Public" name="resampleActionSet.sashdat"} name="resample";
Table of contents:
- Relationship Map
- Quickstart Examples
- resample.addRowID action
- resample.bootstrap action
- resample.doubleBootstrap action
- resample.jackknife action
- resample.percentilePE action
This is a reference chart for the relationship between the actions and their output tables.
resample.addRowID | resample.bootstrap | resample.doubleBootstrap | resample.jackknife | resample.percentilePE |
---|---|---|---|---|
resample.addRowID / intable="sample"; |
resample.bootstrap / intable="sample" Seed=12345 B=100 Bpct=1 case="unique_case" strata="strata" strata_table="tableName"; |
resample.doubleBootstrap / intable="sample" Seed=12345 B=100 Bpct=1 D=50 Dpct=1 case="unique_case"; |
resample.jackknife / intable="sample" case="unique_case"; |
resample.percentilePE / intable="sample" alpha=0.05; |
Updates the provided table with a new column named rowID that numbers the rows naturally (1, 2, ..., n) across the distributed in-memory table.
- rowID - is the naturally numbered (1, 2, ..., n) row identifier for the sampled row in `<intable>`
CASL Syntax
resample.addRowID /
intable="string"
intable="string"
- required
- Specifies the name of the table in CAS
Creates a table of bootstrap resamples from table `<intable>` and stores them in a table named `<intable>_bs`. Runs the addRowID action on the `<intable>` cases. Columns that describe the link between the bootstrap resamples and the original sample are:
- bsID - is the naturally numbered (1, 2, ..., b) identifier of a resample
- bs_caseID - is the naturally numbered (1, 2, ..., n) case identifier within the value of bsID
- caseID - is the naturally numbered (1, 2, ..., n) case identifier for the resampled case in `<intable>`
  - see strata
- strata (if strata= is a column in `<intable>`) - defines subgroups (by groups) in `<intable>` for which caseID numbering is unique. Each strata level starts over at caseID=1 and has a decimal value representing the unique strata level: 1.01, 2.01, 3.01 are all for the same strata level (.01), while 1.02, 2.02, 3.02 are all for strata level (.02).
- bag - is 1 for resampled case, 0 for caseID values not resampled within the bsID (will have missing for bs_caseID)
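To make these columns concrete, here is a minimal sketch in plain Python (not CASL, and not the action's actual implementation) that builds rows shaped like `<intable>_bs` for an unstratified sample. The function name `bootstrap_table` and its arguments are hypothetical and exist only for this illustration.

```python
import random


def bootstrap_table(n_cases, B, seed=12345):
    """Return rows (bsID, bs_caseID, caseID, bag) for B resamples of cases 1..n.

    Hypothetical helper: it mimics the column layout described above,
    not the internals of resample.bootstrap.
    """
    rng = random.Random(seed)
    rows = []
    for bs_id in range(1, B + 1):
        # Draw n case identifiers with replacement from 1..n.
        drawn = [rng.randint(1, n_cases) for _ in range(n_cases)]
        for bs_case_id, case_id in enumerate(drawn, start=1):
            rows.append((bs_id, bs_case_id, case_id, 1))  # bagged case
        # Cases never drawn are out-of-bag: bag=0 and bs_caseID is missing.
        for case_id in sorted(set(range(1, n_cases + 1)) - set(drawn)):
            rows.append((bs_id, None, case_id, 0))
    return rows


rows = bootstrap_table(n_cases=5, B=3)
for row in rows:
    print(row)
```

Each bsID ends up with exactly n bagged rows, plus one bag=0 row for every case that was never drawn in that resample.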
The resample.bootstrap action has two input parameters to direct stratification: `strata` and `strata_table`. If the value of `strata` is not a column in `<intable>` then stratification does not occur and bootstrap resampling proceeds on the full `<intable>`.
The `strata_table` parameter allows you to provide an input table with columns `strata` (same name as in `<intable>` and provided with the `strata` parameter), `strata_n`, and `strata_dist` (optional).
An example `<strata_table>` might look like:
MyStrata | strata_n |
---|---|
Level 1 | 10 |
Level 2 | 100 |
Level 3 | 45 |
If you want to randomly assign the `strata_n` value for each strata level in each bootstrap resample, use the `strata_dist` column to provide input for the SAS RAND function.
MyStrata | strata_n | strata_dist |
---|---|---|
Level 1 | | 'normal,0,20' |
Level 2 | | 'normal,200,50' |
Level 3 | | 'hyper,200,50,50' |
Notes on the parameter precedence:
- The bootstrap action will check for the `strata` variable in `<intable>` and compute the number of observations present for each level.
- If `<strata_table>` does not exist or does not contain the `strata` variable then bootstrap commences with the `<intable>` size for each strata level adjusted by the `Bpct` parameter.
- Else If `<strata_table>` does exist then the `Bpct` parameter is ignored and the action:
  - Adds any missing levels of `strata` found in the `<intable>` and populates `strata_n`
  - Removes any levels of `strata` not found in `<intable>`
  - If `strata_dist` is present then it is used to calculate a new `strata_n` for each bootstrap sample
    - Else If `strata_dist` is not provided, then the input value of `strata_n` is used for each bootstrap sample
    - Else If `strata_n` is not provided, then the calculated value from `<intable>` is used for each bootstrap sample
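The precedence rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the action's code: the helper `resolve_strata_n` is hypothetical, the rounding of `count * Bpct` is assumed, and the `strata_dist` branch is left out because it depends on the SAS RAND function.

```python
def resolve_strata_n(intable_counts, Bpct=1.0, strata_table=None):
    """Pick the resample size for each strata level.

    intable_counts: {level: number of cases found in <intable>}
    strata_table:   {level: strata_n or None}, or None when no table is given
    Hypothetical helper; rounding with round() is an assumption.
    """
    if strata_table is None:
        # No strata table: use the <intable> size per level, adjusted by Bpct.
        return {level: round(n * Bpct) for level, n in intable_counts.items()}

    resolved = {}
    # Looping over the <intable> levels adds levels missing from the strata
    # table and drops levels that appear only in the strata table.
    for level, n in intable_counts.items():
        requested = strata_table.get(level)
        # Bpct is ignored here; fall back to the <intable> count if strata_n is absent.
        resolved[level] = n if requested is None else requested
    return resolved


counts = {"A": 10, "B": 100}
print(resolve_strata_n(counts, Bpct=0.5))
print(resolve_strata_n(counts, Bpct=0.5, strata_table={"A": 25, "C": 7}))
```

In the second call level `C` is dropped because it is not in `<intable>`, level `B` falls back to its `<intable>` count, and `Bpct` has no effect.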
CASL Syntax
resample.bootstrap /
intable="string"
case="string"
strata="string"
strata_table="string"
B=integer
Bpct=double
seed=integer
intable="string"
- required
- specifies the name of the table to resample from in CAS
case="string"
- required
- Specifies the name of the column from `<intable>` that connects groupings of rows that make up cases. If the value specified is not a column name in `<intable>` then the rows will be used as individual cases during resampling.
strata="string"
- required
- Specifies the name of the column from `<intable>` that is used to partition or group the rows before resampling. Resampling will happen independently within each level of the strata variable and the Bpct= parameter will apply separately to each strata level. This essentially acts as a by variable for resampling. If the value specified is not a column name in `<intable>` then bootstrap sampling proceeds without stratification.
strata_table="string"
- required
- Specifies the name of a table `<strata_table>` that is used to specify the sample info for levels of the `strata` variable. See Notes on Stratification above.
B=integer
- required
- Specifies the desired number of bootstrap resamples.
- Note: Will look at the number of threads (nthreads) in the environment and set the value of bss (resamples per threadid) to ensure the final number of bootstrap resamples is >=B.
Bpct=double
- required (optional with default=1 in the future)
- The fraction of the number of sample cases (intable) to use as the resample size, where 1=100%
seed=integer
- required (optional with default=0 in the future)
- Sets the seed for random sampling. If missing, zero, or negative then SAS will compute a default seed.
- See the documentation for Call Streaminit for further information on specifying a seed and changing the random-number generator (RNG).
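The note on `B=` says the action sets bss (resamples per thread) so that the final count is at least B. One way that could work is ceiling division, sketched below in plain Python. This is an assumption about the arithmetic, not a statement of what the action does internally, and `resamples_per_thread` is a hypothetical name.

```python
import math


def resamples_per_thread(B, nthreads):
    """Return (bss, total) where total = bss * nthreads >= B.

    Assumes every thread produces the same number of resamples.
    """
    bss = math.ceil(B / nthreads)
    return bss, bss * nthreads


bss, total = resamples_per_thread(B=100, nthreads=16)
print(bss, total)
```

Under this assumption, requesting B=100 on 16 threads would give 7 resamples per thread and 112 resamples in total, which is why the result can exceed the requested B.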
Creates a table of bootstrap and double-bootstrap resamples from table `<intable>` and stores them in tables `<intable>_bs` and `<intable>_dbs`. Runs the addRowID action on the `<intable>` cases. If the bootstrap action has already been run on table `<intable>` then the table `<intable>_bs` already exists and will be used for double-bootstrapping. Columns that describe the link between the double-bootstrap resamples and the bootstrap resamples are:
- bsID - is the naturally numbered (1, 2, ..., b) identifier of a resample
- dbsID - is the naturally numbered (1, 2, ..., d) identifier of a resample from a bsID
- dbs_caseID - is the naturally numbered (1, 2, ..., n) case identifier within the value of dbsID
- bs_caseID - is the naturally numbered (1, 2, ..., n) case identifier for the resampled case in bsID
- caseID - is the naturally numbered (1, 2, ..., n) case identifier for the resampled case in `<intable>`
  - see strata
- strata (if strata= is a column in `<intable>` for a previously run resample.bootstrap call) - defines a subgroup (by group) in `<intable>` for which caseID numbering is unique. Each strata level starts over at caseID=1 and has a decimal value representing the unique strata level: 1.01, 2.01, 3.01 are all for the same strata level (.01), while 1.02, 2.02, 3.02 are all for strata level (.02).
  - Note: stratification is not yet available for the doubleBootstrap action. You can still run the bootstrap action first with stratification and then use the result table for the doubleBootstrap technique. This feature is being worked on.
- bag - is 1 for resampled cases, 0 for caseID values not resampled within the bsID (will have missing for bs_caseID)
  - 0 could be a non-resampled row in either the bsID or the dbsID (resampled from bsID)
CASL Syntax
resample.doubleBootstrap /
intable="string"
case="string"
B=integer
D=integer
seed=integer
Bpct=double
Dpct=double
intable="string"
- required
- specifies the name of the table to resample from in CAS
case="string"
- required
- Specifies the name of the column from `<intable>` that connects groupings of rows that make up cases. If the value specified is not a column name in `<intable>` then the rows will be used as individual cases during resampling.
B=integer
- required
- Specifies the desired number of bootstrap resamples. Will look at the number of threads (nthreads) in the environment and set the value of bss (resamples per threadid) to ensure the final number of bootstrap resamples is >=B.
- Note: If you run resample.bootstrap first then you should use the same value of B (it will ignore the value and use the value from the prior bootstrap).
- If you don't run resample.bootstrap first then resample.doubleBootstrap will run it first.
D=integer
- required
- Specifies the desired number of double-bootstrap resamples from each bootstrap resample.
Bpct=double
- required (optional with default=1 in the future)
- The fraction of the number of sample cases (intable) to use as the resample size, where 1=100%
Dpct=double
- required (optional with default=1 in the future)
- The fraction of the number of bootstrap resample cases (intable_bs) to use as the double-bootstrap resample size, where 1=100%
- Note: if Bpct is set to 50% (0.5) and Dpct is set to 100% (1) then the double-bootstrap resamples will still be 50% of the size of the original sample's (intable) number of cases
seed=integer
- required (optional with default=0 in the future)
- Sets the seed for random sampling. If missing, zero, or negative then SAS will compute a default seed.
- See the documentation for Call Streaminit for further information on specifying a seed and changing the random-number generator (RNG).
Note: The number of double-bootstrap resamples is at least B×D. For example: B=1000 and D=1000 yields at least B×D=1000000.
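The relationship between `Bpct`, `Dpct`, and the B×D count can be illustrated with a short plain-Python sketch. It is a conceptual illustration only, not the action's implementation: `double_bootstrap` is a hypothetical helper, rounding of the resample sizes is assumed, and the thread-based adjustment of B is ignored.

```python
import random


def double_bootstrap(cases, B, D, Bpct=1.0, Dpct=1.0, seed=12345):
    """Return {(bsID, dbsID): (bootstrap resample, double-bootstrap resample)}.

    Hypothetical helper; sizes are rounded, which is an assumption.
    """
    rng = random.Random(seed)
    bs_size = round(len(cases) * Bpct)   # size of each bootstrap resample
    dbs_size = round(bs_size * Dpct)     # size of each double-bootstrap resample
    result = {}
    for bs_id in range(1, B + 1):
        # First level: resample the original cases with replacement.
        bs = rng.choices(cases, k=bs_size)
        for dbs_id in range(1, D + 1):
            # Second level: resample the bootstrap resample with replacement.
            result[(bs_id, dbs_id)] = (bs, rng.choices(bs, k=dbs_size))
    return result


resamples = double_bootstrap(list(range(1, 11)), B=4, D=3, Bpct=0.5, Dpct=1.0)
print(len(resamples))
```

With 10 cases, `Bpct=0.5`, and `Dpct=1`, every double-bootstrap resample has 5 cases, and it can only contain cases that were drawn into its parent bootstrap resample.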
Creates a table of jackknife resamples from table `<intable>` and stores them in table `<intable>_jk`. Runs the addRowID action on the `<intable>` cases. There will be J resamples identified with jkID, where J is equal to the number of cases in `<intable>`. The values of jkID are numbered 1, 2, ..., n and each has resampled cases identified by caseID. When caseID from `<intable>` is equal to jkID the case is deleted/omitted.
- jkID - is the naturally numbered (1, 2, ..., n) identifier of a resample
- caseID - is the naturally numbered (1, 2, ..., n) case identifier for the resampled case in `<intable>`
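As a small illustration of the layout described above, this plain-Python sketch lists the (jkID, caseID) pairs for a sample of n cases. It is not the action's implementation, and `jackknife_table` is a hypothetical name used only here.

```python
def jackknife_table(n_cases):
    """Return (jkID, caseID) rows: resample jkID keeps every case except caseID == jkID."""
    return [
        (jk_id, case_id)
        for jk_id in range(1, n_cases + 1)
        for case_id in range(1, n_cases + 1)
        if case_id != jk_id  # the case matching jkID is omitted
    ]


for row in jackknife_table(3):
    print(row)
```

For n cases this yields n resamples of n-1 cases each, so the table has n×(n-1) rows.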
CASL Syntax
resample.jackknife /
intable="string"
case="string"
intable="string"
- required
- specifies the name of the table to resample from in CAS
case="string"
- required
- Specifies the name of the column from `<intable>` that connects groupings of rows that make up cases. If the value specified is not a column name in `<intable>` then the rows will be used as individual cases during resampling.
This action creates percentile confidence intervals for parameter estimates. The action uses the sample table name `<intable>` and expects to find `<intable>_PE` for the full data parameter estimates and one or more of `<intable>_BS_PE`, `<intable>_DBS_PE`, and `<intable>_JK_PE`, which are the parameter estimates from fitting the model to the resample data created by the bootstrap, doubleBootstrap, and jackknife actions. Check example 2, example 3, and example 4 for help using this action.
CASL Syntax
resample.percentilePE /
intable="string"
alpha=double
intable="string"
- required
- specifies the name of the sample table
- The action will look for `<intable>_BS_PE`, `<intable>_DBS_PE`, and `<intable>_JK_PE`. It expects to also find `<intable>_PE` for the full data. These are the outputs of parameter estimates from the model action of your choice. Run the resample action(s) you want, then do groupby model fitting with the model action of your choice and use this naming for the PE files.
alpha=double
- required
- specifies the alpha level of the two-sided percentile confidence interval that will be constructed. Provide a value in (0,1).
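For readers new to percentile intervals, here is a minimal plain-Python sketch of the idea: sort the resampled estimates of one parameter and read off the alpha/2 and 1-alpha/2 quantiles. The helper `percentile_interval` is hypothetical, and the linear-interpolation quantile rule is an assumption; the percentilePE action may define percentiles differently.

```python
def percentile_interval(estimates, alpha=0.05):
    """Two-sided percentile interval from resampled estimates of one parameter.

    Uses linear interpolation between order statistics (an assumed rule).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    ordered = sorted(estimates)
    if not ordered:
        raise ValueError("estimates must not be empty")

    def quantile(q):
        # Fractional position of the q-th quantile among the sorted values.
        pos = q * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        weight = pos - lower
        return ordered[lower] * (1 - weight) + ordered[upper] * weight

    return quantile(alpha / 2), quantile(1 - alpha / 2)


# Toy estimates standing in for one parameter's values across resamples.
toy_estimates = [float(i) for i in range(1, 102)]
print(percentile_interval(toy_estimates, alpha=0.05))
```

A smaller `alpha` widens the interval because the cut points move further into the tails of the resampled estimates.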
- SAS Support Supplied macros for Bootstrap, Jackknife and some bias and confidence interval computations
- The DO Loop Blog: The essential guide to bootstrapping in SAS
- Bootstrap
  - Take a sample dataset with rows 1, ..., n. Create B resamples with replacement from the sample dataset. Each resample will also have n rows. Rows included in a resample, b, are called bagged. Rows not selected for a particular resample, b, are called out-of-bag.
- Double-bootstrap
  - First bootstrap as described above to create B resamples. For each resample, b, do subsequent resamples called double-bootstraps. Each of these double-bootstraps also has n rows, where the rows are sampled with replacement from the corresponding bootstrap sample.
- Jackknife
  - This resampling technique takes resamples of size n-1 from the original sample of size n. There will be J=n jackknife resamples, where each has n-1 rows and resample j omits row j.