R helper functions

Helper functions to make it easier to analyse and summarise data and results in R. Use at your own risk!!! If anything doesn't work or is wrong, let me know and I'll try to fix it as soon as possible (create an issue or email me at hauselin@gmail.com).

Summarise statistical models plus effect sizes

When fitting models, use summaryh() in place of summary() to get APA (American Psychological Association) formatted output that also includes effect size estimates for each effect (r effect size).

Currently accepts models fitted with these functions: lm, anova, aov, chisq.test, cor.test, glm, lmer, glmer, lme, t.test. For lmer models, p-values must have been computed with lmerTest. Feel free to modify the functions for other models or other formatting styles. Or let me know if you want the function to work with other model types.

To use summaryh, run this line of code each time you start a new R session: source("https://raw.githubusercontent.com/hauselin/Rcode/master/summaryh.R"). The first time you run this line of code, it will take some time because it's going to install a few R packages. Subsequently, it should load the functions much faster.

Arguments in summaryh(model, decimal = 2, showTable = F, showEffectSizesTable = F)

model (required): fitted model
decimal (default = 2): decimal places of output
showTable (default = F): show results in table format
showEffectSizesTable (default = F): show other effect sizes computed using es function (see sections below) (d, r, R², f, odds ratio, log odds ratio, area under curve)

Example outputs (output is data.table class)

summaryh(lm(mpg ~ qsec, mtcars)): b = 1.41, SE = 0.56, t(30) = 2.53, p = .017, r = 0.42
summaryh(aov(mpg ~ gear, mtcars)): F(1, 30) = 9.00, p = .005, r = 0.48
summaryh(cor.test(mtcars$mpg, mtcars$gear)): r(30) = 0.48, p = .005
summaryh(t.test(mpg ~ vs, mtcars)): t(23) = −4.67, p < .001, r = 0.70
summaryh(glm(vs ~ 1, mtcars, family = "binomial")): b = −0.25, SE = 0.36, z(31) = −0.71, p = .481, r = −0.07

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/summaryh.R")

# linear regression
model_lm <- lm(mpg ~ cyl, mtcars) 
summary(model_lm) # base R summary()
summaryh(model_lm) # returns APA-formatted output in a data.table (output below)
##           term                                                 results
## 1: (Intercept) b = 37.88, SE = 2.07, t(30) = 18.27, p < .001, r = 0.96
## 2:         cyl b = −2.88, SE = 0.32, t(30) = −8.92, p < .001, r = 0.85 
summaryh(model_lm, decimal = 5, showTable = T, showEffectSizesTable = T) # optional arguments

# linear mixed effects regression
library(lme4); library(lmerTest) # load packages to fit mixed effects models
model <- lmer(weight ~ Time * Diet  + (1 + Time | Chick), data = ChickWeight)
summary(model)
summaryh(model, decimal = 4, showTable = T, showEffectSizesTable = T) # optional arguments

# ANOVA
summaryh(aov(mpg ~ gear, mtcars))

# correlation
cor.test(mtcars$mpg, mtcars$cyl)
summaryh(cor.test(mtcars$mpg, mtcars$cyl))
summaryh(cor.test(mtcars$mpg, mtcars$cyl), 3, T, T)

Convert between effect sizes

The es function converts one effect size into other effect sizes (e.g., d, r, R², f, odds ratio, log odds ratio, area-under-curve [AUC]). Note that AUC calculations may not be correct!

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/es.R")

es(d = 0.3) # convert Cohen's d to other effect sizes
es(r = c(0.1, 0.3, 0.5)) # convert multiple effect sizes (r) to other effect sizes

Detect and remove outliers

We can identify and remove outliers in our data by identifying data points that are too extreme—either too many standard deviations (SD) away from the mean or too many median absolute deviations (MAD) away from the median. The SD approach might not be ideal with extreme outliers, whereas the MAD approach is much more robust (for comparison of both approaches, see Leys et al., 2013, Journal of Experimental Social Psychology).

detectOutliers script contains two functions: outliersZ and outliersMAD

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/detectOutliers.R")

example <- c(1, 3, 3, 6, 8, 10, 10, 1000) # 1000 is an outlier
outliersZ(example) # SD approach
outliersMAD(example) # MAD approach

# changing a few default values (note that if zCutOff is 3, 1000 ISN'T considered an outlier!)
outliersZ(x = example, zCutOff = 3, replaceOutliersWith = -999) # common to use 1.96, 2.5, 3 for Z cutoff
outliersMAD(x = example, MADCutOff = 3, replaceOutliersWith = -888) # Leys et al. (2003) recommends 2.5 to 3 for MAD cutoff

Compute between- and within-subjects standard errors and confidence intervals

Code adapted from Cookbook for R

When using the functions se (between-subjects) or seWithin (within-subjects), you can specify more than one outcome variable via the measurevar argument. If you specify more than one outcome variable, the output will be a list that has length of the number of outcome variables provided.

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/se.R")

# between-subjects standard error and confidence intervals (95% default)
se(data = mtcars, measurevar = "disp", groupvars = "cyl")
se(data = mtcars, measurevar = c("mpg", "disp"), groupvars = c("cyl", "am"))
se(data = ChickWeight, measurevar = "weight", groupvars = "Diet")

# within-subjects standard error and confidence intervals (95% default)
seWithin(data = ChickWeight, measurevar = "weight", betweenvars = "Diet", withinvars = "Time", idvar = "Chick")

Fit EZ-diffusion model for two-choice response time tasks

fit_ezddm function fits Wagenmaker et al.'s (2007) EZ-diffusion model for two-choice response time tasks. To use the function, ensure your dataframe is in long form, has single-trial reaction time (in seconds) and accuracy (coded as 0/1 or "lower"/"upper") on each row. You can use the function to fit the EZ-diffusion model to just a single subject or multiple subjects, and separately for each experimental condition (see below for examples).

Assumptions of EZ-diffusion model

error and correct reaction-time distributions are identical (often violated!)
z = .5: starting point is equidistant from the response boundaries
sv = 0: inter-trial variability in drift rate is negligible
sz = 0: inter-trial variability in starting point is negligible
st = 0: inter-trial range in nondecision time is negligible

To use/download fit_ezddm, run this line of code: source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ezddm.R"). The first time you run this line of code, it will take some time because; subsequently, it should load the functions much faster.

Arguments in fit_ezddm(data, rts, responses, id = NULL, group = NULL, simCheck = TRUE, decimal = 4)

data (required): data object with reaction time and accuracy variables (long form data expected)
rts (required; in seconds): specify in characters the name of the reactiontime column
responses (required; coded as 0/1 or "lower"/"upper"): specify in characters the name of the accuracy column
id (default = NULL): specify in characters the name of your subject/id column (if not specified, assumes data [all rows] belong to a single subject)
group (default = NULL): specify in characters the name of your column(s) indicating various conditions
simCheck (default = TRUE): simulate data (n = 1000) with estimated parameters (using rdiffusion) to check model fit
decimal (default = 4): round parameter estimates

Output (tibble and data.table class)

subject: returns this variable only there's more than one subject
group/condition names: returns these variables only if you specify grouping variables
n: total number of trials
n0: number of lower bound trials (response = 0 or 'lower')
n1: number of upper bound trials (response = 1 or 'upper')
a: threshold/boundary
v: drift rate/evidence accumulation rate
t0_Ter: non-decision time
rt1Var: reaction time variance for upper-bound/correct trials (used by ezddm to compute parameters)
response: proportion of upper bound (1) responses
responseSim: simulated proportion of upper bound responses (based on parameter estimates)
rtOverall: reaction time for upper and lower-bound trials
rtOverallSim: simulated reaction time for upper and lower trials
rt0: reaction time for lower-bound trials
rt0Sim: simulated reaction time for lower-bound trials
rt1: reaction time for upper-bound trials
rt1Sim: simulated reaction time for upper-bound trials
accAdjust: indicates if mean accuracies (acc) have been adjusted; ezddm can't estimate parameters if mean accuracy is exactly 0.5 or 1.0; if acc_adjust is 0, no adjustments have been made; if acc_adjust is 1, minor adjustments have been made

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ezddm.R")

# simulate some data
library(rtdists)
data1 <- rdiffusion(n = 100, a = 2, v = 0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
data2 <- rdiffusion(n = 100, a = 2, v = -0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
dataAll <- rbind(data1, data2) # join data
dataAll$response <- ifelse(dataAll$response == "upper", 1, 0) # convert responses to 1 and 0
dataAll$subject <- rep(c(1, 2), each = 100) # assign subject id
dataAll$cond1 <- sample(c("a", "b"), 200, replace = T) # randomly assign conditions a/b
dataAll$cond2 <- sample(c("y", "z"), 200, replace = T) # randomly assign conditions y/z

# fit model to just entire data set (assumes all data came from 1 subject)
fit_ezddm(data = dataAll, rts = "rt", responses = "response")
# fit model to each subject (no conditions)
fit_ezddm(data = dataAll, rts = "rt", responses = "response", id = "subject") 
# fit model to each subject by cond1
fit_ezddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = "cond1") 
# fit model to each subject by cond1,cond2
fit_ezddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = c("cond1", "cond2"))

Fit drift-diffusion model for two-choice response time tasks using maximum likelihood estimation (parameters: a, v, t0, z)

fit_ddm function fits four-parameter (a, v, t0, z) drift diffusion model (also known as Wiener diffusion model) to two-choice response time tasks using maximum likelihood estimation (R ucminf optimization). Assumes no or negligible inter-trial variability in drift rate (sv), starting point (sz), and non-decision time (st)—these parameters aren't estimated by fit_ddm.

To use/download fit_ddm, run this line of code: source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ddm.R"). The first time you run this line of code, it will take some time because; subsequently, it should load the functions much faster.

Arguments in fit_ddm(data, rts, responses, id = NULL, group = NULL, startParams = c(a = 2, v = 0.1, t0 = 0.3, z = 0.5), simCheck = TRUE, decimal = 4, parallel = FALSE)

data (required): data object with reaction time and accuracy variables (long form data expected)
rts (required; in seconds): specify in characters the name of the reactiontime column
responses (required; coded as 0/1 or "lower"/"upper"): specify in characters the name of the accuracy column
id (default = NULL): specify in characters the name of your subject/id column (if not specified, assumes data [all rows] belong to a single subject)
group (default = NULL): specify in characters the name of your column(s) indicating various conditions
startParams (default = c(a = 2, v = 0.1, t0 = 0.3, z = 0.5)): starting parameters for likelihood estimation with ucminf; accepts named vector (see default values) or dataframe of starting values (see example below)
simCheck (default = TRUE): simulate data (n = 1000) with estimated parameters (using rdiffusion) to check model fit
decimal (default = 4): round parameter estimates
parallel (default = FALSE): use parallel processing (when multiple startParams have been provided); uses all but 1 available cores

Output (tibble and data.table class)

subject: returns this variable only there's more than one subject
group/condition names: returns these variables only if you specify grouping variables
n: total number of trials
n0: number of lower bound trials (response = 0 or 'lower')
n1: number of upper bound trials (response = 1 or 'upper')
a: boundary/threshold
v: drift rate/evidence accumulation rate
t0: non-decision time
z: starting-point bias (0.5 is no bias)
convergence: reason for optimization termination (see ?ucminf)
value: objective function value at computed miminizer (see ?ucminf); if multiple startParams provided, only returns result with smallest value
response: proportion of upper bound (1) responses
responseSim: simulated proportion of upper bound responses (based on parameter estimates)
rtOverall: reaction time for upper and lower-bound trials
rtOverallSim: simulated reaction time for upper and lower trials
rt0: reaction time for lower-bound trials
rt0Sim: simulated reaction time for lower-bound trials
rt1: reaction time for upper-bound trials
rt1Sim: simulated reaction time for upper-bound trials

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ddm.R")

# simulate some data
library(rtdists)
data1 <- rdiffusion(n = 100, a = 2, v = 0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
data2 <- rdiffusion(n = 100, a = 2, v = -0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
dataAll <- rbind(data1, data2) # join data
dataAll$subject <- rep(c(1, 2), each = 100) # assign subject id
dataAll$cond1 <- sample(c("a", "b"), 200, replace = T) # randomly assign conditions a/b
dataAll$cond2 <- sample(c("y", "z"), 200, replace = T) # randomly assign conditions y/z

# fit model to just entire data set (assumes all data came from 1 subject)
fit_ddm(data = dataAll, rts = "rt", responses = "response")
# fit model to each subject (no conditions)
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject") 
# fit model to each subject by cond1
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = "cond1") 
# fit model to each subject by cond1,cond2
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = c("cond1", "cond2"))

# create combinations of starting values
startingValues <- expand.grid(a = seq(0.5, 4.5, by = 1.5), v = seq(-3, 3, by = 1.5), t0 = c(0.3, 0.6), z = c(0.4, 0.6))
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = c("cond1", "cond2"), startParams = startingValues)

Fit full drift-diffusion model for two-choice response time tasks using maximum likelihood estimation (parameters: a, v, t0, z, st0, sz, sv)

fit_ddmfull function fits seven-parameter (a, v, t0, z, st0, sz, sv) drift diffusion model to two-choice response time tasks using maximum likelihood estimation (R ucminf optimization). (st0: intertrial variability in non-decision time; sz: intertrial variability of starting point; sv: intertrial variability in drift rate) Works the same way as fit_ezddm and fit_ddm functions. Default starting parameters for maximum likelihood estimation are startParams = c(a = 2, v = 0.1, t0 = 0.3, z = 0.5, st0 = 0.1, sz = 0.1, sv = 0.1). To use the function, run source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ddmfull.R").