/Rcode

R functions to summarise model outputs, calculate effect sizes, remove outliers, etc.

Primary LanguageR

R helper functions

Helper functions to make it easier to analyse and summarise data and results in R. Use at your own risk!!! If anything doesn't work or is wrong, let me know and I'll try to fix it as soon as possible (create an issue or email me at hauselin@gmail.com).

Summarise statistical models plus effect sizes

When fitting models, use summaryh() in place of summary() to get APA (American Psychological Association) formatted output that also includes effect size estimates for each effect (r effect size).

Currently accepts models fitted with these functions: lm, anova, aov, chisq.test, cor.test, glm, lmer, glmer, lme, t.test. For lmer models, p-values must have been computed with lmerTest. Feel free to modify the functions for other models or other formatting styles. Or let me know if you want the function to work with other model types.

To use summaryh, run this line of code each time you start a new R session: source("https://raw.githubusercontent.com/hauselin/Rcode/master/summaryh.R"). The first time you run this line of code, it will take some time because it's going to install a few R packages. Subsequently, it should load the functions much faster.

Arguments in summaryh(model, decimal = 2, showTable = F, showEffectSizesTable = F)

  • model (required): fitted model
  • decimal (default = 2): decimal places of output
  • showTable (default = F): show results in table format
  • showEffectSizesTable (default = F): show other effect sizes computed using es function (see sections below) (d, r, R2, f, odds ratio, log odds ratio, area under curve)

Example outputs (output is data.table class)

  • summaryh(lm(mpg ~ qsec, mtcars)): b = 1.41, SE = 0.56, t(30) = 2.53, p = .017, r = 0.42
  • summaryh(aov(mpg ~ gear, mtcars)): F(1, 30) = 9.00, p = .005, r = 0.48
  • summaryh(cor.test(mtcars$mpg, mtcars$gear)): r(30) = 0.48, p = .005
  • summaryh(t.test(mpg ~ vs, mtcars)): t(23) = −4.67, p < .001, r = 0.70
  • summaryh(glm(vs ~ 1, mtcars, family = "binomial")): b = −0.25, SE = 0.36, z(31) = −0.71, p = .481, r = −0.07
# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/summaryh.R")

# linear regression
model_lm <- lm(mpg ~ cyl, mtcars) 
summary(model_lm) # base R summary()
summaryh(model_lm) # returns APA-formatted output in a data.table (output below)
##           term                                                 results
## 1: (Intercept) b = 37.88, SE = 2.07, t(30) = 18.27, p < .001, r = 0.96
## 2:         cyl b = −2.88, SE = 0.32, t(30) = −8.92, p < .001, r = 0.85 
summaryh(model_lm, decimal = 5, showTable = T, showEffectSizesTable = T) # optional arguments

# linear mixed effects regression
library(lme4); library(lmerTest) # load packages to fit mixed effects models
model <- lmer(weight ~ Time * Diet  + (1 + Time | Chick), data = ChickWeight)
summary(model)
summaryh(model, decimal = 4, showTable = T, showEffectSizesTable = T) # optional arguments

# ANOVA
summaryh(aov(mpg ~ gear, mtcars))

# correlation
cor.test(mtcars$mpg, mtcars$cyl)
summaryh(cor.test(mtcars$mpg, mtcars$cyl))
summaryh(cor.test(mtcars$mpg, mtcars$cyl), 3, T, T)

Convert between effect sizes

The es function converts one effect size into other effect sizes (e.g., d, r, R2, f, odds ratio, log odds ratio, area-under-curve [AUC]). Note that AUC calculations may not be correct!

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/es.R")

es(d = 0.3) # convert Cohen's d to other effect sizes
es(r = c(0.1, 0.3, 0.5)) # convert multiple effect sizes (r) to other effect sizes

Detect and remove outliers

We can identify and remove outliers in our data by identifying data points that are too extreme—either too many standard deviations (SD) away from the mean or too many median absolute deviations (MAD) away from the median. The SD approach might not be ideal with extreme outliers, whereas the MAD approach is much more robust (for comparison of both approaches, see Leys et al., 2013, Journal of Experimental Social Psychology).

detectOutliers script contains two functions: outliersZ and outliersMAD

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/detectOutliers.R")

example <- c(1, 3, 3, 6, 8, 10, 10, 1000) # 1000 is an outlier
outliersZ(example) # SD approach
outliersMAD(example) # MAD approach

# changing a few default values (note that if zCutOff is 3, 1000 ISN'T considered an outlier!)
outliersZ(x = example, zCutOff = 3, replaceOutliersWith = -999) # common to use 1.96, 2.5, 3 for Z cutoff
outliersMAD(x = example, MADCutOff = 3, replaceOutliersWith = -888) # Leys et al. (2003) recommends 2.5 to 3 for MAD cutoff

Compute between- and within-subjects standard errors and confidence intervals

Code adapted from Cookbook for R

When using the functions se (between-subjects) or seWithin (within-subjects), you can specify more than one outcome variable via the measurevar argument. If you specify more than one outcome variable, the output will be a list that has length of the number of outcome variables provided.

# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/se.R")

# between-subjects standard error and confidence intervals (95% default)
se(data = mtcars, measurevar = "disp", groupvars = "cyl")
se(data = mtcars, measurevar = c("mpg", "disp"), groupvars = c("cyl", "am"))
se(data = ChickWeight, measurevar = "weight", groupvars = "Diet")

# within-subjects standard error and confidence intervals (95% default)
seWithin(data = ChickWeight, measurevar = "weight", betweenvars = "Diet", withinvars = "Time", idvar = "Chick")

Fit EZ-diffusion model for two-choice response time tasks

fit_ezddm function fits Wagenmaker et al.'s (2007) EZ-diffusion model for two-choice response time tasks. To use the function, ensure your dataframe is in long form, has single-trial reaction time (in seconds) and accuracy (coded as 0/1 or "lower"/"upper") on each row. You can use the function to fit the EZ-diffusion model to just a single subject or multiple subjects, and separately for each experimental condition (see below for examples).

Assumptions of EZ-diffusion model

  • error and correct reaction-time distributions are identical (often violated!)
  • z = .5: starting point is equidistant from the response boundaries
  • sv = 0: inter-trial variability in drift rate is negligible
  • sz = 0: inter-trial variability in starting point is negligible
  • st = 0: inter-trial range in nondecision time is negligible

To use/download fit_ezddm, run this line of code: source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ezddm.R"). The first time you run this line of code, it will take some time because; subsequently, it should load the functions much faster.

Arguments in fit_ezddm(data, rts, responses, id = NULL, group = NULL, simCheck = TRUE, decimal = 4)

  • data (required): data object with reaction time and accuracy variables (long form data expected)
  • rts (required; in seconds): specify in characters the name of the reactiontime column
  • responses (required; coded as 0/1 or "lower"/"upper"): specify in characters the name of the accuracy column
  • id (default = NULL): specify in characters the name of your subject/id column (if not specified, assumes data [all rows] belong to a single subject)
  • group (default = NULL): specify in characters the name of your column(s) indicating various conditions
  • simCheck (default = TRUE): simulate data (n = 1000) with estimated parameters (using rdiffusion) to check model fit
  • decimal (default = 4): round parameter estimates

Output (tibble and data.table class)

  • subject: returns this variable only there's more than one subject
  • group/condition names: returns these variables only if you specify grouping variables
  • n: total number of trials
  • n0: number of lower bound trials (response = 0 or 'lower')
  • n1: number of upper bound trials (response = 1 or 'upper')
  • a: threshold/boundary
  • v: drift rate/evidence accumulation rate
  • t0_Ter: non-decision time
  • rt1Var: reaction time variance for upper-bound/correct trials (used by ezddm to compute parameters)
  • response: proportion of upper bound (1) responses
  • responseSim: simulated proportion of upper bound responses (based on parameter estimates)
  • rtOverall: reaction time for upper and lower-bound trials
  • rtOverallSim: simulated reaction time for upper and lower trials
  • rt0: reaction time for lower-bound trials
  • rt0Sim: simulated reaction time for lower-bound trials
  • rt1: reaction time for upper-bound trials
  • rt1Sim: simulated reaction time for upper-bound trials
  • accAdjust: indicates if mean accuracies (acc) have been adjusted; ezddm can't estimate parameters if mean accuracy is exactly 0.5 or 1.0; if acc_adjust is 0, no adjustments have been made; if acc_adjust is 1, minor adjustments have been made
# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ezddm.R")

# simulate some data
library(rtdists)
data1 <- rdiffusion(n = 100, a = 2, v = 0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
data2 <- rdiffusion(n = 100, a = 2, v = -0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
dataAll <- rbind(data1, data2) # join data
dataAll$response <- ifelse(dataAll$response == "upper", 1, 0) # convert responses to 1 and 0
dataAll$subject <- rep(c(1, 2), each = 100) # assign subject id
dataAll$cond1 <- sample(c("a", "b"), 200, replace = T) # randomly assign conditions a/b
dataAll$cond2 <- sample(c("y", "z"), 200, replace = T) # randomly assign conditions y/z

# fit model to just entire data set (assumes all data came from 1 subject)
fit_ezddm(data = dataAll, rts = "rt", responses = "response")
# fit model to each subject (no conditions)
fit_ezddm(data = dataAll, rts = "rt", responses = "response", id = "subject") 
# fit model to each subject by cond1
fit_ezddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = "cond1") 
# fit model to each subject by cond1,cond2
fit_ezddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = c("cond1", "cond2"))

Fit drift-diffusion model for two-choice response time tasks using maximum likelihood estimation (parameters: a, v, t0, z)

fit_ddm function fits four-parameter (a, v, t0, z) drift diffusion model (also known as Wiener diffusion model) to two-choice response time tasks using maximum likelihood estimation (R ucminf optimization). Assumes no or negligible inter-trial variability in drift rate (sv), starting point (sz), and non-decision time (st)—these parameters aren't estimated by fit_ddm.

To use/download fit_ddm, run this line of code: source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ddm.R"). The first time you run this line of code, it will take some time because; subsequently, it should load the functions much faster.

Arguments in fit_ddm(data, rts, responses, id = NULL, group = NULL, startParams = c(a = 2, v = 0.1, t0 = 0.3, z = 0.5), simCheck = TRUE, decimal = 4, parallel = FALSE)

  • data (required): data object with reaction time and accuracy variables (long form data expected)
  • rts (required; in seconds): specify in characters the name of the reactiontime column
  • responses (required; coded as 0/1 or "lower"/"upper"): specify in characters the name of the accuracy column
  • id (default = NULL): specify in characters the name of your subject/id column (if not specified, assumes data [all rows] belong to a single subject)
  • group (default = NULL): specify in characters the name of your column(s) indicating various conditions
  • startParams (default = c(a = 2, v = 0.1, t0 = 0.3, z = 0.5)): starting parameters for likelihood estimation with ucminf; accepts named vector (see default values) or dataframe of starting values (see example below)
  • simCheck (default = TRUE): simulate data (n = 1000) with estimated parameters (using rdiffusion) to check model fit
  • decimal (default = 4): round parameter estimates
  • parallel (default = FALSE): use parallel processing (when multiple startParams have been provided); uses all but 1 available cores

Output (tibble and data.table class)

  • subject: returns this variable only there's more than one subject
  • group/condition names: returns these variables only if you specify grouping variables
  • n: total number of trials
  • n0: number of lower bound trials (response = 0 or 'lower')
  • n1: number of upper bound trials (response = 1 or 'upper')
  • a: boundary/threshold
  • v: drift rate/evidence accumulation rate
  • t0: non-decision time
  • z: starting-point bias (0.5 is no bias)
  • convergence: reason for optimization termination (see ?ucminf)
  • value: objective function value at computed miminizer (see ?ucminf); if multiple startParams provided, only returns result with smallest value
  • response: proportion of upper bound (1) responses
  • responseSim: simulated proportion of upper bound responses (based on parameter estimates)
  • rtOverall: reaction time for upper and lower-bound trials
  • rtOverallSim: simulated reaction time for upper and lower trials
  • rt0: reaction time for lower-bound trials
  • rt0Sim: simulated reaction time for lower-bound trials
  • rt1: reaction time for upper-bound trials
  • rt1Sim: simulated reaction time for upper-bound trials
# load functions from my github site
source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ddm.R")

# simulate some data
library(rtdists)
data1 <- rdiffusion(n = 100, a = 2, v = 0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
data2 <- rdiffusion(n = 100, a = 2, v = -0.3, t0 = 0.5, z = 0.5 * 2) # simulate data
dataAll <- rbind(data1, data2) # join data
dataAll$subject <- rep(c(1, 2), each = 100) # assign subject id
dataAll$cond1 <- sample(c("a", "b"), 200, replace = T) # randomly assign conditions a/b
dataAll$cond2 <- sample(c("y", "z"), 200, replace = T) # randomly assign conditions y/z

# fit model to just entire data set (assumes all data came from 1 subject)
fit_ddm(data = dataAll, rts = "rt", responses = "response")
# fit model to each subject (no conditions)
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject") 
# fit model to each subject by cond1
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = "cond1") 
# fit model to each subject by cond1,cond2
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = c("cond1", "cond2"))

# create combinations of starting values
startingValues <- expand.grid(a = seq(0.5, 4.5, by = 1.5), v = seq(-3, 3, by = 1.5), t0 = c(0.3, 0.6), z = c(0.4, 0.6))
fit_ddm(data = dataAll, rts = "rt", responses = "response", id = "subject", group = c("cond1", "cond2"), startParams = startingValues)

Fit full drift-diffusion model for two-choice response time tasks using maximum likelihood estimation (parameters: a, v, t0, z, st0, sz, sv)

fit_ddmfull function fits seven-parameter (a, v, t0, z, st0, sz, sv) drift diffusion model to two-choice response time tasks using maximum likelihood estimation (R ucminf optimization). (st0: intertrial variability in non-decision time; sz: intertrial variability of starting point; sv: intertrial variability in drift rate) Works the same way as fit_ezddm and fit_ddm functions. Default starting parameters for maximum likelihood estimation are startParams = c(a = 2, v = 0.1, t0 = 0.3, z = 0.5, st0 = 0.1, sz = 0.1, sv = 0.1). To use the function, run source("https://raw.githubusercontent.com/hauselin/Rcode/master/fit_ddmfull.R").