output | always_allow_html |
---|---|
github_document |
true |
It is useful to be able to simulate data with a specified structure. The faux
package provides some functions to make this process easier. See the package website for more details.
You can install the released version of faux from CRAN with:
install.packages("faux")
And the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("debruine/faux")
This function creates a dataset with a specific between- and/or within-subjects design. see vignette
For example, the following creates a 2w*2b design with 100 observations in each cell. The between-subject factor is pet
with two levels (cat
and dog
). The within-subject factor is time
with two levels (day
and night
). The mean for the cat_day
cell is 10, the mean for the cat_night
cell is 20, the mean for the dog_day
cell is 15, and the mean for the dog_night
cell is 25. All cells have a SD of 5 and all within-subject cells are correlated r = 0.5
. The resulting data has exactly these values (set empirical = FALSE
to sample from a population with these values). Set plot = TRUE
to show a plot of means and SDs.
between <- list(pet = c(cat = "Cat Owners",
dog = "Dog Owners"))
within <- list(time = c("morning", "noon", "evening", "night"))
mu <- data.frame(
cat = c(10, 12, 14, 16),
dog = c(10, 15, 20, 25),
row.names = within$time
)
df <- sim_design(within, between,
n = 100, mu = mu, sd = 5, r = .5,
empirical = TRUE, plot = TRUE)
pet | n | var | morning | noon | evening | night | mean | sd |
---|---|---|---|---|---|---|---|---|
cat | 100 | evening | 0.5 | 0.5 | 1.0 | 0.5 | 14 | 5 |
cat | 100 | morning | 1.0 | 0.5 | 0.5 | 0.5 | 10 | 5 |
cat | 100 | night | 0.5 | 0.5 | 0.5 | 1.0 | 16 | 5 |
cat | 100 | noon | 0.5 | 1.0 | 0.5 | 0.5 | 12 | 5 |
dog | 100 | evening | 0.5 | 0.5 | 1.0 | 0.5 | 20 | 5 |
dog | 100 | morning | 1.0 | 0.5 | 0.5 | 0.5 | 10 | 5 |
dog | 100 | night | 0.5 | 0.5 | 0.5 | 1.0 | 25 | 5 |
dog | 100 | noon | 0.5 | 1.0 | 0.5 | 0.5 | 15 | 5 |
Table: Sample sim_design()
stats
You can plot the data from sim_design()
and swap the factor visualisations. see vignette
p1 <- plot_design(df)
p2 <- plot_design(df, "pet", "time")
cowplot::plot_grid(p1, p2, nrow = 2, align = "v")
This function produces a data table with the same distributions and correlations as an existing data table. It only returns numeric columns and simulates all numeric variables from a continuous normal distribution (for now). see vignette
For example, the following code creates a new sample from the built-in dataset iris
with 50 observations of each species.
new_iris <- sim_df(iris, 50, between = "Species")
This function produces a data table for a basic cross-classified design with random intercepts for subjects and items.
For example, the following code produces the data for 100 subjects responding to 50 items where the response has an overall mean (grand_i
) of 10. Subjects vary in their average response with an SD of 1, items vary in their average response with an SD of 2, and the residual error term has an SD of 3.
dat <- sim_mixed_cc(
sub_n = 100, # subject sample size
item_n = 50, # item sample size
grand_i = 10, # overall mean of the score
sub_sd = 1, # SD of subject random intercepts
item_sd = 2, # SD of item random intercepts
error_sd = 3 # SD of residual error
)
You can then see how changing these numbers affects the random effects in an intercept-only mixed effects model.
lme4::lmer(y ~ 1 + (1 | sub_id) + (1 | item_id), data = dat) %>%
broom.mixed::tidy() %>%
knitr::kable(digits = 2)
#> Registered S3 method overwritten by 'broom.mixed':
#> method from
#> tidy.gamlss broom
effect | group | term | estimate | std.error | statistic |
---|---|---|---|---|---|
fixed | NA | (Intercept) | 9.79 | 0.28 | 34.83 |
ran_pars | sub_id | sd__(Intercept) | 0.99 | NA | NA |
ran_pars | item_id | sd__(Intercept) | 1.84 | NA | NA |
ran_pars | Residual | sd__Observation | 2.95 | NA | NA |
This function uses lme4::lmer()
to get subject, item and error SDs from an existing dataset and simulates a new dataset with the specified number of subjects and items with distributions drawn from the example data.
new_dat <- sim_mixed_df(fr4,
sub_n = 100,
item_n = 50,
dv = "rating",
sub_id = "rater_id",
item_id = "face_id")
This function makes multiple normally distributed vectors with specified parameters and relationships. see vignette
For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.
dat <- rnorm_multi(
n = 100,
mu = c(0, 20, 20),
sd = c(1, 5, 5),
r = c(0.5, 0.5, 0.25),
varnames = c("A", "B", "C"),
empirical = FALSE
)
#> �[32mThe number of variables (vars) was guessed from the input to be 3�[39m
n | var | A | B | C | mean | sd |
---|---|---|---|---|---|---|
100 | A | 1.00 | 0.53 | 0.55 | 0.02 | 1.08 |
100 | B | 0.53 | 1.00 | 0.29 | 20.18 | 5.15 |
100 | C | 0.55 | 0.29 | 1.00 | 20.18 | 5.25 |
Table: Sample rnorm_multi()
stats
This function creates a vector that has a specified correlation with an existing vector.
# create a pre-existing vector x
x <- rnorm(100, 0, 1)
# create a vector y with exactly mean=0, sd=1, and r(x,y)=0.5
y <- rnorm_pre(x, mu = 0, sd = 1, r = 0.5, empirical = TRUE)
list(
mean = mean(y),
sd = sd(y),
r = cor(x,y)
) %>% str()
#> List of 3
#> $ mean: num -1.24e-17
#> $ sd : num 1
#> $ r : num 0.5
If empirical = FALSE
(the default), this resulting vector is sampled from a population with the specified parameters (but won't have exactly those properties).
Sometimes you want to mess up a dataset for teaching (thanks for the idea, Emily!). The messy()
function will replace prop
proportion of the data in the specified columns with the value of replace
(defaults to NA
).
# replace 10% of Species with NA
iris2 <- messy(iris, 0.1, "Species")
# replace 10% of petal.Width adn Sepal.Width with NA
iris3 <- messy(iris, 0.1, "Petal.Width", "Sepal.Width")
# replace 50% of columns 1-2 with NA
iris4 <- messy(iris, 0.5, 1:2)
# replace 50% of Species with "NOPE"
iris5 <- messy(iris, 0.5, "Species", replace = "NOPE")
If you want to check your simulated stats or just describe an existing dataset, use get_params()
.
get_params(iris)
n | var | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd |
---|---|---|---|---|---|---|---|
150 | Petal.Length | 0.87 | -0.43 | 1.00 | 0.96 | 3.76 | 1.77 |
150 | Petal.Width | 0.82 | -0.37 | 0.96 | 1.00 | 1.20 | 0.76 |
150 | Sepal.Length | 1.00 | -0.12 | 0.87 | 0.82 | 5.84 | 0.83 |
150 | Sepal.Width | -0.12 | 1.00 | -0.43 | -0.37 | 3.06 | 0.44 |
You can also group your data and change the digits to round.
get_params(iris,
between = "Species",
digits = 3)
Species | n | var | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd |
---|---|---|---|---|---|---|---|---|
setosa | 50 | Petal.Length | 0.27 | 0.18 | 1.00 | 0.33 | 1.46 | 0.17 |
setosa | 50 | Petal.Width | 0.28 | 0.23 | 0.33 | 1.00 | 0.25 | 0.11 |
setosa | 50 | Sepal.Length | 1.00 | 0.74 | 0.27 | 0.28 | 5.01 | 0.35 |
setosa | 50 | Sepal.Width | 0.74 | 1.00 | 0.18 | 0.23 | 3.43 | 0.38 |
versicolor | 50 | Petal.Length | 0.75 | 0.56 | 1.00 | 0.79 | 4.26 | 0.47 |
versicolor | 50 | Petal.Width | 0.55 | 0.66 | 0.79 | 1.00 | 1.33 | 0.20 |
versicolor | 50 | Sepal.Length | 1.00 | 0.53 | 0.75 | 0.55 | 5.94 | 0.52 |
versicolor | 50 | Sepal.Width | 0.53 | 1.00 | 0.56 | 0.66 | 2.77 | 0.31 |
virginica | 50 | Petal.Length | 0.86 | 0.40 | 1.00 | 0.32 | 5.55 | 0.55 |
virginica | 50 | Petal.Width | 0.28 | 0.54 | 0.32 | 1.00 | 2.03 | 0.27 |
virginica | 50 | Sepal.Length | 1.00 | 0.46 | 0.86 | 0.28 | 6.59 | 0.64 |
virginica | 50 | Sepal.Width | 0.46 | 1.00 | 0.40 | 0.54 | 2.97 | 0.32 |
It is useful for IDs for random effects (e.g., subjects or stimuli) to be character strings (so you don't accidentally include them as fixed effects) with the same length s(o you can sort them in order like S01, S02,..., S10 rather than S1, S10, S2, ...) This function returns a list of IDs that have the same string length and a specified prefix.
make_id(n = 10, prefix = "ITEM_")
#> [1] "ITEM_01" "ITEM_02" "ITEM_03" "ITEM_04" "ITEM_05" "ITEM_06" "ITEM_07"
#> [8] "ITEM_08" "ITEM_09" "ITEM_10"
You can also manually set the number of digits and set n
to a range of integers.
make_id(n = 10:20, digits = 3)
#> [1] "S010" "S011" "S012" "S013" "S014" "S015" "S016" "S017" "S018" "S019"
#> [11] "S020"
Convert a data table made with faux from long to wide.
between <- list("pet" = c("cat", "dog"))
within <- list("time" = c("day", "night"))
df_long <- sim_design(within, between, long = TRUE, plot = FALSE)
df_wide <- long2wide(df_long)
id | pet | day | night | |
---|---|---|---|---|
1.S001.1 | S001 | cat | 0.5230799 | 0.5415126 |
1.S002.1 | S002 | cat | -1.2129722 | -0.6874280 |
1.S003.1 | S003 | cat | -0.5516822 | -0.7844739 |
1.S004.1 | S004 | cat | -0.6098251 | -0.6234040 |
1.S005.1 | S005 | cat | 0.2853363 | 2.3027620 |
1.S006.1 | S006 | cat | -0.2490935 | -0.6530662 |
If you have a data table not made by faux, you need to specify the within-subject columns, the between-subject columns, the DV column, and the ID column.
# make a long data table
df_long <- expand.grid(
sub_id = 1:10,
A = c("A1", "A2"),
B = c("B1", "B2")
)
df_long$C <- rep(c("C1", "C2"), 20)
df_long$score <- rnorm(40)
# convert it to wide
df_wide <- long2wide(df_long, within = c("A", "B"),
between = "C", dv = "score", id = "sub_id")
sub_id | C | A1_B1 | A2_B1 | A1_B2 | A2_B2 |
---|---|---|---|---|---|
1 | C1 | -0.3782026 | 0.0048040 | -0.5674500 | 0.6154273 |
2 | C2 | 0.5338898 | -1.6407051 | 1.7815958 | -2.1325543 |
3 | C1 | 1.2792787 | 1.4960844 | 0.6207239 | 1.7913821 |
4 | C2 | 0.1885384 | -0.0884940 | -0.7397000 | 0.0295121 |
5 | C1 | -0.3090509 | -0.3942107 | 0.9916406 | -0.2951152 |
6 | C2 | 1.4168226 | 0.8341336 | 0.0634940 | -1.1571559 |
You can convert a data table made by faux from wide to long easily.
between <- list("pet" = c("cat", "dog"))
within <- list("time" = c("day", "night"))
df_wide <- sim_design(within, between, long = FALSE, plot = FALSE)
df_long <- wide2long(df_wide)
id | pet | time | y | |
---|---|---|---|---|
S001.1 | S001 | cat | day | -0.9012830 |
S002.1 | S002 | cat | day | -0.3205931 |
S003.1 | S003 | cat | day | -0.7733501 |
S004.1 | S004 | cat | day | -1.1861210 |
S005.1 | S005 | cat | day | -1.0848848 |
S006.1 | S006 | cat | day | -0.4888620 |
If you have a data table not made by faux, you need to specify the within-subject factors and columns, and specify the names of the ID and DV columns to create.
If column names are combinations of factor levels (e.g., A1_B1, A1_B2, A2_B1, A2_B2), then you can specify the regex pattern to separate them with the argument sep
(which defaults to _
).
long_iris <- wide2long(
iris,
within_factors = c("feature", "dimension"),
within_cols = 1:4,
dv = "value",
id = "flower_id",
sep = "\\."
)
flower_id | Species | feature | dimension | value | |
---|---|---|---|---|---|
S001.1 | S001 | setosa | Sepal | Length | 5.1 |
S002.1 | S002 | setosa | Sepal | Length | 4.9 |
S003.1 | S003 | setosa | Sepal | Length | 4.7 |
S004.1 | S004 | setosa | Sepal | Length | 4.6 |
S005.1 | S005 | setosa | Sepal | Length | 5.0 |
S006.1 | S006 | setosa | Sepal | Length | 5.4 |
If you have a data table in long format, you can recover the design from it by specifying the dv and id columns (assuming all other columns are within- or between-subject factors).
design <- get_design_long(long_iris, dv = "value", id = "flower_id")
Then you can use json_design()
to save the design to a file or view it in JSON format (condensed or pretty).
json_design(design)
{"within":{"feature":{"Sepal":"Sepal","Petal":"Petal"},"dimension":{"Length":"Length","Width":"Width"}},"between":{"Species":{"setosa":"setosa","versicolor":"versicolor","virginica":"virginica"}},"dv":{"value":"value"},"id":{"flower_id":"flower_id"},"n":{"setosa":50,"versicolor":50,"virginica":50},"mu":{"setosa":{"Sepal_Length":5.006,"Sepal_Width":3.428,"Petal_Length":1.462,"Petal_Width":0.246},"versicolor":{"Sepal_Length":5.936,"Sepal_Width":2.77,"Petal_Length":4.26,"Petal_Width":1.326},"virginica":{"Sepal_Length":6.588,"Sepal_Width":2.974,"Petal_Length":5.552,"Petal_Width":2.026}},"sd":{"setosa":{"Sepal_Length":0.35248969,"Sepal_Width":0.37906437,"Petal_Length":0.173664,"Petal_Width":0.10538559},"versicolor":{"Sepal_Length":0.51617115,"Sepal_Width":0.31379832,"Petal_Length":0.46991098,"Petal_Width":0.19775268},"virginica":{"Sepal_Length":0.63587959,"Sepal_Width":0.32249664,"Petal_Length":0.5518947,"Petal_Width":0.27465006}},"r":{"setosa":[[1,0.74254669,0.26717576,0.27809835],[0.74254669,1,0.17769997,0.23275201],[0.26717576,0.17769997,1,0.33163004],[0.27809835,0.23275201,0.33163004,1]],"versicolor":[[1,0.52591072,0.75404896,0.54646107],[0.52591072,1,0.56052209,0.66399872],[0.75404896,0.56052209,1,0.78666809],[0.54646107,0.66399872,0.78666809,1]],"virginica":[[1,0.45722782,0.86422473,0.28110771],[0.45722782,1,0.40104458,0.53772803],[0.86422473,0.40104458,1,0.32210822],[0.28110771,0.53772803,0.32210822,1]]}}
json_design(design, pretty = TRUE)
{ "within": { "feature": { "Sepal": "Sepal", "Petal": "Petal" }, "dimension": { "Length": "Length", "Width": "Width" } }, "between": { "Species": { "setosa": "setosa", "versicolor": "versicolor", "virginica": "virginica" } }, "dv": { "value": "value" }, "id": { "flower_id": "flower_id" }, "n": { "setosa": 50, "versicolor": 50, "virginica": 50 }, "mu": { "setosa": { "Sepal_Length": 5.006, "Sepal_Width": 3.428, "Petal_Length": 1.462, "Petal_Width": 0.246 }, "versicolor": { "Sepal_Length": 5.936, "Sepal_Width": 2.77, "Petal_Length": 4.26, "Petal_Width": 1.326 }, "virginica": { "Sepal_Length": 6.588, "Sepal_Width": 2.974, "Petal_Length": 5.552, "Petal_Width": 2.026 } }, "sd": { "setosa": { "Sepal_Length": 0.35248969, "Sepal_Width": 0.37906437, "Petal_Length": 0.173664, "Petal_Width": 0.10538559 }, "versicolor": { "Sepal_Length": 0.51617115, "Sepal_Width": 0.31379832, "Petal_Length": 0.46991098, "Petal_Width": 0.19775268 }, "virginica": { "Sepal_Length": 0.63587959, "Sepal_Width": 0.32249664, "Petal_Length": 0.5518947, "Petal_Width": 0.27465006 } }, "r": { "setosa": [ [1, 0.74254669, 0.26717576, 0.27809835], [0.74254669, 1, 0.17769997, 0.23275201], [0.26717576, 0.17769997, 1, 0.33163004], [0.27809835, 0.23275201, 0.33163004, 1] ], "versicolor": [ [1, 0.52591072, 0.75404896, 0.54646107], [0.52591072, 1, 0.56052209, 0.66399872], [0.75404896, 0.56052209, 1, 0.78666809], [0.54646107, 0.66399872, 0.78666809, 1] ], "virginica": [ [1, 0.45722782, 0.86422473, 0.28110771], [0.45722782, 1, 0.40104458, 0.53772803], [0.86422473, 0.40104458, 1, 0.32210822], [0.28110771, 0.53772803, 0.32210822, 1] ] } }
Not all correlation matrices are possible. For example, if variables A and B are correlated with r = 1.0, then the correlation between A and C can only be exactly equal to the correlation between B and C.
The function pos_def_limits()
lets you know what the possible range of values is for the missing value in a correlation matrix with one missing value. The correlation values are entered just from the top right triangle of the matrix, with a single NA
for the missing value.
lims <- pos_def_limits(.8, .2, NA)
x |
---|
-0.42 |
x |
---|
0.74 |
For example, if rAB = 0.8 and rAC = 0.2, then -0.42 <= rBC <= 0.74.
If you enter a correlation matrix that contains impossible combinations, your limits will be NA
.
lims <- pos_def_limits(.8, .2, 0,
-.5, NA,
.2)
x |
---|
NA |
x |
---|
NA |
If you have a full matrix and want to know if it is positive definite, you can use the following code:
c(.2, .3, .4, .2,
.3, -.1, .2,
.4, .5,
.3) %>%
cormat_from_triangle() %>%
is_pos_def()
#> [1] TRUE
matrix(c(1, .3, -.9, .2,
.3, 1, .4, .5,
-.9, .4, 1, .3,
.2, .5, .3, 1), 4) %>%
is_pos_def()
#> [1] FALSE
Please note that the [34m'faux'[39m project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.