Qualtrics Import Workshop

Ungson, Nick D.

Created: 2018-11-26 | Last updated: 2018-12-05 09:17:09

In this document, I will walk you through my process of using R to (1) Import and tidy raw Qualtrics data, (2) Calculate variables, and (3) Analyze that data.

This document contains code snippets and their respective output, but you can easily copy and paste the code below and adapt it for your own data and analyses!

Preliminaries

Study Design and Variables

The sample data come from a 3 (Tweet: animal vs. funny vs. baseline) x 2 (Animal: cat vs. dog) mixed design study.

Independent Variables:

  • Tweet was the between-subjects variable: Participants either viewed (1) funny animal-themed tweets [Animal condition], (2) funny non-animal tweets [Funny condition], or (3) no tweets [Baseline].
  • Animal was the within-subjects variable. Participants completed the shortened Big Five Inventory (BFI, see below) in the third-“person” with regard to a (1) dog and (2) cat, in counterbalanced order.

Dependent Variable: The BFI has five subscales: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, Neuroticism. Extraversion and Agreeableness were measured using three items; the remaining subscales were measured using two items. Each item was scored on a scale of 1 to 5, with higher scores indicating "more" of that trait.

Other Measured Variables: For each of the six tweets (if participants were assigned to the Animal or Funny conditions), participants answered three questions regarding perceived funniness, likelihood to "fav", and likelihood to "share" that tweet on a scale from 1 (not at all funny/not at all likely) to 5 (extremely funny/extremely likely).

Load R Packages

First, load the tidyverse package. If you don't have this package installed, running install.packages("tidyverse") (which is commented out in the code below) will install it. You only need to install each package once, but you need to load it using library() or require() every time you start a new R session. (The two are nearly interchangeable for this purpose; the difference is that library() stops with an error if the package is missing, whereas require() only warns and returns FALSE.) The tidyverse package includes and automatically loads other cool packages, like ggplot2 and dplyr, so you don't have to load those individually if you are loading the entire "tidyverse," so to speak.

#install.packages("tidyverse")
require(tidyverse)
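
To see how the two loading functions behave when a package is missing, here is a small sketch. The package name is made up for illustration (it is assumed not to exist on your machine), so nothing is actually loaded.

```r
# Hypothetical package name, assumed NOT to be installed anywhere
pkg <- "notARealPackage12345"

# require() warns and returns FALSE when the package is missing
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)

# library() raises an error instead; tryCatch() captures it so the script keeps going
lib_result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

print(loaded)
print(lib_result)
```

In practice this means library() fails loudly at the top of a script, while require() can let a script continue and fail later in a more confusing place.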

Import and Tidy

Import raw data

First, import the "fresh" raw data from Qualtrics and save as a dataframe called raw. As long as the current R Markdown document is in the same directory as "raw qualtrics data.csv", the full file name is all you need to put in read.csv().

raw <- read.csv("raw qualtrics data.csv", 
                header = TRUE, 
                stringsAsFactors = FALSE)

Before moving on, convert raw to a tibble (pun on "table"?), which works more nicely than a plain data frame with the dplyr package (which is automatically loaded as part of the tidyverse).

# coerce to tibble
raw <- as_tibble(raw)

Tidy

Remove superfluous rows. First, let's take a look at the first 10 rows and first 5 columns of raw.

raw[1:10, 1:5]
## # A tibble: 10 x 5
##    StartDate          EndDate          Status     IPAddress    Progress   
##    <chr>              <chr>            <chr>      <chr>        <chr>      
##  1 Start Date         End Date         Response ~ IP Address   Progress   
##  2 "{\"ImportId\":\"~ "{\"ImportId\":~ "{\"Impor~ "{\"ImportI~ "{\"Import~
##  3 2018-11-19 13:43:~ 2018-11-19 13:4~ 0          128.180.132~ 100        
##  4 2018-11-19 13:52:~ 2018-11-19 13:5~ 0          128.180.132~ 100        
##  5 2018-11-19 14:03:~ 2018-11-19 14:0~ 0          64.121.123.~ 100        
##  6 2018-11-19 14:05:~ 2018-11-19 14:1~ 0          128.180.107~ 100        
##  7 2018-11-19 14:13:~ 2018-11-19 14:1~ 0          128.180.96.~ 100        
##  8 2018-11-19 14:17:~ 2018-11-19 14:2~ 0          70.167.95.1~ 100        
##  9 2018-11-19 14:24:~ 2018-11-19 14:2~ 0          149.31.125.~ 100        
## 10 2018-11-19 14:39:~ 2018-11-19 14:4~ 0          149.31.82.1~ 100

Something that should jump out at you is that rows 1 and 2 contain unnecessary information. Row 1 simply repeats the question labels (for later columns it contains the question wording, which you might need for some studies). Row 2 contains metadata for each question. For our purposes, you don't want either of them. Remove rows 1 and 2 by using indexing ([]) with the negative sign (-) to exclude them (-c(1:2)), and overwrite raw.

# remove 1st and 2nd row (unnecessary qualtrics metadata)
raw <- raw[-c(1:2), ]
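
The row removal can be tried on a tiny stand-in file. This is only a sketch: the file contents below are made up to mimic the three header-like lines of a Qualtrics export. It also shows one possible alternative, which is to read the column names from line 1 and then skip all three header lines so that R can guess the column types itself.

```r
# Toy stand-in for a Qualtrics export (made-up values):
# line 1 = variable names, line 2 = question labels, line 3 = metadata
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Progress,age",
  "Progress,What is your age?",
  "{ImportId:progress},{ImportId:QID1}",
  "100,25",
  "99,31"
), tmp)

# Approach used in this document: read everything, then drop rows 1 and 2
toy_raw <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
toy_raw <- toy_raw[-c(1:2), ]

# The columns are still character, because the label rows forced text
str(toy_raw)

# Alternative: take the names from line 1, then skip the three header lines
col_names <- names(read.csv(tmp, nrows = 1))
toy_skip <- read.csv(tmp, header = FALSE, skip = 3, col.names = col_names)
str(toy_skip)
```

With the first approach, every column arrives as text and has to be converted later; with the second, numeric columns are read as numbers right away.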

Remove superfluous columns. Next, let's take a look at the column headings in raw.

colnames(raw)
##  [1] "StartDate"             "EndDate"              
##  [3] "Status"                "IPAddress"            
##  [5] "Progress"              "Duration..in.seconds."
##  [7] "Finished"              "RecordedDate"         
##  [9] "ResponseId"            "RecipientLastName"    
## [11] "RecipientFirstName"    "RecipientEmail"       
## [13] "ExternalReference"     "LocationLatitude"     
## [15] "LocationLongitude"     "DistributionChannel"  
## [17] "UserLanguage"          "anim1_1"              
## [19] "anim1_2"               "anim1_3"              
## [21] "anim2_1"               "anim2_2"              
## [23] "anim2_3"               "anim3_1"              
## [25] "anim3_2"               "anim3_3"              
## [27] "anim4_1"               "anim4_2"              
## [29] "anim4_3"               "anim5_1"              
## [31] "anim5_2"               "anim5_3"              
## [33] "anim6_1"               "anim6_2"              
## [35] "anim6_3"               "fun1_1"               
## [37] "fun1_2"                "fun1_3"               
## [39] "fun2_1"                "fun2_2"               
## [41] "fun2_3"                "fun3_1"               
## [43] "fun3_2"                "fun3_3"               
## [45] "fun4_1"                "fun4_2"               
## [47] "fun4_3"                "fun5_1"               
## [49] "fun5_2"                "fun5_3"               
## [51] "fun6_1"                "fun6_2"               
## [53] "fun6_3"                "dog_extra1r"          
## [55] "dog_extra3"            "dog_agree1"           
## [57] "dog_agree3"            "dog_cons1r"           
## [59] "dog_neur1r"            "dog_open1r"           
## [61] "dog_extra2"            "dog_agree2r"          
## [63] "dog_cons2"             "dog_neur2"            
## [65] "dog_open2"             "cat_extra1r"          
## [67] "cat_extra3"            "cat_agree1"           
## [69] "cat_agree3"            "cat_cons1r"           
## [71] "cat_neur1r"            "cat_open1r"           
## [73] "cat_extra2"            "cat_agree2r"          
## [75] "cat_cons2"             "cat_neur2"            
## [77] "cat_open2"             "age"                  
## [79] "gender"                "gender_fr"            
## [81] "lehigh_status"         "check_manipulation"   
## [83] "condition"

The variable $anim1_1 is the first variable that is part of the experimental study. However, you can see that Qualtrics has added 17 variables/columns to the beginning of our data. Sometimes you may want to retain some of them (e.g., $StartDate or $EndDate), but for now let's only keep study progress ($Progress), measured from 0-100, and time of completion ($Duration..in.seconds.), measured in seconds.

Importantly, since you are now excluding variables, create a new data frame called data and use the select() function to choose which variables to keep from raw and to rename when necessary (e.g., $Duration..in.seconds. is renamed as $time_sec). This way, the raw data remains relatively untouched in case you need to go back to it later.

Note: The code below uses the pipe operator %>%, which dplyr makes available (it originates in the magrittr package). Basically, the operator takes whatever came before it (below, the raw data) and subjects it to what comes after: selecting variables using select(). One great thing about dplyr is you won't have to use the $ operator to identify variables as much.
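
As a quick illustration of what the pipe does, the sketch below runs the same select() call two ways on a made-up three-column tibble (the column junk is hypothetical). Both versions give the same result.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(Progress = c(100, 59),
              age      = c(25, 31),
              junk     = c("a", "b"))

# Nested call: the data is the first argument of select()
nested <- select(toy, progress = Progress, age)

# Piped call: %>% passes `toy` in as that first argument
piped <- toy %>%
  select(progress = Progress, age)

identical(nested, piped)
```

The piped form pays off once several steps are chained, because the code reads top to bottom in the order the steps happen.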

data <- raw %>% 
  select(progress = Progress,
         time_sec = Duration..in.seconds., 
         # select all variables between $anim1_1 and $anim6_3
         anim1_1:anim6_3, 
         # select all variables that contain "fun"
         contains("fun"), 
         # select all dog and cat variables in order I specify
         dog_extra1r, dog_extra2, dog_extra3, 
         dog_agree1, dog_agree2r, dog_agree3, 
         dog_cons1r, dog_cons2, 
         dog_neur1r, dog_neur2, 
         dog_open1r, dog_open2, 
         cat_extra1r, cat_extra2, cat_extra3, 
         cat_agree1, cat_agree2r, cat_agree3, 
         cat_cons1r, cat_cons2, 
         cat_neur1r, cat_neur2, 
         cat_open1r, cat_open2, 
         # select all variables between age and condition
         age:condition)

Great! Now we have a working data dataframe that we can start to mess around with. Remember, you can always do stuff like colnames(data), glimpse(data), or str(data) to investigate anything further.

# look at variables in 'data'
colnames(data)
##  [1] "progress"           "time_sec"           "anim1_1"           
##  [4] "anim1_2"            "anim1_3"            "anim2_1"           
##  [7] "anim2_2"            "anim2_3"            "anim3_1"           
## [10] "anim3_2"            "anim3_3"            "anim4_1"           
## [13] "anim4_2"            "anim4_3"            "anim5_1"           
## [16] "anim5_2"            "anim5_3"            "anim6_1"           
## [19] "anim6_2"            "anim6_3"            "fun1_1"            
## [22] "fun1_2"             "fun1_3"             "fun2_1"            
## [25] "fun2_2"             "fun2_3"             "fun3_1"            
## [28] "fun3_2"             "fun3_3"             "fun4_1"            
## [31] "fun4_2"             "fun4_3"             "fun5_1"            
## [34] "fun5_2"             "fun5_3"             "fun6_1"            
## [37] "fun6_2"             "fun6_3"             "dog_extra1r"       
## [40] "dog_extra2"         "dog_extra3"         "dog_agree1"        
## [43] "dog_agree2r"        "dog_agree3"         "dog_cons1r"        
## [46] "dog_cons2"          "dog_neur1r"         "dog_neur2"         
## [49] "dog_open1r"         "dog_open2"          "cat_extra1r"       
## [52] "cat_extra2"         "cat_extra3"         "cat_agree1"        
## [55] "cat_agree2r"        "cat_agree3"         "cat_cons1r"        
## [58] "cat_cons2"          "cat_neur1r"         "cat_neur2"         
## [61] "cat_open1r"         "cat_open2"          "age"               
## [64] "gender"             "gender_fr"          "lehigh_status"     
## [67] "check_manipulation" "condition"

As part of importing data, R makes choices about what data type each variable is (e.g., numeric, string, etc.). Here, the extra label and metadata rows caused every column to be read as text, so the code below ensures that the variables I want to be numeric are actually recognized as numeric by R. We'll do this using the now-familiar pipe operator (%>%) and mutate_at(). For this example, I'm going to make everything numeric except $gender_fr (column 65), a free response item on which participants could specify their gender.

# coerce variables to numeric using column index
data <- data %>% 
  mutate_at(vars(1:64, 66:68), as.numeric)
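
Selecting columns by position can break if the column order ever changes. A possible alternative, assuming dplyr 1.0.0 or later is installed, is across(), which lets you exclude $gender_fr by name instead. The toy tibble below is made up for illustration.

```r
library(dplyr)

# Made-up data: everything stored as text, as after a raw Qualtrics import
toy <- tibble(progress  = c("100", "99"),
              age       = c("25", "31"),
              gender_fr = c("", "nonbinary"))

# Convert every column except gender_fr to numeric
toy_num <- toy %>%
  mutate(across(-gender_fr, as.numeric))

glimpse(toy_num)
```

This is only a sketch of the same idea; mutate_at() with column positions gives the same result as long as the positions are right.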

Exclude subjects

For the current study, we will exclude any participants who did not complete the study. First, use table() to look at the frequencies of $progress. (Note: table() is my go-to function for getting frequencies; e.g., gender)

# frequency of progress
table(data$progress)
## 
##   1   2  31  35  40  44  59  75  91  99 100 
##   1   1   3   2   2   1  10   6   1  12 105

105 participants completely finished ($progress == 100), 12 got 99% of the way through, and then there's a smattering of others. This is something you'll have to decide for your studies; for now, exclude anyone who was below 99% completion.

Notice that the code below uses the pipe operator %>% again. However, instead of using select() to identify variables/columns to keep/drop, use filter() to identify participants/rows to keep/drop:

# keep only participants with at least 99% completion
data <- data %>% 
  filter(progress >= 99)

Next, exclude any participants who failed the manipulation check; participants were asked to identify which tweet condition they were exposed to. Before that, though, let's see how many participants failed the check using nrow() and indexing. Note that you must use data$ before variables because we are not using dplyr for this:

# in how many rows does check_manipulation not equal condition?
nrow(data[data$check_manipulation != data$condition, ])
## [1] 7

OK, so 7 check failures. Use filter() to only include those who passed the check (i.e., if check_manipulation == condition).

data <- data %>% 
  filter(check_manipulation == condition)
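
One thing to watch for with this kind of count is missing values. The sketch below uses made-up data in which one participant skipped the manipulation check. Indexing with != produces an extra all-NA row that nrow() counts, whereas summing the comparison with na.rm = TRUE counts only true mismatches. (filter() itself drops rows where the comparison is NA.)

```r
# Made-up data: participant 3 skipped the manipulation check
toy <- data.frame(check_manipulation = c(1, 2, NA, 3),
                  condition          = c(1, 3, 2, 3))

# Indexing with an NA in the logical vector returns an all-NA row,
# so this count includes the participant who skipped the item
n_index <- nrow(toy[toy$check_manipulation != toy$condition, ])

# Count only genuine mismatches
n_failed <- sum(toy$check_manipulation != toy$condition, na.rm = TRUE)

# Count skipped checks separately
n_missing <- sum(is.na(toy$check_manipulation))

c(n_index = n_index, n_failed = n_failed, n_missing = n_missing)
```

Whether a skipped check should count as a failure is a decision for your own study; the point is just to know which of the two you are counting.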

Add participant number

Next, add a unique participant number variable. If your own study already includes a participant number variable or unique identifier, this would be unnecessary. But having a unique subject variable for each participant is crucial for calculating participant-level variables (e.g., scale means). You'll be using another dplyr verb: mutate(), which is used to compute new variables:

data <- data %>% 
  mutate(participant = 1:n())

Concatenate across between-subjects variables

The last thing to do to tidy this data is to deal with missing values (also known as NAs) that have arisen due to between-subjects variables. For example, participants in the Animal tweet condition made ratings about the animal tweets they saw (e.g., $anim1_1, $anim1_2) but did not answer--and therefore have missing values--for all tweet items corresponding to the Funny tweet condition (e.g., $fun1_1, $fun1_2). As a demonstration of this:

# sample of NA's across animal vs. funny tweet items
data[1:4, c(69, 3:5, 21:22)]
## # A tibble: 4 x 6
##   participant anim1_1 anim1_2 anim1_3 fun1_1 fun1_2
##         <int>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
## 1           1      NA      NA      NA      2      1
## 2           2       4       4       2     NA     NA
## 3           3      NA      NA      NA     NA     NA
## 4           4       4       2       3     NA     NA

Here we can see that participant #1 was in the Funny condition, participant #2 was in the Animal condition, and participant #3 was in the Baseline condition (NA for all items). To concatenate across these columns and create new variables, use coalesce() within the now-familiar mutate() function: coalesce() returns, for each row, the first value that is not missing. The code below may look clunky, and I'm sure you could write a function to do this more elegantly, but it definitely works.

data <- data %>% 
  mutate(tweet1_1 = coalesce(anim1_1, fun1_1), 
         tweet1_2 = coalesce(anim1_2, fun1_2),
         tweet1_3 = coalesce(anim1_3, fun1_3), 
         tweet2_1 = coalesce(anim2_1, fun2_1), 
         tweet2_2 = coalesce(anim2_2, fun2_2),
         tweet2_3 = coalesce(anim2_3, fun2_3), 
         tweet3_1 = coalesce(anim3_1, fun3_1), 
         tweet3_2 = coalesce(anim3_2, fun3_2),
         tweet3_3 = coalesce(anim3_3, fun3_3), 
         tweet4_1 = coalesce(anim4_1, fun4_1), 
         tweet4_2 = coalesce(anim4_2, fun4_2),
         tweet4_3 = coalesce(anim4_3, fun4_3), 
         tweet5_1 = coalesce(anim5_1, fun5_1), 
         tweet5_2 = coalesce(anim5_2, fun5_2),
         tweet5_3 = coalesce(anim5_3, fun5_3), 
         tweet6_1 = coalesce(anim6_1, fun6_1), 
         tweet6_2 = coalesce(anim6_2, fun6_2),
         tweet6_3 = coalesce(anim6_3, fun6_3))

# if you want to check out the variables
#colnames(data)
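
For what it's worth, here is one way the repetition could be reduced: a loop over the item suffixes that builds the column names with paste0(). This is a sketch on made-up data with only two items, not a drop-in replacement for the code above.

```r
library(dplyr)

# Made-up data: row 1 = Funny, row 2 = Animal, row 3 = Baseline
toy <- tibble(anim1_1 = c(NA, 4, NA), fun1_1 = c(2, NA, NA),
              anim1_2 = c(NA, 4, NA), fun1_2 = c(1, NA, NA))

# Item suffixes to combine; only two here to keep the example small
items <- c("1_1", "1_2")

for (item in items) {
  # e.g., tweet1_1 <- first non-missing of anim1_1 and fun1_1
  toy[[paste0("tweet", item)]] <- coalesce(toy[[paste0("anim", item)]],
                                           toy[[paste0("fun", item)]])
}

toy
```

For the full set of 18 items, the suffixes could be generated with paste0(rep(1:6, each = 3), "_", 1:3) rather than typed out.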

Tidy: Done!

Okay, the data is now pretty tidy. Many of the operations above were done separately to help explain the logic, but you could easily string many of the dplyr commands together depending on your needs/wants/desires. The code below would produce the exact same tidied data set.

data <- raw %>% 
  select(progress = Progress,
         time_sec = Duration..in.seconds., 
         anim1_1:anim6_3, 
         contains("fun"), 
         dog_extra1r, dog_extra2, dog_extra3, 
         dog_agree1, dog_agree2r, dog_agree3, 
         dog_cons1r, dog_cons2, 
         dog_neur1r, dog_neur2, 
         dog_open1r, dog_open2, 
         cat_extra1r, cat_extra2, cat_extra3, 
         cat_agree1, cat_agree2r, cat_agree3, 
         cat_cons1r, cat_cons2, 
         cat_neur1r, cat_neur2, 
         cat_open1r, cat_open2, 
         age:condition) %>% 
  mutate_at(vars(1:64, 66:68), as.numeric) %>% 
  filter(progress >= 99) %>% 
  filter(check_manipulation == condition) %>% 
  mutate(participant = 1:n(), 
         tweet1_1 = coalesce(anim1_1, fun1_1), 
         tweet1_2 = coalesce(anim1_2, fun1_2),
         tweet1_3 = coalesce(anim1_3, fun1_3), 
         tweet2_1 = coalesce(anim2_1, fun2_1), 
         tweet2_2 = coalesce(anim2_2, fun2_2),
         tweet2_3 = coalesce(anim2_3, fun2_3), 
         tweet3_1 = coalesce(anim3_1, fun3_1), 
         tweet3_2 = coalesce(anim3_2, fun3_2),
         tweet3_3 = coalesce(anim3_3, fun3_3), 
         tweet4_1 = coalesce(anim4_1, fun4_1), 
         tweet4_2 = coalesce(anim4_2, fun4_2),
         tweet4_3 = coalesce(anim4_3, fun4_3), 
         tweet5_1 = coalesce(anim5_1, fun5_1), 
         tweet5_2 = coalesce(anim5_2, fun5_2),
         tweet5_3 = coalesce(anim5_3, fun5_3), 
         tweet6_1 = coalesce(anim6_1, fun6_1), 
         tweet6_2 = coalesce(anim6_2, fun6_2),
         tweet6_3 = coalesce(anim6_3, fun6_3))

Calculate Variables

Now, it's time to calculate some variables:

  • BFI: Extraversion (cat and dog)
  • BFI: Openness to Experience (cat and dog)
  • BFI: Conscientiousness (cat and dog)
  • BFI: Neuroticism (cat and dog)
  • BFI: Agreeableness (cat and dog)
  • Positive evaluation of tweets

BFI dimensions

Extraversion was measured for each animal using 3 items:

  • $dog_extra1r (reverse-coded)
  • $dog_extra2
  • $dog_extra3

Below, for each participant, calculate the mean dog extraversion by averaging these three items in a new variable $dog_extra. Some things to note:

  • group_by() is used to ensure that whatever comes afterwards is applied separately to each "group" specified. In other words, if you did not include group_by(participant), you would calculate the grand mean of dog extraversion across all participants (i.e., everyone would have the same score). By including group_by(participant), you make sure that the mean calculation occurs for every "group" (participant). By this logic, you could also use group_by(condition) and this code would give you the mean dog extraversion for each of the three tweet conditions.
  • (6 - dog_extra1r) refers to the reverse-scoring of that item
  • The items are combined with c() before being passed to mean(). This matters: mean() averages only its first argument (the second positional argument is interpreted as trim), so mean(a, b, c) would not average the three items.
  • na.rm = TRUE tells R to ignore missing values. Otherwise, if a participant happened to skip one dog extraversion item, no mean would be calculated and NA would be returned for that participant. This way, R calculates the mean from all available data.
  • The code below also calculates cat extraversion, $cat_extra, using the same logic; remember, multiple variables can be created inside mutate() as long as you separate them with commas!

data <- data %>% 
  group_by(participant) %>% 
  mutate(dog_extra = mean(c(6 - dog_extra1r, dog_extra2, dog_extra3), na.rm = TRUE), 
         cat_extra = mean(c(6 - cat_extra1r, cat_extra2, cat_extra3), na.rm = TRUE))
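
A caveat worth checking when averaging several items: mean() treats only its first argument as data, so separate items need to be combined with c() first. The sketch below shows the difference, and then an alternative that needs no grouping at all, base R's rowMeans(). The data frame is made up for illustration.

```r
# mean() uses only the first argument as data; the 2 is taken as `trim`
wrong <- mean(4, 2, 3, na.rm = TRUE)     # returns 4
right <- mean(c(4, 2, 3), na.rm = TRUE)  # returns 3

# Made-up item responses for three participants
toy <- data.frame(dog_extra1r = c(2, 5, NA),
                  dog_extra2  = c(4, 1, 3),
                  dog_extra3  = c(5, 2, 4))

# rowMeans() averages across columns within each row;
# the first column is reverse-scored on the 1-5 scale
toy$dog_extra <- rowMeans(cbind(6 - toy$dog_extra1r,
                                toy$dog_extra2,
                                toy$dog_extra3),
                          na.rm = TRUE)

toy
```

rowMeans() is vectorized, so on large data sets it can be noticeably faster than computing one mean per participant.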

Neuroticism was measured for each animal using 2 items:

  • $dog_neur1r (reverse-coded)
  • $dog_neur2

Create means in the exact same way:

data <- data %>% 
  group_by(participant) %>% 
  mutate(dog_neur = mean(c(6 - dog_neur1r, dog_neur2), na.rm = TRUE), 
         cat_neur = mean(c(6 - cat_neur1r, cat_neur2), na.rm = TRUE))

...you could then easily do the same thing for the rest of the BFI variables:

data <- data %>% 
  group_by(participant) %>% 
  mutate(dog_agree = mean(c(6 - dog_agree2r, dog_agree1, dog_agree3), na.rm = TRUE), 
         cat_agree = mean(c(6 - cat_agree2r, cat_agree1, cat_agree3), na.rm = TRUE), 
         dog_cons = mean(c(6 - dog_cons1r, dog_cons2), na.rm = TRUE), 
         cat_cons = mean(c(6 - cat_cons1r, cat_cons2), na.rm = TRUE), 
         dog_open = mean(c(6 - dog_open1r, dog_open2), na.rm = TRUE), 
         cat_open = mean(c(6 - cat_open1r, cat_open2), na.rm = TRUE))

Tweet evaluation

data <- data %>% 
  group_by(participant) %>% 
  mutate(tweet_eval = mean(c(tweet1_1, tweet1_2, tweet1_3, 
                             tweet2_1, tweet2_2, tweet2_3, 
                             tweet3_1, tweet3_2, tweet3_3, 
                             tweet4_1, tweet4_2, tweet4_3, 
                             tweet5_1, tweet5_2, tweet5_3, 
                             tweet6_1, tweet6_2, tweet6_3), na.rm = TRUE))

Re-order variables

This is not strictly necessary, but I like to do one last "tidying" measure before starting with analyses: remove now-unnecessary variables (e.g., between-subjects items that are useless now that we've used coalesce() to concatenate; progress) and order variables to my liking. Remember, the order of variables passed through select() is the order they will be placed into resulting data.

data <- data %>% 
  select(participant, condition, 
         # all calculated variables
         dog_extra:tweet_eval, 
         # demographics, etc.
         time_sec, age:lehigh_status, 
         # individual tweet items (note: the individual BFI items are not retained here)
         tweet1_1:tweet6_3)

Take a moment to appreciate your clean data set:

colnames(data)
##  [1] "participant"   "condition"     "dog_extra"     "cat_extra"    
##  [5] "dog_neur"      "cat_neur"      "dog_agree"     "cat_agree"    
##  [9] "dog_cons"      "cat_cons"      "dog_open"      "cat_open"     
## [13] "tweet_eval"    "time_sec"      "age"           "gender"       
## [17] "gender_fr"     "lehigh_status" "tweet1_1"      "tweet1_2"     
## [21] "tweet1_3"      "tweet2_1"      "tweet2_2"      "tweet2_3"     
## [25] "tweet3_1"      "tweet3_2"      "tweet3_3"      "tweet4_1"     
## [29] "tweet4_2"      "tweet4_3"      "tweet5_1"      "tweet5_2"     
## [33] "tweet5_3"      "tweet6_1"      "tweet6_2"      "tweet6_3"

Optional: Export Data

At this point, you may want to export data in its cleaned form, perhaps as a .csv file. You can use the write.csv() function for that. Something to remember is to set row.names = FALSE to prevent R from creating a new column with row numbers in your new .csv file. Besides, we already have $participant anyway!

Below is an example of how you might go about it:

write.csv(data,
          file = "2018-12-05 clean data.csv", 
          row.names = FALSE)
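
To see what row.names = FALSE actually changes, the sketch below writes a made-up data frame to two temporary files and reads the column names back in.

```r
# Made-up data for illustration
toy <- data.frame(participant = 1:3, age = c(25, 31, 40))

with_rn    <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(toy, file = with_rn)                        # default: row names written
write.csv(toy, file = without_rn, row.names = FALSE)  # no row-name column

names(read.csv(with_rn))     # extra unnamed first column, read back in as "X"
names(read.csv(without_rn))  # just the original columns
```

The stray "X" column is harmless but tends to pile up if a file is written and re-read several times.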

Analyze

Descriptives

There are a few ways to get descriptive statistics for your variables. I'll demonstrate a couple of them here.

You can get them "individually", like so:

mean(data$age)
## [1] 32.06364
sd(data$age)
## [1] 10.05971

You can use functions to return several statistics. For example, the stat.desc() function in the pastecs package:

#install.packages("pastecs")
require(pastecs)
stat.desc(data$age)
##      nbr.val     nbr.null       nbr.na          min          max 
##  110.0000000    0.0000000    0.0000000   20.0000000   72.0000000 
##        range          sum       median         mean      SE.mean 
##   52.0000000 3527.0000000   30.0000000   32.0636364    0.9591556 
## CI.mean.0.95          var      std.dev     coef.var 
##    1.9010153  101.1977481   10.0597091    0.3137420

A fun thing about named vectors like the one stat.desc() returns is that you can use [] to index and pull out specific values. For example, if you wanted to just return the mean of $age, you know that mean is the 9th element in that object, so just pull it out!

# see what's "inside" stat.desc(data$age)
names(stat.desc(data$age))
##  [1] "nbr.val"      "nbr.null"     "nbr.na"       "min"         
##  [5] "max"          "range"        "sum"          "median"      
##  [9] "mean"         "SE.mean"      "CI.mean.0.95" "var"         
## [13] "std.dev"      "coef.var"
# pull out mean
stat.desc(data$age)[9]
##     mean 
## 32.06364
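
Indexing by position works, but indexing by name is usually safer, since it does not depend on remembering that the mean happens to be the 9th element. The sketch below builds a small named vector by hand from made-up ages; the same ["mean"] style of indexing should apply to any named vector.

```r
# Made-up ages for illustration
ages <- c(20, 30, 30, 48)

# A small named vector of descriptives
age_stats <- c(n = length(ages), mean = mean(ages), sd = sd(ages))

age_stats["mean"]  # pull out by name
age_stats[2]       # pull out by position (same element)
```

If the order of the statistics ever changes, the name-based version keeps working while the position-based one silently returns something else.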

Another way is using the describe() function from the psych package:

require(psych)
## Loading required package: psych

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(data$age)
##    vars   n  mean    sd median trimmed  mad min max range skew kurtosis
## X1    1 110 32.06 10.06     30   30.09 2.97  20  72    52 1.99     3.71
##      se
## X1 0.96

And the last way demonstrated here is using our old friend dplyr, but now using the summarize() function. (As a fun twist, I've added group_by(condition), which shows us the mean age by condition.) Note that, since <- does not appear anywhere in the code below, nothing is saved or overwritten; the results are only printed.

# mean age for each condition
data %>% 
  group_by(condition) %>% 
  summarize(age_m = mean(age, na.rm = TRUE))
## # A tibble: 3 x 2
##   condition age_m
##       <dbl> <dbl>
## 1         1  31.1
## 2         2  31.6
## 3         3  33.3
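
summarize() can return several statistics at once, which is handy for a quick descriptives table. Here is a sketch on made-up data, adding the group size with n() and the standard deviation alongside the mean.

```r
library(dplyr)

# Made-up data: three conditions, one missing age
toy <- tibble(condition = c(1, 1, 2, 2, 3, 3),
              age       = c(20, 30, 25, NA, 40, 50))

toy_summary <- toy %>%
  group_by(condition) %>%
  summarize(n      = n(),                       # rows per condition
            age_m  = mean(age, na.rm = TRUE),   # mean of available ages
            age_sd = sd(age, na.rm = TRUE))     # NA when only one age is available

toy_summary
```

Note that n() counts rows, including those with a missing age, so it is the group size rather than the number of valid age responses.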

ANOVA

Run a one-way ANOVA on perceived extraversion: Did participants in different Tweet conditions perceive more extraversion in animals, averaging across dogs and cats?

There are many ways to conduct all types of analyses in R, but here we'll use the simple aov() function.

First, I'm going to calculate a new variable that is mean extraversion across dogs and cats.

data <- data %>% 
  group_by(participant) %>% 
  mutate(all_extra = mean(c(dog_extra, cat_extra), na.rm = TRUE))

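Then use aov(), whose basic form is aov(dv ~ iv, data = name_of_data), and save the results in an object named aov_extra; this will let us continue to use that object for future analyses. One caveat: $condition was coerced to numeric earlier, so aov() treats it as a continuous predictor, which is why it has only 1 degree of freedom in the output below. For a true one-way ANOVA across the three tweet conditions, wrap it as factor(condition); the condition term would then have 2 degrees of freedom.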

# save object
aov_extra <- aov(all_extra ~ condition, data = data)

# get summary of results
summary(aov_extra)
##              Df Sum Sq Mean Sq F value Pr(>F)
## condition     1   0.31  0.3095    0.24  0.625
## Residuals   108 139.01  1.2871
# tip: code below would simultaneously save and get summary
#summary(aov_extra <- aov(all_extra ~ condition, data = data))
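
Because the condition codes are stored as numbers, it is worth seeing how aov() treats a numeric predictor versus a factor. The sketch below uses simulated data (not the study data) with three conditions of ten participants each, and compares the degrees of freedom of the two models.

```r
set.seed(1)

# Simulated data for illustration: 3 conditions x 10 participants
toy <- data.frame(condition = rep(1:3, each = 10),
                  all_extra = rnorm(30, mean = 3.5))

# Numeric predictor: aov() fits a single linear slope (1 df)
aov_numeric <- aov(all_extra ~ condition, data = toy)

# Factor predictor: aov() compares the three group means (2 df)
aov_factor <- aov(all_extra ~ factor(condition), data = toy)

summary(aov_numeric)[[1]]$Df  # 1 28
summary(aov_factor)[[1]]$Df   # 2 27
```

With three groups, a condition term with 2 degrees of freedom is the sign that the model is comparing group means rather than fitting a trend across the condition codes.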

So it doesn't look like there is a main effect of $condition on extraversion, averaged across dog and cat. We can see this is the case by looking at the descriptives across conditions using group_by(condition) and summarize():

data %>% 
  group_by(condition) %>% 
  summarize(extra = mean(all_extra, na.rm = TRUE), 
            extra_sd = sd(all_extra, na.rm = TRUE))
## # A tibble: 3 x 3
##   condition extra extra_sd
##       <dbl> <dbl>    <dbl>
## 1         1  3.76     1.16
## 2         2  3.66     1.23
## 3         3  3.88     1.03

^ Not surprising that the ANOVA was non-significant.

T-test

Now do a paired samples t-test to see if people differed in their perceived extraversion of dogs versus cats. For this, use t.test(), whose basic form is t.test(dv1, dv2, paired = TRUE).

The code below does not do this, but you could always save results to an object (e.g., extra_t) to mess with later.

t.test(data$dog_extra, data$cat_extra, paired = TRUE)
## 
##  Paired t-test
## 
## data:  data$dog_extra and data$cat_extra
## t = 6.5896, df = 109, p-value = 1.618e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.6483757 1.2061698
## sample estimates:
## mean of the differences 
##               0.9272727

Definitely significant. And the group means:

mean(data$dog_extra)
## [1] 3.772727
mean(data$cat_extra)
## [1] 2.845455
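
If you do save the t-test to an object, its pieces can be pulled out with $, much like columns in a data frame. A sketch with simulated scores (the numbers are made up, so the results will not match the output above):

```r
set.seed(42)

# Simulated paired scores for 20 hypothetical participants
dog_scores <- rnorm(20, mean = 3.8)
cat_scores <- rnorm(20, mean = 2.8)

# Save the test instead of just printing it
extra_t <- t.test(dog_scores, cat_scores, paired = TRUE)

names(extra_t)      # everything stored in the result
extra_t$statistic   # t value
extra_t$parameter   # degrees of freedom
extra_t$p.value     # p-value
extra_t$conf.int    # 95% confidence interval of the mean difference
extra_t$estimate    # mean of the differences
```

This is convenient for reporting, for example when you want to drop the exact t and p values into a sentence or a table.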

Simple regression

Now try a simple linear regression. Regress perceived cat openness ($cat_open) onto age; do older/younger people perceive cats as having different openness to experience?

Most regression uses the lm() function, whose basic form is lm(dv ~ iv, data = data). For regression analyses (and most others), you specify the model and can save the results in an object that can be modified/explored further. In this regression example, save the regression into the object cat_open_lm and examine it using summary().

# define model
cat_open_lm <- lm(cat_open ~ age, data = data)

# summarize / get results
summary(cat_open_lm)
## 
## Call:
## lm(formula = cat_open ~ age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.95119 -0.94228  0.06233  1.06110  2.10288 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.975764   0.448674   6.632 1.35e-09 ***
## age         -0.001229   0.013357  -0.092    0.927    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.403 on 108 degrees of freedom
## Multiple R-squared:  7.836e-05,  Adjusted R-squared:  -0.00918 
## F-statistic: 0.008464 on 1 and 108 DF,  p-value: 0.9269

Welp, not significant!

ABSOLUTELY TANGENTIAL THING: Fun exploration of lm objects

As a fun thing, you can "investigate" lm() objects like our cat_open_lm. Firstly, use names() to see the names of everything "inside" it.

names(cat_open_lm)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Cool! So all those things are kind of like "variables" inside the object cat_open_lm, which means we can use the $ operator to "pull them out". For example, let's say I just wanted to get the list of regression coefficients, I could look only at $coefficients inside cat_open_lm.

cat_open_lm$coefficients
##  (Intercept)          age 
##  2.975764014 -0.001228818

Very cool! That gives us a named vector: the labels ("(Intercept)" and "age") are printed above their respective regression coefficients. OKAY but let's say I only wanted the regression coefficient for $age. I can use indexing ([]) to pull it out specifically by its name, "age".

cat_open_lm$coefficients["age"]
##          age 
## -0.001228818
# I could also use the position (2nd element) for the same result
cat_open_lm$coefficients[2]
##          age 
## -0.001228818
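
Two related tools are worth knowing here: coef() returns the same named vector of coefficients, and summary() of the model stores a full coefficient table (estimates, standard errors, t values, p values) as a matrix that can be indexed by row and column name. The sketch below uses simulated data, so the numbers are not those of the study.

```r
set.seed(7)

# Simulated data for illustration
toy <- data.frame(age = c(20, 25, 30, 35, 40, 45, 50, 55))
toy$cat_open <- 3 + 0.01 * toy$age + rnorm(8, sd = 0.5)

toy_lm <- lm(cat_open ~ age, data = toy)

# Named vector of coefficients (same as toy_lm$coefficients)
coef(toy_lm)
coef(toy_lm)["age"]

# Coefficient table from summary(): a matrix with named rows and columns
coef_table <- summary(toy_lm)$coefficients
coef_table
coef_table["age", "Pr(>|t|)"]  # p-value for age
```

Indexing the table by name like this is a simple way to grab a p-value or standard error without copying it from the printed output.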

Plot regression line. We can use R to plot the relationship between age and perceived cat openness. There are fancier ways to do this, but this is a simple (if not the prettiest) way. First, draw a scatterplot of the points using the plot() function; then add the line of best fit to the scatterplot using the abline() function. Notice that, inside abline(), you should see something similar to what we did when using lm() to create cat_open_lm: cat openness regressed onto age, data$cat_open ~ data$age.

plot(x = data$age, 
     y = data$cat_open, 
     xlab = "Age", 
     ylab = "Perceived Cat Openness")

abline(lm(data$cat_open ~ data$age), col = "blue")

Advanced analyses, etc.

As you've no doubt noticed, a more appropriate analysis would be a 3x2 mixed ANOVA with Tweet and Animal as the factors... however, mixed designs require you to restructure your data into "long format", which is something for another day. But you would probably use something like the gather() function in the tidyr package.

Once your data is in that format, I would probably use the ezANOVA() function in the ez package.
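
As a small taste of what "long format" looks like, here is a sketch on made-up data. It uses pivot_longer(), which in newer versions of tidyr (1.0.0 and later, an assumption about your installation) supersedes gather(). Each participant goes from one row with dog_extra and cat_extra columns to two rows, one per animal.

```r
library(tidyr)

# Made-up wide data: one row per participant
toy_wide <- data.frame(participant = 1:3,
                       condition   = c(1, 2, 3),
                       dog_extra   = c(4, 3.5, 5),
                       cat_extra   = c(2, 3, 2.5))

# Split names like "dog_extra" at the underscore:
# the first part becomes the `animal` column,
# the second part (".value") becomes the name of the score column
toy_long <- pivot_longer(toy_wide,
                         cols      = c(dog_extra, cat_extra),
                         names_to  = c("animal", ".value"),
                         names_sep = "_")

toy_long
```

In this shape, animal is an explicit within-subjects column, which is what mixed-design functions generally expect.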

Conclusion

Well... that's it. It was a lot, and not very in-depth, but you should now have a pretty good idea of how to clean/examine/analyze your data in the future!