In Part 1 of the tutorial, we learned how to read in output from Gorilla and extract the data we need. To recap, we used:

- read.csv() to read in the data.
- filter() to keep one row per trial, containing the participant response.
- select() to keep the relevant columns.
- set_names() or rename() to name the columns in a more intuitive way.
- group_by() and summarise() to calculate summary statistics for each participant.

We also used head() and count() to keep checking that our data processing looked sensible.
Now that we’ve got to grips with the basics, here are some extra sections on making the most of the tidyverse functions and scaling up to more complex datasets.
If you want to follow along with this tutorial using the data files, you can pick up right where we left off after Part 1.
If you no longer have the dataframes stored in your R session, you may wish to recap Part 1 as a reminder. Alternatively, we can re-load the processed data files that we saved at the end of Part 1.
trial_data <- read.csv("./output/trial_data.csv")
participant_data <- read.csv("./output/participant_averages.csv")
Make sure you have the tidyverse library loaded!
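If it’s not already loaded, run:

library(tidyverse)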
The example used in Part 1 was a very simple one, designed to get you started with data processing using the tidyverse. However, you more than likely have more than one experimental condition in your own data.
You might have been sensible enough to label these somehow in your Gorilla spreadsheet, in which case you can make sure you include the relevant columns in your variable selection, and use them alongside ID in your group_by() function when creating participant means (as we will below). But I did not.
Instead, I have a separate csv file that documents which item came from which condition (as this was the same across participants). If we load this in, you can see that it lists each item and the neighb condition that it belongs to. There were three conditions in this experiment, based on whether the pseudoword has no, one, or many neighbours in the English language.
items <- read.csv("./story_materials/item-conds.csv")
items
## item neighb
## 1 parung none
## 2 tesdar none
## 3 femod none
## 4 peflin none
## 5 vorgal none
## 6 solly many
## 7 dester many
## 8 nusty many
## 9 mowel many
## 10 ballow many
## 11 regby one
## 12 wabon one
## 13 tabric one
## 14 pungus one
## 15 rafar one
I can then use the item column to merge it with the items in my trial-level data.
trial_data_conds <- trial_data %>%
left_join(items, by = "item")
Here, we’ve said to take the trial_data dataframe, and join it to the items dataframe using the item column common to both dataframes (i.e., it will match each row based on the content of the item column). Using left_join() means that we want to keep all rows in the first dataframe we supply (trial_data, fed into the function via the pipe). The second dataframe (items) will therefore be repeated across all matches.
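As a quick sanity check (an extra step, not part of the original pipeline), any trial whose item had no match in items would end up with an NA in the new neighb column, which we can count explicitly:

# Count trials whose item failed to match a condition (here we expect 0)
trial_data_conds %>%
  filter(is.na(neighb)) %>%
  nrow()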
head(trial_data_conds)
## ID item acc RT neighb
## 1 508739 vorgal 0 14449.2 none
## 2 508739 mowel 1 8980.6 many
## 3 508739 peflin 1 3813.7 none
## 4 508739 dester 0 6945.2 many
## 5 508739 wabon 1 12621.1 one
## 6 508739 ballow 0 6757.5 many
Now, when creating our participant means, we specify that we want a summary statistic for each neighb condition. To do this, we just add an extra argument to the group_by() function. We say we want the summarise() function to work on each combination of participant ID and neighbour condition.
participant_data_conds <- trial_data_conds %>%
group_by(ID, neighb) %>%
summarise(meanAcc = mean(acc), meanRT = mean(RT, na.rm = TRUE))
head(participant_data_conds)
## # A tibble: 6 x 4
## # Groups: ID [2]
## ID neighb meanAcc meanRT
## <int> <fct> <dbl> <dbl>
## 1 508739 many 0.2 7304.
## 2 508739 none 0.6 5428.
## 3 508739 one 0.6 8486.
## 4 508745 many 0.4 4165.
## 5 508745 none 0.8 4913.
## 6 508745 one 0.6 4084.
This gives a row for each participant for each condition. If we wanted, we could rearrange this dataset to be one row per participant using the pivot_wider() function. This allows us to specify the unique identifier we want for each row (the participant ID), how we want to organise the columns (neighb), and the values that we want in those columns (both meanAcc and meanRT). We will cover the pivot_wider() function in more detail below.
participant_data_conds_w <- participant_data_conds %>%
pivot_wider(id_cols = ID, names_from = neighb, values_from = c(meanRT, meanAcc))
head(participant_data_conds_w)
## # A tibble: 6 x 7
## # Groups: ID [6]
## ID meanRT_many meanRT_none meanRT_one meanAcc_many meanAcc_none meanAcc_one
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 508739 7304. 5428. 8486. 0.2 0.6 0.6
## 2 508745 4165. 4913. 4084. 0.4 0.8 0.6
## 3 508749 4372. 5894. 4873. 0.6 1 1
## 4 508754 4836. 8059. 5231. 0.6 0.6 0.4
## 5 508757 5752. 6730. 3554. 1 1 1
## 6 508942 3968. 3711. 5163. 0.6 0.6 0.6
Now we have an RT column for participant averages in each neighbour condition (many, none, one), and the same again for accuracy.
mutate()
The mutate() function allows us to create a new column based on the content of other columns (much like writing a formula in an Excel spreadsheet, or computing a new variable in SPSS). It always follows the structure new_name = formula, with each pair separated by a comma if you are computing more than one column at once.
As an example, let’s take our condition means data from above, with one column per condition. We could now compute the differences in performance between the many and no neighbour conditions, for each participant.
participant_data_conds_w <- participant_data_conds_w %>%
mutate(diffAcc = meanAcc_many - meanAcc_none,
diffRT = meanRT_many - meanRT_none)
Here we have said:

- Take the participant_data_conds_w data frame, AND THEN…
- Create a new column diffAcc that is equal to the value of each participant’s accuracy score for the “many” condition, minus their score for the “none” condition.
- Create a new column diffRT that is equal to the value of each participant’s RT for the “many” condition, minus their RT for the “none” condition.

As before, we reassigned the dataframe back to itself (participant_data_conds_w <-) to save the version with the new columns added.
Let’s take a look and check:
participant_data_conds_w %>%
select(-meanAcc_one, -meanRT_one) %>% # drop the one-neighbour condition columns so the preview prints on one line
head()
## # A tibble: 6 x 7
## # Groups: ID [6]
## ID meanRT_many meanRT_none meanAcc_many meanAcc_none diffAcc diffRT
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 508739 7304. 5428. 0.2 0.6 -0.40 1875.
## 2 508745 4165. 4913. 0.4 0.8 -0.4 -748.
## 3 508749 4372. 5894. 0.6 1 -0.4 -1522.
## 4 508754 4836. 8059. 0.6 0.6 0 -3223.
## 5 508757 5752. 6730. 1 1 0 -978.
## 6 508942 3968. 3711. 0.6 0.6 0 257.
You can see that we now have two extra columns at the end, and that mutate() has performed the operations separately for each row. Again, it’s always a good idea to check a few of these manually to make sure they are as you would expect.
One of the advantages of scripting your data processing before/early in data collection is that you can keep on top of whether you need to replace participants who perform too poorly (or too well) to be included in your analysis. The mutate() function can help us here too.
For example, in this dataset, chance-level performance was 25% (there were four answer options), and we might want to exclude participants who performed below this level as we suspect they were not paying attention during the experiment. Here, we can use mutate() to flag participants who don’t meet the criteria.
participant_data <- participant_data %>%
mutate(eligibility = ifelse(meanAcc > .25, 1, 0))
I have called the new column eligibility, and used an ifelse() function to determine what goes in the column (very much like the “IF” formula you might have come across in Excel). It says: if the participant’s mean accuracy (meanAcc) is greater than .25, assign a 1 to the column; else, assign a 0.
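If you ever need more than two outcomes, dplyr’s case_when() extends the same idea. Here is a minimal sketch; the accBand column name and the cut-offs are made up for illustration:

participant_data %>%
  mutate(accBand = case_when(
    meanAcc <= .25 ~ "at or below chance",  # hypothetical band labels
    meanAcc < .75 ~ "moderate",
    TRUE ~ "high"  # everything else
  ))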
We can then use our new column to quickly see how many eligible participants we have collected so far. We group the data by eligibility, and count the number of participants in each.
participant_data %>%
group_by(eligibility) %>%
count()
## # A tibble: 2 x 2
## # Groups: eligibility [2]
## eligibility n
## <dbl> <int>
## 1 0 1
## 2 1 51
We can see that one participant will not be eligible for our analyses, and can release another participant slot online to replace them.
Whilst we’re at it, combining mutate() and ifelse() is also very handy for excluding data points we don’t want. For example, in cognitive tasks, we often only care about reaction times for the trials that the participant answered correctly.
So if we go back to our trial-level data, we can create another version of the RT data that only includes the value if the accuracy column showed the trial as correct (1).
trial_data <- trial_data %>%
mutate(accRT = ifelse(acc == 1, RT, NA))
Here we have said:

- Overwrite the trial_data dataframe, by taking trial_data, AND THEN…
- Create a new column accRT, which for each row consists of the following:
  - Where acc is equal to 1, use the value in the RT column.
  - Otherwise (where acc is NOT equal to 1), use NA to record it as missing data.

Let’s check if it worked:
head(trial_data)
## ID item acc RT accRT
## 1 508739 vorgal 0 14449.2 NA
## 2 508739 mowel 1 8980.6 8980.6
## 3 508739 peflin 1 3813.7 3813.7
## 4 508739 dester 0 6945.2 NA
## 5 508739 wabon 1 12621.1 12621.1
## 6 508739 ballow 0 6757.5 NA
We can see that the accRT column only contains the reaction times for the trials that were answered correctly. Now, when computing our participant averages, we can use the accRT column so that it only incorporates RTs for correct answers. Remember to include na.rm = TRUE to ignore missing values!
participant_data <- trial_data %>%
group_by(ID) %>%
summarise(meanAcc = mean(acc),
meanRT = mean(accRT, na.rm = TRUE))
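As ever, a quick preview helps confirm the recomputed averages look sensible (output not shown here):

# Check the new averages, now based on correct-trial RTs only
head(participant_data)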
bind_rows() and _join()
It might be that we have the same task presented across different experiment nodes in Gorilla. If we want to treat all files the same, we can read them in and append them to each other before we carry out the data processing steps. In this dataset, participants were randomised into different counterbalancing conditions at the start of the experiment: different groups of participants learned and were tested on different pseudoword-object pairings. However, the task set-up was identical, so we want to piece the different output files back together.
To do this, we can first create a list of the files that we want. Here, I’ve added a second version of the task from a different counterbalancing condition.
files <- c("./story_materials/data_exp_4424-v9_task-y5i7.csv",
"./story_materials/data_exp_4424-v9_task-lwe7.csv")
Or, if you want to read in all of the csv files in a particular folder, you can also create this list more efficiently using the list.files() function. It takes information about the file path (so I’ve pointed it to my folder of Story Materials), and a pattern to match to identify relevant files (I’ve specified that I want the “.csv” files only).
allfiles <- list.files(path = "./story_materials/", pattern = ".csv")
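One caveat worth flagging (my addition, not part of the original walkthrough): list.files() returns bare file names by default, so if you plan to feed the list straight into read.csv(), set full.names = TRUE to keep the folder path attached:

# Collect full paths; the $ anchors the pattern to the file extension
allfiles <- list.files(path = "./story_materials/", pattern = "\\.csv$", full.names = TRUE)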
However, we’ll stick to our two files for now. We want to read in each file, and bind them together:
raw_data_comb <- lapply(files, read.csv) %>%
bind_rows()
The function lapply() is a base R function, which applies the same function across a list of objects. Here, we’ve asked it to apply the read.csv() function to the list of files that we created. We then use the pipe operator %>% to feed the output to the bind_rows() function, which sticks it all together. Et voilà!
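If you prefer to stay within the tidyverse, the purrr package (loaded as part of the tidyverse) offers a one-step equivalent; a sketch of the same operation:

# Read each file and row-bind the results in a single call
raw_data_comb <- purrr::map_dfr(files, read.csv)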
We can use nrow() to check that our new combined version contains more observations than our first one did. You can also see this by checking the number of observations listed for the dataframe in the environment pane.
# Original raw data file
raw_data <- read.csv("./story_materials/data_exp_4424-v9_task-y5i7.csv")
nrow(raw_data)
## [1] 1299
# New combined data file
nrow(raw_data_comb)
## [1] 2537
Note: if you are running this yourself, R might spit some red warnings at you to tell you that it’s coerced one of your variables into a character type. This looks scary, but there’s no need to panic! It is simply letting you know how it has interpreted the information, and is worth investigating if your output isn’t as you expect.
Often, we will have collected data about our participants across different tasks, and will want to merge that information together. This could be anything from the basic demographic information that we collected at the start of the experiment to their performance on a different task.
For an example here, we will use our original set of participants above (participant_data), and match them up with their performance on the next day’s task (day2_data). I’ve processed this file in the same way that we did in Part 1 to produce day 2 scores.
day2_data <- read.csv("./story_materials/data_exp_4424-v9_task-z5kw.csv") %>%
filter(Zone.Type == "response_button_text") %>%
select(Participant.Private.ID, ANSWER, Correct, Reaction.Time) %>%
set_names(c("ID", "item", "acc", "RT")) %>%
group_by(ID) %>%
summarise(day2Acc = mean(acc), day2RT = mean(RT, na.rm = TRUE))
head(day2_data)
## # A tibble: 6 x 3
## ID day2Acc day2RT
## <int> <dbl> <dbl>
## 1 508739 0.6 3573.
## 2 508745 0.667 3408.
## 3 508749 0.667 4356.
## 4 508754 0.667 4634.
## 5 508757 0.733 4090.
## 6 508942 0.667 4723.
The aim here is to merge this day2_data with the participant_data dataset we made earlier, such that we have all information for one participant on a single row.
full_data <- participant_data %>%
full_join(day2_data, by = "ID")
In the code above, we have created a new dataframe (full_data) by taking participant_data and completing a full join with day2_data. By using a full_join() (rather than the left_join() we used above), all rows from each dataset will be kept, even if they don’t have a match in the other. This means that participants missing information on a particular task won’t just disappear from your dataset.
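For contrast, a left_join() here would keep every row of participant_data but silently drop anyone who only appears in day2_data (in this dataset the two sets of IDs happen to match, so the result would be the same):

# Keeps all day 1 participants; unmatched day 2 rows would be dropped
left_data <- participant_data %>%
  left_join(day2_data, by = "ID")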
We can check that our merge has been successful by previewing the dataset:
head(full_data)
## # A tibble: 6 x 5
## ID meanAcc meanRT day2Acc day2RT
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 508739 0.467 6747. 0.6 3573.
## 2 508745 0.6 4352. 0.667 3408.
## 3 508749 0.867 5006. 0.667 4356.
## 4 508754 0.533 5130. 0.667 4634.
## 5 508757 1 5345. 0.733 4090.
## 6 508942 0.6 4413. 0.667 4723.
It’s also a good idea to check that you haven’t lost (or gained!) participants by accident. You can do this by looking at the number of observations documented for the dataframes in the Environment Pane (e.g., “52 obs.”), or by printing out the number of rows for each version as we did above.
nrow(participant_data)
## [1] 52
nrow(day2_data)
## [1] 52
nrow(full_data)
## [1] 52
We can see that here they are the same, as all participants completed both the first and second activity, and the merge has been successful in merging these together. If the numbers mismatch in your own dataset, you will want to think about why—did some participants have missing data across the tasks? Have any rows not successfully merged (e.g., due to an error in the participant ID number)?
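One way to investigate such mismatches (an extra step, not in the original pipeline) is dplyr’s anti_join(), which returns the rows of one dataframe that have no match in the other:

# Participants in the day 1 data with no day 2 match (here: none)
participant_data %>%
  anti_join(day2_data, by = "ID")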
pivot_()
Finally, it’s useful to know how to rearrange your data so that it can be used for different kinds of analysis. Many statistical functions in R like “long format” data—one trial per participant per row, as in our trial_data above. But what if you want to inspect patterns of performance across items? “Wide format” data—all data for one participant on a single row—may be helpful for saving out your data in a concise way, or for certain types of analysis (e.g., PCA).
The tidyverse pivot_wider() and pivot_longer() functions are very helpful here. You might also see reference to spread() and gather() if you look for help on the internet—these work very similarly, but will eventually be retired and replaced by the pivot functions.
For example, in the code below, we take our trial_data (using the accuracy information only; the - in the select() function allows us to drop the RT information). We then tell pivot_wider() that it should take column names from the current item column, and use the values in the acc column to fill them.
participant_items <- trial_data %>%
select(-RT, -accRT) %>%
pivot_wider(names_from = item, values_from = acc)
We can check that it worked by previewing the dataset, and by checking that we now have one row per participant.
head(participant_items)
## # A tibble: 6 x 16
## ID vorgal mowel peflin dester wabon ballow pungus tesdar rafar femod parung tabric regby solly nusty
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 508739 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0
## 2 508745 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1
## 3 508749 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1
## 4 508754 0 1 1 1 1 0 0 1 0 0 1 0 1 0 1
## 5 508757 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 6 508942 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1
nrow(participant_items)
## [1] 52
And whilst we’re at it, what if we want to go from the “wide” dataset back to “long” again? We tell pivot_longer() to use all columns apart from the ID column (cols = -ID); the column names should go back to an item column, and the values should go in an acc column.
items_back <- participant_items %>%
pivot_longer(cols = -ID, names_to = "item", values_to = "acc")
head(items_back)
## # A tibble: 6 x 3
## ID item acc
## <int> <chr> <int>
## 1 508739 vorgal 0
## 2 508739 mowel 1
## 3 508739 peflin 1
## 4 508739 dester 0
## 5 508739 wabon 1
## 6 508739 ballow 0
We can see now that the number of rows in the returned version matches our original trial-level data:
nrow(items_back)
## [1] 780
nrow(trial_data)
## [1] 780
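If you want a stricter check than row counts (again, an extra step on my part), anti_join() with a composite key can confirm that every ID and item pairing made the round trip:

# Expect zero rows: every ID/item pair in items_back exists in trial_data
items_back %>%
  anti_join(trial_data, by = c("ID", "item")) %>%
  nrow()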
I hope you have seen that the tidyverse tools can be helpful in efficiently processing new datasets. In the initial example, we used the following functions to process our data:

- read.csv() to read in the data.
- filter() to keep one row per trial, containing the participant response.
- select() to keep the relevant columns.
- set_names() or rename() to name the columns in a more intuitive way.
- group_by() and summarise() to calculate summary statistics for each participant. These are also useful when producing descriptive statistics across the whole sample.

Beyond this, we also learned a few other helpful tools for working with the data and applying your knowledge to more complex datasets:

- mutate() to create new variables based on the content of other columns. We also combined this with ifelse() to create variables based on certain conditions.
- left_join() and full_join() to join dataframes together horizontally (on the basis of matching row IDs).
- bind_rows() to “stack” dataframes on top of each other, adding extra rows (on the basis of matching columns).
- pivot_wider() and pivot_longer() to rearrange data between long and wide formats.

We’ve also used head(), count(), and nrow() throughout to keep checking that our data processing looked sensible.
Just as we saw at the end of Part 1, using the pipe operator means that you can easily incorporate these steps into your data processing—feeding from one step to the next. All in one tidy and efficient block of code!
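To illustrate (a sketch assembled from the steps covered above, rather than a block that appears earlier in the tutorial), the whole journey from raw Gorilla output to per-condition participant means could run as one pipeline:

# One pipeline from raw Gorilla output to per-condition participant means
participant_summary <- lapply(files, read.csv) %>%
  bind_rows() %>%
  filter(Zone.Type == "response_button_text") %>%
  select(Participant.Private.ID, ANSWER, Correct, Reaction.Time) %>%
  set_names(c("ID", "item", "acc", "RT")) %>%
  left_join(items, by = "item") %>%                # add condition labels
  mutate(accRT = ifelse(acc == 1, RT, NA)) %>%     # keep RTs for correct trials only
  group_by(ID, neighb) %>%
  summarise(meanAcc = mean(acc), meanRT = mean(accRT, na.rm = TRUE))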
If you get stuck with the material here, I encourage you to email me emma.james@york.ac.uk—I would appreciate the feedback in helping to make this material clearer and more accessible.
More broadly speaking, there are some key sources of help when you get stuck with your R tasks. For any function, you can type help(function) in the console to bring up its documentation (replacing function with the function name, e.g., help(left_join)).
Finally, I wish to provide a reassurance: embrace the errors and the warnings. Sometimes they are warnings that you can choose to ignore (as we did above when binding the dataframes), but they are useful to investigate if things aren’t turning out as you expect. Error messages are readily copied and pasted into Google, where you can find more information to understand them better. It doesn’t mean that you’ve broken R!