Introduction

In Part 1 of the tutorial, we learned how to read in output from Gorilla and extract the data we need. To recap, we used:

• read.csv() to read in the data.
• filter() to keep one row per trial, containing the participant response.
• select() to keep the relevant columns.
• set_names() or rename() to name the columns in a more intuitive way.
• group_by() and summarise() to calculate summary statistics for each participant.

We also used head() and count() to keep checking that our data processing looked sensible.

Now we’ve got to grips with the basics, here are some extra sections on making the most of the tidyverse functions, and scaling up to more complex datasets.

Preparing for this tutorial

If you want to follow along with this tutorial using the data files, you can pick up right where we left off after Part 1.

If you no longer have the dataframes stored in your R session, you may wish to recap Part 1 as a reminder. Alternatively, we can re-load the processed data files that we saved at the end.

trial_data <- read.csv("./output/trial_data.csv")
participant_data <- read.csv("./output/participant_averages.csv")

Make sure you have the tidyverse library loaded!

Dealing with more than one experimental condition

The example used in Part 1 was a very simple one, designed to get you started with data processing using the tidyverse. However, you more than likely have more than one experimental condition in your data.

Merging condition information

You might have been sensible enough to label these somehow in your Gorilla spreadsheet, in which case you can make sure you include the relevant columns in your variable selection, and use it alongside ID in your group_by() function when creating participant means (as we will below). But I did not.

Instead, I have a separate csv file that documents which item came from which condition (as this was the same across participants). If we load this in, you can see that it lists each item and the neighb condition that it belongs to. There were three conditions in this experiment: whether the pseudoword has no, one, or many neighbours in the English language.

items <- read.csv("./story_materials/item-conds.csv")
items
##      item neighb
## 1  parung   none
## 2  tesdar   none
## 3   femod   none
## 4  peflin   none
## 5  vorgal   none
## 6   solly   many
## 7  dester   many
## 8   nusty   many
## 9   mowel   many
## 10 ballow   many
## 11  regby    one
## 12  wabon    one
## 13 tabric    one
## 14 pungus    one
## 15  rafar    one

I can then use the item column to merge it with the items in my trial-level data.

trial_data_conds <- trial_data %>%
left_join(items, by = "item")

Here, we’ve said to take the trial_data dataframe, and join it to the items dataframe using the item column common to both dataframes (i.e., it will match each row based on the content of the item column). Using left_join means that we want to keep all rows in the first dataframe we give to the argument (trial_data, fed into the function via the pipe). The second dataframe (items) will therefore be repeated across all matches.

head(trial_data_conds)
##       ID   item acc      RT neighb
## 1 508739 vorgal   0 14449.2   none
## 2 508739  mowel   1  8980.6   many
## 3 508739 peflin   1  3813.7   none
## 4 508739 dester   0  6945.2   many
## 5 508739  wabon   1 12621.1    one
## 6 508739 ballow   0  6757.5   many

One participant score per condition

Now, when creating our participant means, we specify that we want a summary statistic for each neighb condition. To do this, we just add an extra argument to the group_by() function. We say we want the summarise() function to work on each combination of participant ID and neighbour condition.

participant_data_conds <- trial_data_conds %>%
group_by(ID, neighb) %>%
summarise(meanAcc = mean(acc), meanRT = mean(RT, na.rm = TRUE))

head(participant_data_conds)
## # A tibble: 6 x 4
## # Groups:   ID [2]
##       ID neighb meanAcc meanRT
##    <int> <fct>    <dbl>  <dbl>
## 1 508739 many       0.2  7304.
## 2 508739 none       0.6  5428.
## 3 508739 one        0.6  8486.
## 4 508745 many       0.4  4165.
## 5 508745 none       0.8  4913.
## 6 508745 one        0.6  4084.

This gives a row for each participant for each condition. If we wanted, we could rearrange this dataset to be one row per participant using the pivot_wider() function. This allows us to specify the unique identifier we want for each row (the participant ID), how we want to organise the columns (neighb), and the values that we want in those columns (both meanAcc and meanRT). We will cover the pivot_wider()function in more detail below.

participant_data_conds_w <- participant_data_conds %>%
pivot_wider(id_cols = ID, names_from = neighb, values_from = c(meanRT, meanAcc))

head(participant_data_conds_w)
## # A tibble: 6 x 7
## # Groups:   ID [6]
##       ID meanRT_many meanRT_none meanRT_one meanAcc_many meanAcc_none meanAcc_one
##    <int>       <dbl>       <dbl>      <dbl>        <dbl>        <dbl>       <dbl>
## 1 508739       7304.       5428.      8486.          0.2          0.6         0.6
## 2 508745       4165.       4913.      4084.          0.4          0.8         0.6
## 3 508749       4372.       5894.      4873.          0.6          1           1
## 4 508754       4836.       8059.      5231.          0.6          0.6         0.4
## 5 508757       5752.       6730.      3554.          1            1           1
## 6 508942       3968.       3711.      5163.          0.6          0.6         0.6

Now we have an RT column for participant averages in each neighbour condition (many, none, one), and the same again for accuracy.

Creating new variables using mutate()

Computing across columns

The mutate() function allows us to create a new column based on the content of other columns (much like writing a formula in an Excel spreadsheet, or computing a new variable in SPSS). It always follows the structure new_name = formula, with each set separated with a comma if you are computing more than one column at once.

As an example, let’s take our condition means data from above, with one column per condition. We could now compute the differences in performance between the many and no neighbour conditions, for each participant.

participant_data_conds_w <- participant_data_conds_w %>%
mutate(diffAcc = meanAcc_many - meanAcc_none,
diffRT = meanRT_many - meanRT_none)

Here we have said:

• Take the participant_data_conds_w data frame, AND THEN…
• Create a column called diffAcc that is equal to the value of each participant’s accuracy score for the “many” condition, minus their score for the “none” condition.
• Create a column called diffRT that is equal to the value of each participant’s RT for the “many” condition, minus their RT for the “none” condition.

As before, we reassigned the dataframe back to itself (participant_data_conds_w <-) to save the version with the new column added.

Let’s take a look and check:

participant_data_conds_w %>%
select(-meanAcc_one, -meanRT_one) %>%    # remove one-neighbour condition columns from preview so prints on one line
head()
## # A tibble: 6 x 7
## # Groups:   ID [6]
##       ID meanRT_many meanRT_none meanAcc_many meanAcc_none diffAcc diffRT
##    <int>       <dbl>       <dbl>        <dbl>        <dbl>   <dbl>  <dbl>
## 1 508739       7304.       5428.          0.2          0.6   -0.40  1875.
## 2 508745       4165.       4913.          0.4          0.8   -0.4   -748.
## 3 508749       4372.       5894.          0.6          1     -0.4  -1522.
## 4 508754       4836.       8059.          0.6          0.6    0    -3223.
## 5 508757       5752.       6730.          1            1      0     -978.
## 6 508942       3968.       3711.          0.6          0.6    0      257.

You can see that we now have two extra columns at the end, and that mutate() has performed the operations separately for each row. Again, it’s always a good idea to check a few of these manually to make sure they are as you would expect.

Monitoring participant eligibility based on performance

One of the advantages of scripting your data processing before/early in data collection is that you can keep on top of whether you need to replace participants who perform too poorly (or too well) to be included in your analysis. The mutate() function can help us here too.

For example, in this dataset, chance level performance was 25% (there were four answer options), and we might want to exclude participants who performed below this level as we suspect they were not paying attention during the experiment. Here, we can use mutate() to flag participants who don’t meet the criteria.

participant_data <- participant_data %>%
mutate(eligibility = ifelse(meanAcc > .25, 1, 0))

I have called the new column eligibility, and used an ifelse() function to determine what goes in the column (very much like the “IF” formula you might have come across in Excel). It says, if participant’s mean accuracy (meanAcc) is greater than .25, assign a 1 to the column. Else, assign a 0.

We can then use our new column to quickly see how many eligible participants we have collected so far. We group the data by eligibility, and count the number of participants in each.

participant_data %>%
group_by(eligibility) %>%
count()
## # A tibble: 2 x 2
## # Groups:   eligibility [2]
##   eligibility     n
##         <dbl> <int>
## 1           0     1
## 2           1    51

We can see that one participant will not be eligible for our analyses, and can release another participant slot online to replace them.

Dealing with ineligible data points

Whilst we’re at it, combining mutate() and ifelse() is also very handy for excluding data points we don’t want. For example, in cognitive tasks, we often only care about reaction times for the trials that the participant answered correctly.

So if we go back to our trial-level data, we can create another version of the RT data that only includes the value if the accuracy column showed the trial as correct (1).

trial_data <- trial_data %>%
mutate(accRT = ifelse(acc == 1, RT, NA))

Here we have said:

• Reassign the trial_data dataframe, by taking trial_data, AND THEN…
• Creating a column called accRT, which for each row consists of the following:
• if acc is equal to 1, use the value in the RT column
• else (if acc is NOT equal to 1), use NA to record it as missing data

Let’s check if it worked:

head(trial_data)
##       ID   item acc      RT   accRT
## 1 508739 vorgal   0 14449.2      NA
## 2 508739  mowel   1  8980.6  8980.6
## 3 508739 peflin   1  3813.7  3813.7
## 4 508739 dester   0  6945.2      NA
## 5 508739  wabon   1 12621.1 12621.1
## 6 508739 ballow   0  6757.5      NA

We can see that the accRT column only contains the reaction times for the trials that were answered correctly. Now, when computing our participant averages, we can use the accRT column so that it’s only incorporating RTs for correct answers. Remember to include na.rm = TRUE to ignore missing values!

participant_data <- trial_data %>%
group_by(ID) %>%
summarise(meanAcc = mean(acc),
meanRT = mean(accRT, na.rm = TRUE))

Combining output files using bind_rows() and _join()

Merging output files from the same task

It might be that we have the same task presented across different experiment nodes in Gorilla. If we want to treat all files the same, we can read them in and append them to each other before we carry out the data processing steps. In this dataset, participants were randomised into different counterbalancing conditions at the start of the experiment: different groups of participants learned and were tested on different pseudoword-object pairings. However, the task set up was identical, so we want to piece the different output files back together.

To do this, we can first create a list of the files that we want. Here, I’ve added a second version of the task from a different counterbalancing condition.

files <- c("./story_materials/data_exp_4424-v9_task-y5i7.csv",
"./story_materials/data_exp_4424-v9_task-lwe7.csv")

Or, if you want to read in all of the csv files in a particular folder, you can also create this list more efficiently using the list.files() function. It takes information about the file path (so I’ve pointed it to my folder of Story Materials), and a pattern to match to identify relevant files (I’ve specified I want the “.csv” files only).

allfiles <- list.files(path = "./story_materials/", pattern = ".csv")

However, we’ll stick to our two files for now. We want to read in each file, and bind them together:

raw_data_comb <- lapply(files, read.csv) %>%
bind_rows()

The function lapply is a base R function, which applies the same function across a list of objects. Here, we’ve asked it to apply the read.csv function to the list of files that we created. We then use the pipe operator %>% to feed the output to the bind_rows() function, which sticks it all together. Et voila!

We can use nrow() to check that our new combined version contains more observations than our first one did. You can also see this by checking the number of observations listed for the dataframe in the environment pane.

# Original raw data file
nrow(raw_data)
## [1] 1299
# New combined data file
nrow(raw_data_comb)
## [1] 2537

Note: if you are running this yourself, R might spit some red warnings at you to tell you that it’s coerced one of your variables into a character type. This looks scary, but there’s no need to panic! This is simply letting you know how it’s interpreted the information, and can be a possible thing to investigate if your output isn’t as you expect.

Merging participant data from different tasks

Often, we will have collected data about our participants across different tasks, and will want to merge that information together. This could be anything from the basic demographic information that we collected at the start of the experiment to their performance on a different task.

For an example here, we will use our original set of participants above (participant_data), and match up their performance the next day (day2_data). I’ve processed this file in the same way that we did in Part 1 to produce day2 scores.

day2_data <- read.csv("./story_materials/data_exp_4424-v9_task-z5kw.csv") %>%
filter(Zone.Type == "response_button_text") %>%
set_names(c("ID", "item", "acc", "RT")) %>%
group_by(ID) %>%
summarise(day2Acc = mean(acc), day2RT = mean(RT, na.rm = TRUE))

head(day2_data)
## # A tibble: 6 x 3
##       ID day2Acc day2RT
##    <int>   <dbl>  <dbl>
## 1 508739   0.6    3573.
## 2 508745   0.667  3408.
## 3 508749   0.667  4356.
## 4 508754   0.667  4634.
## 5 508757   0.733  4090.
## 6 508942   0.667  4723.

The aim here is to merge this day2_data with the participant_data dataset we made earlier, such that we have all information for one participant on a single row.

full_data <- participant_data %>%
full_join(day2_data, by = "ID")

In the code above, we have created a new dataframe (full_data) by taking participant_data and completing a full join with the day2_data. By using a full_join() (rather than the left_join() we used above), all rows from each dataset will be kept, even if they don’t have a match in the other. This means that participants missing information on a particular task won’t just disappear from your dataset.

We can check that our merge has been successful by previewing the dataset:

head(full_data)
## # A tibble: 6 x 5
##       ID meanAcc meanRT day2Acc day2RT
##    <int>   <dbl>  <dbl>   <dbl>  <dbl>
## 1 508739   0.467  6747.   0.6    3573.
## 2 508745   0.6    4352.   0.667  3408.
## 3 508749   0.867  5006.   0.667  4356.
## 4 508754   0.533  5130.   0.667  4634.
## 5 508757   1      5345.   0.733  4090.
## 6 508942   0.6    4413.   0.667  4723.

It’s also a good idea to check that you haven’t lost (or gained!) participants by accident. You can do this by looking at the number of observations documented for the dataframes in the Environment Pane (e.g., “52 obs.”), or by printing out the number of rows for each version as we did above.

nrow(participant_data)
## [1] 52
nrow(day2_data)
## [1] 52
nrow(full_data)
## [1] 52

We can see that here they are the same, as all participants completed both the first and second activity, and the merge has been successful in merging these together. If the numbers mismatch in your own dataset, you will want to think about why—did some participants have missing data across the tasks? Have any rows not successfully merged (e.g., due to an error in the participant ID number)?

Reshaping your dataset using pivot_()

Finally, it’s useful to know how to rearrange your data so that it can be used for different kinds of analysis. Many statistical functions in R like “long format” data—one trial per participant per row, as in our trial_data above. But what if you want to inspect patterns of performance across items? “Wide format” data—all data for one participant on a single row—may be helpful for saving out your data in a concise way, or for certain types of analysis (e.g., PCA).

The tidyverse pivot_wider() and pivot_longer() functions are very helpful here. You might also see reference to spread() and gather() if you look for help on the internet—these work very similarly, but will eventually be retired and replaced by the pivot functions.

From long to wide

For example, in the code below, we take our trial_data (using the accuracy information only, the - in the select() function allows us to drop the RT information). We then tell pivot_wider() that it should take column names from the current item column, and use the values in the acc column to fill them.

participant_items <- trial_data %>%
select(-RT, -accRT) %>%
pivot_wider(names_from = item, values_from = acc)

We can check that it worked by previewing the dataset, and by checking that we now have one row per participant.

head(participant_items)
## # A tibble: 6 x 16
##       ID vorgal mowel peflin dester wabon ballow pungus tesdar rafar femod parung tabric regby solly nusty
##    <int>  <int> <int>  <int>  <int> <int>  <int>  <int>  <int> <int> <int>  <int>  <int> <int> <int> <int>
## 1 508739      0     1      1      0     1      0      0      1     0     0      1      1     1     0     0
## 2 508745      1     0      1      0     1      0      0      1     0     1      0      1     1     1     1
## 3 508749      1     1      1      0     1      0      1      1     1     1      1      1     1     1     1
## 4 508754      0     1      1      1     1      0      0      1     0     0      1      0     1     0     1
## 5 508757      1     1      1      1     1      1      1      1     1     1      1      1     1     1     1
## 6 508942      1     1      1      0     1      0      0      0     0     0      1      1     1     1     1
nrow(participant_items)
## [1] 52

From wide to long

And whilst we’re at it, what if we want to go from the “wide” dataset back to “long” again? We tell pivot_longer() to use all columns apart from the ID column (cols = -ID); the column names should go back to an item column, and the values should go in an acc column.

items_back <- participant_items %>%
pivot_longer(cols = -ID, names_to = "item", values_to = "acc")

head(items_back)
## # A tibble: 6 x 3
##       ID item     acc
##    <int> <chr>  <int>
## 1 508739 vorgal     0
## 2 508739 mowel      1
## 3 508739 peflin     1
## 4 508739 dester     0
## 5 508739 wabon      1
## 6 508739 ballow     0

We can see now that the number of rows in the returned version match our original trial level data:

nrow(items_back)
## [1] 780
nrow(trial_data)
## [1] 780

Summary

I hope you have seen that the tidyverse tools can be helpful in efficiently processing new datasets. In the initial example, we used the following functions to process our data:

• read.csv() to read in the data.
• filter() to keep one row per trial, containing the participant response.
• select() to keep the relevant columns.
• set_names() or rename() to name the columns in a more intuitive way.
• group_by() and summarise() to calculate summary statistics for each participant. These are also useful when producing descriptive statistics across the whole sample.

Beyond this, we also learned a few other helpful tools for working with the data and applying your knowledge to more complex datasets:

• mutate() to create new variables based on the content of other columns. We also combined this with ifelse() to create variables based on certain conditions.
• left_join() and full_join() to join dataframes together horizontally (on the basis of matching row IDs).
• bind_rows() to “stack” dataframes on top of each other, adding extra rows (on the basis of matching columns).
• pivot_wider() and pivot_longer() to rearrange data between long and wide formats.

We’ve also used head(), count(), and nrow() throughout to keep checking that our data processing looked sensible.

Just as we saw at the end of Part 1, using the pipe operator means that you can easily incorporate these steps into your data processing—feeding from one step to the next. All in one tidy and efficient block of code!

How to find help

If you get stuck with the material here, I encourage you to email me emma.james@york.ac.uk—I would appreciate the feedback in helping to make this material clearer and more accessible.

More broadly speaking, there are some key sources of help when you get stuck with your R tasks.

1. If you can’t remember what you need to put in a particular function (or want to know what other options might be available), you can type help(function) in the console to bring up its documentation (replacing function with the function name, e.g., help(left_join).
2. There are some excellent cheatsheets available for quick reference. For example, the dplyr cheatsheet can be found here.
3. Google is your friend. It’s highly likely that multiple people will have asked very similar questions to you on platforms such as StackExchange - frequent googling is very normal, even for professional programmers!

On that last note, I wish to provide a final reassurance to embrace the errors and the warnings. Sometimes they are warnings that you can choose to ignore (as we did above when binding the dataframes), but they are useful to investigate if things aren’t turning out as you expect. These are readily copy and pasted into Google, where you can find more information to understand it better. It doesn’t mean that you’ve broken R!