The aim of this tutorial is to provide an introduction to data manipulation in R, primarily using tools from the tidyverse. Given that lots of people are currently moving their data collection procedures online, we will use an output file from Gorilla as an example. However, the tools should readily apply regardless of your data collection software.[1]

In this first part of the tutorial, we will cover the basics of extracting the relevant data from your output files. In Part 2, we will cover some extra tips and tricks for monitoring sample size during online data collection, scaling up the tools to more complex datasets, and re-organising your data flexibly.

Knowledge required

This introduction is aimed at beginners with very little experience of coding in R.

However, it does assume that you can find your way around RStudio (i.e., you are familiar with the script and console, and know how to run one or several lines of code), and that you can navigate your working directory to access your data. A basic understanding of functions and arguments will also help. Some recommended resources on this are:

If you prefer some light interactive activities, you can also try:

What’s this “tidyverse” you speak of?

As with most programming tasks, there are multiple ways of achieving the same thing. Within the R programming language, there are some clusters of approaches (“dialects” or “grammars”, if you like). It’s not essential to acknowledge the difference - mostly you will settle for whatever tools you can get to work! However, when searching for answers to your issues on StackExchange, you will see that many people provide alternative solutions, often referring to Base R, the tidyverse (including the dplyr and tidyr packages), and data.table. This document will focus on data manipulation using the tidyverse.

What are the pros?

  • Intuitive functions - they pretty much do what they say on the tin.
  • Readable and tidy - once you have understood a few key features (e.g., the pipe operator), it’s relatively easy to read an existing script and gauge what’s going on.
  • Efficient - you can carry out your data processing tasks in relatively few lines of code (as we will see here!)
  • Wide usership - there is a huge amount of support available online.
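
To give a flavour of the pipe operator mentioned above, here is a minimal sketch. The data frame `trials` and its columns (`participant`, `condition`, `rt`) are made up for illustration and are not taken from the Gorilla file used in this tutorial. The pipe `%>%` can be read as "and then": take the data, and then filter it, and then group it, and then summarise it.

```r
library(dplyr)

# Hypothetical toy data standing in for a real output file
trials <- data.frame(
  participant = c("p1", "p1", "p2", "p2"),
  condition   = c("practice", "test", "test", "test"),
  rt          = c(650, 480, 510, 530)
)

# tidyverse version: each step is passed on to the next with the pipe
summary_rt <- trials %>%
  filter(condition == "test") %>%      # keep only the test trials
  group_by(participant) %>%            # work participant by participant
  summarise(mean_rt = mean(rt))        # one mean response time per participant

# One possible Base R equivalent, for comparison
base_rt <- aggregate(rt ~ participant,
                     data = trials[trials$condition == "test", ],
                     FUN = mean)
```

Both versions give one mean response time per participant. Which reads more easily is partly a matter of taste, but the piped version lists the steps in the order they happen.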

…and the cons?

  • Coders transferring from different programming languages sometimes prefer base R operations, as they more directly relate to their existing knowledge.
  • Doesn’t handle big data particularly well (the data.table package is better suited to this). But I really do mean big data here (i.e., sourced from rich, naturally occurring datasets); most experimental data should be fine.

For me, learning to use the tidyverse took me from years of copying and pasting random bits of code from the internet, to being able to sit down and write a data processing script from scratch. I hope you’ll find it helpful too!

Ready? Let’s begin!


Thank you to Catia Oliveira, Ruth Lee, Claudia Mazzuca, and Jon Flavell for trialling and providing valuable feedback on this tutorial. Additional feedback would be gratefully received via email to emma.james@york.ac.uk.

  1. If you’re dealing with messy Excel sheets, you might like to check out these handy tips from Sophie Bennett.