This page contains extra R content not covered in the demonstrations and could be considered supplementary to the module. This content is useful for completing the advanced exercises from Week 2.
Sometimes, you will want to analyse data, but the variables of interest are in separate data.frames - maybe the data are from different sources, or collected using different programs.
The R package tidyverse
provides some handy functions
for joining data.frames. Join functions are most useful in situations
where different variables from the same participants are stored in
separate data.frames. Here are some of the more useful join
functions:
Function name | |
---|---|
left_join() | Keeps all the data in the first table, and adds columns from the second to rows that match. |
right_join() | Keeps all the data in the second table, and adds columns from the first to rows that match. |
inner_join() | Only returns rows that have a match in both tables. |
full_join() | Retains all information from both tables. If a row does not have a match in one table, it will receive a NA value for the other table’s columns. |
Each of the above functions joins two data.frames together. How they differ is what rows the function decides to retain/exclude. All of the above join functions have the same syntax:
XXX_join(x, y, by = NULL, suffix = c(".x",".y"))
Let’s break down each or the arguments
Argument | Description |
---|---|
x | The first table you want to join, known as the “left” table. |
y | The second table you want to join, known as the “right” table. |
by | The columns (variable names) you wish to join the two tables on (i.e., the variable you are ‘matching’ on). There must be a variable with this name in both tables. |
suffix | If columns have the same name in the two tables, you can add a suffix so you can distinguish between the two. The default for this is “.x” for the left table, and “.y” for the right table. |
Note: if you do not specify any columns to join on, these functions
are smart enough to determine the best variable to match on, but it is
best to get in the habit of always specifying a column using the
by
argument.
full_join()
retains all information from both tables. If
a row does not have a match in the other table, it will receive a NA
value for that table’s columns.
Here is an example of the full_join()
function:
joined.data <- full_join(data.left,data.right,by = "participant_id")
left_join()
is one of the more useful join functions.
This function keeps all data from the first (left) table and joins all
columns that has a matching row in the second (right) table. This is
particularly helpful if there are cases (rows) in the right table that
are not in the left (and is therefore not of interest to us), or if
columns in the right table apply to multiple rows in the left table. By
carefully considering which data.frame is on the left vs. right, in
almost all cases, left_join()
or full_join()
will suit your needs.
Here is an example of the left_join()
function in
action:
joined.data <- left_join(data.left,data.countries,by = "country_of_birth")
What if you need the join function to match on more than one variable? Thankfully, the join functions can accomodate this. Using the ‘by’ argument, rather than specifying one column, you can specify any number by putting the column names in a character vector. Remember, both data.frames need to have all match variables.
In the code below, we match participants based on a
paricipant_id
variable, as well as a
country_of_birth
variable.
joined.data <- left_join(data.1,data.2,by = c("participant_id","country_of_birth"))
How about a situation where both data.frames have variables with the same name, but you aren’t joining on these variables. By taking advantage of the ‘suffix’ argument in the join functions, you can create a way to unambiguously distinguish between the variables from the different data.frames. This argument accepts a character vector with a length of two. The first element of the vector will be pasted on the end of the variables from the left data.frame, while the second element will be pasted on the end of the variables on the right data.frame.
In the code below, we join two data sets from data collected at two
different time points, adding the time1
and
time2
suffix to the respective variables.
joined.data <- left_join(data.time1, data.time2, by = "participant_id",suffix = c(".time1",".time2"))
Up to now, we have exclusively dealt with join functions, which adds variables from one data.frame to another. A totally different way of merging data.frames together is combining rows. This is common when you have two data.frames with the same variables but different rows, e.g. data collected from different participants on the same task. This can be achieved by using the bind_rows() function. This function simply accepts two arguments: the two data.frames you wish to join.
In the example below, we have two data.frames that were taken from two different computers running the same experiment. We wish to combine the data.frames by adding the rows of one data.frame to another so that we can analyse the data altogether.
complete.data <- bind_rows(data_computer1,data_computer2)
A couple of things to note. First, between the data.frames, the number of columns must be the same and they must have the same names (though they don’t necessarily have to be in the same order). Second, if a row is duplicated between the two tables, the row will also be duplicated in the resulting table.
If you would like to practice the skills on this page, weekly exercise questions on this content are available in the advanced exercises for Week 2. You can download the interactive exercises by clicking the link below.