Overview - Data Cleaning
Data cleaning is the process of getting the data ready for analysis. It usually involves selecting the variables you will use in the analysis, filtering out participants who don’t meet certain criteria, or computing composite variables that will be used in analyses. However, it can also include more complex operations, such as reshaping the data or joining data from different data files.
In psychology, there are three key principles you want to follow when cleaning a data.frame (a small sketch of such a data.frame follows the list below):
- Make sure each participant is on a separate row.
- Make sure each variable (or measurement) is in a separate column.
- Make sure each cell has one and only one value.
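To make these principles concrete, here is a minimal sketch of a data.frame that follows all three (the variable names and values are made up for illustration):
# Each participant is on one row, each variable is in one column,
# and each cell holds a single value
tidy.example <- data.frame(
  participant = c(1, 2, 3),
  age = c(21, 34, 28),
  score = c(5, 7, 6)
)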
Recap: Tidyverse
Recall from last week that the tidyverse package provides handy functions for data cleaning and manipulation. These functions work like all other functions: they take a number of arguments, such as a data.frame and instructions for what to do with it, and then return an output, which is often another data.frame. In this workbook, we will cover three such functions that are commonly used for data cleaning.
Make sure to load the tidyverse package before starting this workbook.
library(tidyverse)
select()
Use the select() function to extract columns from a table. Often when loading a dataset, there will be columns/variables that are not relevant to the analysis you want to conduct. This function is helpful when a dataset contains many columns and you only want to keep the variables relevant to your analysis.
For example, the class dataset used in these workbooks includes a large number of variables; however, we will only be working with a few of them in any given week. Therefore, we may want to select only those variables so that the dataset is easier to work with. In the code below, we select only the columns that we are using this week.
data.select <- select(data,student.no,program,stats.exp,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7)
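Typing out many similarly named columns can get tedious. As an aside, tidyverse selection helpers such as starts_with() can shorten the call; the sketch below selects the same columns, assuming the seven items are the only columns whose names begin with "pathogen":
# starts_with("pathogen") grabs every column whose name begins with "pathogen"
data.select <- select(data, student.no, program, stats.exp, starts_with("pathogen"))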
filter()
Similar to how the select() function selects certain columns, you can use the filter() function to select certain rows from a table. This is useful if you need to exclude participants from your analysis for some reason.
To use the filter() function, you will need to specify a condition (or test). If a row passes that test, it is kept; if a row fails that test, it is excluded.
For example, let’s say that we are only interested in responses from participants completing the Health Psychology MSc. We can use the filter() function to select only these participants.
data.filter1 <- filter(data.select,program == "Health MSc")
What if we were interested in the reverse? Here is the code to select responses NOT from Health MSc participants.
data.filter2 <- filter(data.select,program != "Health MSc")
Operators
This is a good time to talk about logical operators. Logical operators are expressions that, when evaluated, return either a TRUE or a FALSE value.
You have already been exposed to two logical operators in the code above (== and !=), but here are a few more:
Operator | Description
---|---
== | Exactly equal to
!= | Not equal to
< | Less than
> | Greater than
<= | Less than or equal to
>= | Greater than or equal to
These are some of the basic ones, but there are more. For a full list of operators, click here.
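One extra operator that is handy when filtering is %in%, which checks whether each value on the left appears anywhere in the vector on the right:
# %in% returns TRUE wherever the left-hand value is found in the right-hand vector
c(1,2,3,4,5) %in% c(2,5)
## [1] FALSE TRUE FALSE FALSE TRUE
This is useful in filter() when you want to keep rows that match any one of several values (e.g., participants from more than one programme).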
Notice that some of these operators are mostly used with numerical values. For example, in the following numerical vector, we can test whether each number is less than 3.
c(1,2,3,4,5) < 3
## [1] TRUE TRUE FALSE FALSE FALSE
What we find is that the first two values are less than 3, so they return a TRUE value, while the rest are not less than 3 (including 3 itself!) and therefore return a FALSE value.
This can be used to filter participants based on a numerical threshold. For example, in the class dataset, if we were only interested in participants with a lot of experience in statistics, we can use the greater-than-or-equal-to operator (>=) to select participants who rated this item a 7, 8, or 9.
data.filter3 <- filter(data.select,stats.exp >= 7)
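You can also combine conditions within a single filter() call. Here is a minimal sketch that keeps Health MSc participants who also rated their statistics experience 7 or above (the object name data.filter4 is just for illustration):
# & keeps a row only if BOTH conditions are TRUE; use | for rows where EITHER is TRUE
data.filter4 <- filter(data.select, program == "Health MSc" & stats.exp >= 7)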
Removing Participants with Missing Data
A common task when cleaning data is to remove participants that have missing data on a variable. As covered last week, by default R codes missing data as NA. Also covered last week, there is a function in R that tells you whether a value is missing - that function is is.na(). We can use this function within the filter() function to remove participants with missing data on a variable from a dataset.
data_remove.missing <- filter(data,!is.na(var1))
In the code above, the ! can be translated to “not”. So here we are only including participants that do NOT have missing data for var1.
Note: it is possible that your data could code missing data as a different value (e.g., a common practice is to code missing values as ‘-99’) - in these instances you would want to remove participants with this value, rather than use is.na().
Also note: another useful function to remove participants with missing data is na.omit(). This function removes all participants that have a missing value on any of the variables in the dataset. However, this function must be used cautiously; if participants are missing data on a variable not important to the analysis, then they will also be removed when they could have been kept in the dataset!
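A tidyverse alternative worth knowing is drop_na() from the tidyr package (loaded with the tidyverse). Listing variables restricts the removal to just those columns, which avoids the problem described above. A minimal sketch:
# Remove only the rows with a missing value in var1; other columns are ignored
data_remove.missing <- drop_na(data, var1)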
mutate()
We can use the mutate() function to calculate a new variable (column) from existing variables. mutate() applies a vectorised function, which means operations are conducted separately for each row. This is useful for making aggregate scores (e.g., summing all items in a scale).
In the example below, we wish to calculate a total score of pathogen disgust from 7 separate items (Click here for further details on the disgust scale). We can add the responses to each item of the pathogen disgust scale to compute each participant’s pathogen disgust score. Let’s do it on the dataset with only Health MSc participants:
data.mutate <- mutate(data.filter1,
pathogen.disgust = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)
Let me draw your attention to a few things in the code above. First, like all tidyverse functions, the first argument of the mutate() function is the data.frame we want to calculate the new variable in. Second, we specify the name of the new variable, followed by how that variable will be calculated. This syntax is very similar to that used by the summarise() function last week.
This is the data.frame we created using the mutate function. If you scroll all the way to the right, you will see there is a new column…
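As a side note, the calculation inside mutate() can be any arithmetic expression. If you wanted the mean of the items rather than the total, you could divide the sum by the number of items; a minimal sketch (the names data.mutate.mean and pathogen.disgust.mean are just for illustration):
# Average the seven items by dividing the sum by 7
data.mutate.mean <- mutate(data.filter1,
                           pathogen.disgust.mean = (pathogen1 + pathogen2 + pathogen3 + pathogen4 +
                                                    pathogen5 + pathogen6 + pathogen7) / 7)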
Other tidyverse Functions
We have covered three basic tidyverse functions here, but there are many more that are useful for data cleaning. These include functions to separate and unite variables, or reshape the data. To see some extra tidyverse functions, see this page for more content, or for a full list of tidyverse functions, check out the tidyverse cheatsheets.
Pipes (%>%)
Often when cleaning data you will have to perform a number of operations on a dataset one after another. One way to do this is to list the operations in sequence, as we have done above. I have re-created the code below.
data.select <- select(data,program,stats.exp,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7)
data.filter1 <- filter(data.select,program == "Health MSc")
data.mutate <- mutate(data.filter1,pathogen.disgust = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)
There is nothing wrong with the code above, but writing it this way can make your code hard to read. It also requires you to save intermediate datasets as objects, which can clutter up your workspace.
A pipe (%>%) is a special type of function used to string multiple commands in R together. Pipes can be confusing at first, but they will make your code more readable and efficient, so they are worth learning if you want to become an expert R coder. The pipe is loaded when you load the tidyverse package.
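Before applying pipes to our dataset, here is a minimal sketch with plain numbers; the two commands below produce the same result:
# Without a pipe: functions are nested, and the innermost one runs first
sum(sqrt(c(1, 4, 9)))
## [1] 6
# With a pipe: the result of each step is passed to the first argument of the next function
c(1, 4, 9) %>% sqrt() %>% sum()
## [1] 6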
Pipes work by passing an intermediate object from one function on to the next; more specifically, the output from one function is passed to the first argument of the next function. One way to conceptualise a pipe is to think of it as an “AND THEN” statement. Tidyverse functions return a data.frame and accept a data.frame as their first argument, so they are ideal for chaining together using a pipe. The data-cleaning code above therefore becomes the following when using pipes:
data.analysis <- data %>% #Select our raw dataset
select(program,stats.exp,pathogen1,pathogen2,pathogen3, #Select the variables
pathogen4,pathogen5,pathogen6,pathogen7) %>%
filter(program == "Health MSc") %>% #Select Health MSc participants
mutate(pathogen.disgust = pathogen1 + pathogen2 + #Compute pathogen disgust score
pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)
The real advantage of piping is apparent when dealing with nested functions (i.e., functions that take arguments which themselves include a function). So for instance, if you wanted to do the above without saving intermediate datasets, you would end up with the code below, which is a hot mess.
hot.mess <- mutate(select(filter(data,program == "Health MSc"),program,stats.exp,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7),pathogen.disgust = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)