Overview - Data Cleaning

Data cleaning is the process of getting data ready for analysis. It usually involves selecting the variables you will use in the analysis, filtering out participants who don’t meet certain criteria, and computing composite variables that will be used in analyses. However, it can also include more complex operations, such as reshaping the data or joining data from different data files.

In psychology, there are three key principles to follow when cleaning a data.frame:

Make sure each participant is on a separate row.
Make sure each variable (or measurement) is in a separate column.
Make sure each cell has one and only one value.
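
As a rough illustration of these three principles, here is a small made-up data.frame. The name toy and all of its values are invented for this example and are not taken from the class dataset.

```r
library(tidyverse)

# Hypothetical data: three participants, three variables
toy <- tibble(
  student.no = c(1, 2, 3),
  program    = c("Health MSc", "Other MSc", "Health MSc"),
  stats.exp  = c(7, 3, 8)
)

nrow(toy)  # 3: one row per participant
ncol(toy)  # 3: one column per variable
```

Each cell holds exactly one value, for example a single program name rather than two programs squeezed into one cell.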

Recap: Tidyverse

Recall from last week that the tidyverse package provides handy functions for data cleaning and manipulation. These functions work like all other functions: they take a number of arguments, such as a data.frame and instructions, and then return an output, which is often another data.frame. In this workbook, we will cover three such functions that are commonly used for data cleaning.

Make sure to load the tidyverse package before starting this workbook.

library(tidyverse)

select()

Use the select() function to extract columns from a table. Often when you load a dataset, some columns/variables will not be relevant to the analysis you want to conduct. select() is helpful when a dataset contains many columns and you only want to keep the variables relevant to your analysis.

For example, the class dataset used in these workbooks includes a large number of variables; however, we will only be working with a few of them in any given week. Therefore, we may want to select only those variables so that the dataset is easier to work with. In the code below, we are only selecting the columns that we are using this week.

data.select <- select(data,student.no,program,stats.exp,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7)
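
If you do not have the class dataset to hand, the same idea can be tried on a small made-up table. This is only a sketch: toy, toy.select, and the favourite.colour column are hypothetical names invented for the example.

```r
library(tidyverse)

# Hypothetical data with one column we do not need (favourite.colour)
toy <- tibble(
  student.no       = c(1, 2, 3),
  program          = c("Health MSc", "Other MSc", "Health MSc"),
  favourite.colour = c("red", "blue", "green")
)

# Keep only the two columns of interest
toy.select <- select(toy, student.no, program)

names(toy.select)  # only student.no and program remain
```

Note that select() does not change toy itself; it returns a new data.frame, which is why the result is saved under a new name.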

filter()

Just as the select() function keeps certain columns, the filter() function keeps certain rows of a table. This is useful if you need to exclude participants from your analysis for some reason.

To use the filter() function, you will need to specify a certain condition (or test). If a row passes that test, then it is kept. If a row fails that test, then it is excluded.

For example, let’s say that we are only interested in responses from participants completing the Health Psychology MSc. We can use the filter function to only select these participants.

data.filter1 <- filter(data.select,program == "Health MSc")

What if we were interested in the reverse? Here is the code to select responses NOT from Health MSc participants.

data.filter2 <- filter(data.select,program != "Health MSc")
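
Here is a minimal sketch of both filters on made-up data. The table toy and the "Other MSc" label are invented for illustration.

```r
library(tidyverse)

# Hypothetical data: four participants from two programs
toy <- tibble(
  student.no = 1:4,
  program    = c("Health MSc", "Other MSc", "Health MSc", "Other MSc")
)

toy.health <- filter(toy, program == "Health MSc")  # rows that pass the test
toy.other  <- filter(toy, program != "Health MSc")  # rows that fail it

nrow(toy.health)  # 2
nrow(toy.other)   # 2
```

Because the second test is the exact reverse of the first, every row in this toy table ends up in one of the two results.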

Operators

This is a good time to talk about logical operators. Logical operators are used to write expressions that, when evaluated, return either a TRUE or FALSE value.

You have already been exposed to two logical operators in the code above, but here are a few more:

Operator	Description
`==`	Exactly equal to
`!=`	Not equal to
`<`	Less than
`>`	Greater than
`<=`	Less than or equal to
`>=`	Greater than or equal to

These are some of the basic ones, but there are more. For a full list of operators, click here.

Notice that some of these operators (such as < and >) are mostly used with numerical values. So for example, in the following numerical vector, we can test whether each number is less than 3.

c(1,2,3,4,5) < 3

## [1]  TRUE  TRUE FALSE FALSE FALSE

What we find is that the first two values are less than 3, so they return TRUE, while the rest are not less than 3 (including 3 itself!), and therefore return FALSE.

This can be used to filter participants based on a numerical threshold. For example, in the class dataset, if we were only interested in participants with a lot of experience in statistics, we could use the greater than or equal to operator (>=) to select participants who rated this item a 7, 8, or 9.

data.filter3 <- filter(data.select,stats.exp >= 7)
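
To see how the logical test and filter() fit together, the sketch below runs the test on its own first and then inside filter(). The table toy and its ratings are made up for the example.

```r
library(tidyverse)

# Hypothetical ratings of statistics experience
toy <- tibble(
  student.no = 1:5,
  stats.exp  = c(2, 7, 9, 6, 8)
)

# The test on its own returns one TRUE/FALSE per row
toy$stats.exp >= 7
## [1] FALSE  TRUE  TRUE FALSE  TRUE

# filter() keeps the rows where the test is TRUE
toy.experienced <- filter(toy, stats.exp >= 7)
toy.experienced$student.no  # 2 3 5
```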

Removing Participants with Missing Data

A common task when cleaning data is removing participants who have missing data on a variable. As covered last week, R codes missing data as NA by default, and the is.na() function tells you whether a value is missing. We can use this function within the filter() function to remove participants who have missing data on a variable.

data_remove.missing <- filter(data,!is.na(var1))

In the code above, the ! can be translated to not. So here we are only including participants that do NOT have missing data for var1.
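
A small sketch of the same idea on made-up data follows. The table toy is hypothetical; var1 is used as a placeholder variable name, as in the code above.

```r
library(tidyverse)

# Hypothetical data: two participants are missing var1
toy <- tibble(
  student.no = 1:4,
  var1       = c(5, NA, 3, NA)
)

is.na(toy$var1)
## [1] FALSE  TRUE FALSE  TRUE

# Keep only the rows where var1 is NOT missing
toy.complete <- filter(toy, !is.na(var1))
toy.complete$student.no  # 1 3
```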

Note: it is possible that your data could code missing data as a different value (e.g., a common practice is to code missing values as ‘-99’) - in these instances you would want to remove participants with this value, rather than use is.na().
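
As a hedged sketch of how this might look, the example below assumes missing values were coded as -99 in a made-up table called toy. It shows two options: filtering the value out directly, or first recoding it to NA with the na_if() function from dplyr (part of the tidyverse).

```r
library(tidyverse)

# Hypothetical data where -99 stands for "missing"
toy <- tibble(
  student.no = 1:4,
  var1       = c(5, -99, 3, -99)
)

# Option 1: drop the rows coded as -99
toy.kept <- filter(toy, var1 != -99)
nrow(toy.kept)  # 2

# Option 2: recode -99 to NA, so that is.na() works as usual afterwards
toy.recoded <- mutate(toy, var1 = na_if(var1, -99))
sum(is.na(toy.recoded$var1))  # 2
```

The second option keeps all rows, which can be handy if the participant has usable data on other variables.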

Also note: another useful function to remove participants with missing data is na.omit(). This function removes all participants that have a missing value on any of the variables in the dataset. However, this function must be used cautiously; if participants are missing data on a variable not important to the analysis, then they will also be removed when they could have been kept in the dataset!
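
The difference between the two approaches can be seen in the made-up example below, where toy and the notes column are invented for illustration. One participant is missing only notes, a variable we might not care about.

```r
library(tidyverse)

# Hypothetical data: row 1 is missing notes, row 2 is missing var1
toy <- tibble(
  student.no = 1:3,
  var1       = c(5, NA, 3),
  notes      = c(NA, "late", "ok")
)

# na.omit() drops any row with a missing value in ANY column
nrow(na.omit(toy))                # 1

# filter() with is.na() only looks at the variable we name
nrow(filter(toy, !is.na(var1)))   # 2
```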

mutate()

We can use the mutate() function to calculate a new variable (column) from existing variables. mutate() applies vectorised operations, which means the calculation is carried out separately for each row. This is useful for making aggregate scores (e.g., summing all items in a scale).

In the example below, we wish to calculate a total score of pathogen disgust from 7 separate items (Click here for further details on the disgust scale). We can add the responses to each item of the pathogen disgust scale to compute each participant’s pathogen disgust score. Let’s do it on the dataset with only Health MSc participants:

data.mutate <- mutate(data.filter1,
                      pathogen.disgust = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)

Let me draw your attention to a few things with the code above. First, like all tidyverse functions, the first argument of the mutate() function is the data.frame we want to calculate the new variable in. Second, we specify the name of the new variable, followed by how that variable will be calculated. This syntax is very similar to that used by the summarise() function last week.

This is the data.frame we created using the mutate function. If you scroll all the way to the right, you will see there is a new column…
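
If you want to check the arithmetic by hand, here is a shortened sketch with made-up responses. It uses only three items instead of seven, and the names toy and pathogen.total are hypothetical.

```r
library(tidyverse)

# Hypothetical responses to three items for three participants
toy <- tibble(
  student.no = 1:3,
  pathogen1  = c(1, 4, 6),
  pathogen2  = c(2, 5, 0),
  pathogen3  = c(3, 3, 6)
)

# The sum is computed row by row, giving one total per participant
toy.mutate <- mutate(toy, pathogen.total = pathogen1 + pathogen2 + pathogen3)

toy.mutate$pathogen.total  # 6 12 12
```

The new column is added at the right-hand end of the data.frame, and the original item columns are kept.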

Other tidyverse Functions

We have covered three basic tidyverse functions here, but there are many more that are useful for data cleaning. These include functions to separate and unite variables, or reshape the data. To see some extra tidyverse functions, see this page for more content, or for a full list of tidyverse functions, check out the tidyverse cheatsheets.

Pipes (%>%)

Often when data cleaning you will have to apply a number of operations to a dataset in sequence. One way to do this is to list the operations one after another, like we have done above. I have re-created the code below.

data.select <- select(data,program,stats.exp,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7)
data.filter1 <- filter(data.select,program == "Health MSc")
data.mutate <- mutate(data.filter1,pathogen.disgust = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)

There is nothing wrong with the code above, but sometimes doing this can make your code hard to read. Also, you are required to save intermediate datasets as objects, which can clutter up your workspace.

A pipe (%>%) is a special type of function used to string multiple commands in R together. Pipes can be confusing at first, but they will make code more readable and efficient, so they are worth learning if you want to become an expert R coder. You can load the pipe by loading the tidyverse package.

A pipe works by passing an intermediate object from one function on to the next function. More specifically, the output from one function is passed to the first argument of the next function. One way to conceptualise it is to think of it as an “AND THEN” statement. Tidyverse functions return a data.frame and accept a data.frame as their first argument, so they are ideal for chaining together using a pipe. So, the code above becomes the following when using pipes:

data.analysis <- data %>%                                     #Select our raw dataset
  select(program,stats.exp,pathogen1,pathogen2,pathogen3,     #Select the variables
         pathogen4,pathogen5,pathogen6,pathogen7) %>%
  filter(program == "Health MSc") %>%                         #Select Health MSc participants
  mutate(pathogen.disgust = pathogen1 + pathogen2 +           #Compute pathogen disgust score
           pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)
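
To make the "first argument" rule concrete, the sketch below starts with a single function and then compares a nested call with its piped version on made-up data. The names toy, nested, piped, and pathogen.total are invented for this example, and only two items are used to keep it short.

```r
library(tidyverse)

# The pipe hands its left-hand side to the first argument of the right-hand side,
# so these two lines do the same thing
sqrt(c(1, 4, 9))
c(1, 4, 9) %>% sqrt()

# Hypothetical data
toy <- tibble(
  program   = c("Health MSc", "Other MSc", "Health MSc"),
  pathogen1 = c(1, 2, 3),
  pathogen2 = c(4, 5, 6)
)

# Nested version: read from the inside out
nested <- mutate(filter(toy, program == "Health MSc"),
                 pathogen.total = pathogen1 + pathogen2)

# Piped version: read from top to bottom ("and then")
piped <- toy %>%
  filter(program == "Health MSc") %>%
  mutate(pathogen.total = pathogen1 + pathogen2)

piped$pathogen.total  # 5 9
```

Both versions produce the same data.frame; the piped version simply lists the steps in the order they happen.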

The real advantage of piping is apparent when dealing with nested functions (i.e., functions that take arguments which themselves include a function). So for instance, if you wanted to do the above without saving intermediate datasets, you would end up with the code below, which is a hot mess.

hot.mess <- mutate(select(filter(data,program == "Health MSc"),program,stats.exp,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7),pathogen.disgust = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7)

HPSP131 - Workbook 3 - Data Skills: Cleaning
