Objectives

The aim of this week is to introduce the data skills necessary for cleaning a data set. We will also cover further R basics, such as packages and the working directory. By the end of this workbook, you should be able to:

  1. Understand scripts.
  2. Install and load packages.
  3. Understand the working directory.
  4. Import a data set from the working directory (CSV and Excel files).
  5. Calculate descriptive statistics for a variable using the summarise() function, including:
  • Mean
  • Standard Deviation

Class Data

Click here to download the data used in these workbooks. (You may need to right-click and select the ‘Save Linked File’ option.)

Content

Scripts

One of the advantages of using R to clean and analyse data is the use of scripts. Scripts are where you can save code used to manipulate a data set for analysis. As a reminder, a script is different from the console; the script is where you save code for later use, while the console is where you would run the code. If you would like to run code saved in a script, you will need to copy it or “send” it to the console. In RStudio, the script is in the top left window, while the console is the bottom left window.

Having a script is advantageous because:

  1. Scripts reduce human error.
  2. Scripts keep a record of manipulations on data sets (again, reducing errors, data corruption, and reliance on your own memory).
  3. Scripts are easily shareable and reproducible.
  4. A change can be introduced in the analysis pipeline without re-doing the whole process.

A good way to structure a script is to follow the steps below. There is no obligation to follow this convention for your own scripts, but I would highly suggest sticking to it to begin with:

  1. Load in any packages you need to use (more on this below).
  2. Import the data you will be working with (more on this below).
  3. Prepare the data for analysis (e.g., cleaning your data ready for analysis).
  4. Analyse your data (e.g., conduct statistical tests, create visualisations of the data, etc. More on this in future Demonstrations).
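As a rough illustration of this four-step layout, here is a minimal sketch of a script. It assumes the tidyverse package is already installed, and it uses an example file that ships with the readr package (located with readr_example()) as a stand-in for your own data. The object name cars and the choice of the mpg variable are purely illustrative. The functions used are explained in the sections that follow.

```r
# 1. Load packages
library(tidyverse)

# 2. Import data
# (an example file bundled with readr stands in for your own data file)
cars <- read_csv(readr_example("mtcars.csv"))

# 3. Prepare the data
# (cleaning steps would go here; as a simple check,
#  count the missing values in one variable)
sum(is.na(cars$mpg))

# 4. Analyse the data
summarise(cars, mpg.mean = mean(mpg, na.rm = TRUE))
```

The numbered comments make it easy to find each stage again when you come back to the script later.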

A new script can be created in RStudio by selecting the following in the drop-down menus:

File >> New File >> R Script

You can also save and open R scripts under the File tab. Scripts are opened in the text editor window (top left window) of RStudio.

A handy thing to know is that you can make comments on your code in a script by using the ‘#’ symbol. Anything after this symbol on a line will not be treated as code. This is handy for making notes to yourself about what your code does, or making your code more readable. Here are some examples of using comments:

#This line will not run because it is commented out.

x <- 2 #This line saves the number 2 as the object 'x'.

There are no limits to the number of comments you can have in a script, so if it helps you remember what each bit of code does, add as many comments as you want!

Packages

Base R comes with a lot of in-built functions, but one of the great things about R is that there is a large community writing functions and including them in packages for free. Packages are like add-ons that can expand the number of functions you can use in R. There are thousands of packages that different R users have created to solve many different kinds of problems. If you want to do something in R, it is highly likely there’s already a package for that.

Packages have to be downloaded and installed onto your computer before you can use them in R. You can install packages from the main R repository (CRAN) using the install.packages() function.

There is an important distinction between installing a package and loading a package. Installing a package is done using install.packages(). This is like installing an app on your smartphone: you only have to do it once and the app will remain installed until you remove it. Likewise, when you install a package, the package will be available (but not loaded) every time you open up R.

Once a package has been installed, you can load a package using the library() function. This is like launching an app on your phone. Likewise, if you were to run the code library(packagename), you will load all the functions in the package called ‘packagename’ and they will be available to you for the rest of your R session. However, the next time you start R, you will need to run the library(packagename) function again if you want access to those functions again.

A package we will use frequently is called ‘tidyverse’. ‘tidyverse’ is actually a group of packages that are useful for cleaning and organising data.frames. Install and load ‘tidyverse’ using the following commands. When installing a package, you may be asked to select a CRAN mirror. This is the server that packages will be downloaded from. Select a server in the UK. Remember, you only need to install the ‘tidyverse’ package onto your computer once, but you will need to load it every time you start a new R session.

install.packages("tidyverse")   #This line installs tidyverse onto your computer

library(tidyverse)              #This line loads tidyverse for this session
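If a script is shared with someone else, or run on several computers, re-installing the package on every run is wasteful. One common pattern, sketched below, is to install the package only when it is missing. requireNamespace() is a base R function that checks whether a package is installed without loading it. This sketch assumes an internet connection is available if installation turns out to be needed.

```r
# Install tidyverse only if it is not already installed on this computer
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Loading still has to happen in every new R session
library(tidyverse)
```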

The Working Directory

Often in R, you will need to import files saved on your computer, such as data sets. When analysing data in R, you usually want to have all of your scripts and data files in one folder of your computer. This folder is known as the working directory. It is best practice for different projects to be saved in different folders, and therefore each has a separate working directory.

To change your working directory, from the drop-down menus click on:

Session >> Set Working Directory >> Choose Directory…

Then select the desired folder.

Alternatively, you can use the setwd() function within a script. Just note that if you change computers, the file directory path will most likely be different.

Your working directory will be displayed in the bottom right window of RStudio when you click on the ‘Files’ tab.
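The working directory can also be checked and changed from the console. In the small sketch below, getwd() reports the current working directory and setwd() changes it. The temporary folder returned by tempdir() is used only so that the example runs on any computer; in practice you would give the path to your own project folder. The object name old.dir is just for illustration.

```r
old.dir <- getwd()   # remember the current working directory
old.dir              # print it

setwd(tempdir())     # switch to another folder (here, R's temporary folder)
getwd()              # confirm that the working directory has changed

setwd(old.dir)       # switch back to where we started
```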

Importing Data

Data can be saved as many different file types on your computer. These file types are usually distinguished by the short extension (typically three or four letters) following a period at the end of the file name. Here are some examples of common data file types and the functions you would use to read them in.

Extension      File type               Package     Import function
.csv           Comma-separated values  tidyverse   read_csv()
.xls, .xlsx    Excel file              readxl      read_excel()
.sav           SPSS data file          foreign     read.spss()
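Excel files need one extra step that is easy to miss: readxl is installed along with tidyverse, but library(tidyverse) does not load it, so it has to be loaded on its own. Below is a minimal sketch, using an example workbook that ships with readxl (found with readxl_example()) as a stand-in for your own file. The object names example.path and excel.data are just for illustration.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
example.path <- readxl_example("datasets.xlsx")

# Read the workbook into a data.frame (the first sheet is read by default)
excel.data <- read_excel(example.path)

head(excel.data)   # look at the first few rows
```

With your own data, you would replace example.path with the file name in quotation marks. The sheet argument of read_excel() selects a sheet other than the first.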

In general, I would suggest using a .csv file. This is advantageous because it can be read by many programs and is not tied to a specific program (unlike an SPSS or Excel file), which future-proofs your data. Most programs, such as SPSS and Excel, can also read .csv files. Regardless of which file type you are importing, R expects the data to be in a certain format. Namely, each variable should be in a separate column, and the variable names should be in the first row. If your data is not in this exact format (e.g., if there is any blank space in your data file), then R may not read it properly. You may need to edit your data file manually to make sure it matches this format (there are ways you can do it in R, but for beginners, it is generally easier to do this bit manually). You can save a data set to your computer in R using the write_csv() function.

Note, you can also read and write .csv files using the base R functions read.csv() and write.csv() respectively; however, these functions have some default settings that can cause headaches in the future, so it is best to avoid them. Also note, the read_csv() function will output some red text that looks like an error message, but this is just R telling you how it interpreted all the variables, and usually means everything is okay.

Remember to download the data for this week’s demonstration and put it in your working directory before running the code below. Note that the class data set is already in the correct format.

#Remember to load the tidyverse library if you haven't already.
#The exact directory the file is in may differ depending on where the data file is on your computer.

data <- read_csv("data_2023.csv")

Once the data is loaded, you can view the data.frame by either clicking on the data.frame icon in the work space, or using the View() function in the console. This will open a tab in the top left window, and clicking on this will show you the data.frame. This can be handy for making sure your data has loaded correctly, or for double-checking whether the functions used to clean your data have worked as intended.
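Saving a data set works like importing in reverse. The sketch below writes a small, made-up data.frame to a .csv file with write_csv() and reads it straight back in with read_csv(). The names toy.data, toy.path and toy.reloaded are invented for the example, and the file is written to R's temporary folder (tempdir()) only so that the code runs anywhere. For your own work, you would usually give a file name in your working directory.

```r
library(tidyverse)

# A small made-up data set with one missing value
toy.data <- tibble(id = c(1, 2, 3),
                   score = c(4.5, NA, 6))

# Where to save the file (a temporary folder, so the example runs anywhere)
toy.path <- file.path(tempdir(), "toy_data.csv")

write_csv(toy.data, toy.path)        # save the data.frame as a .csv file
toy.reloaded <- read_csv(toy.path)   # read the same file back in

toy.reloaded                         # the missing value comes back as NA
```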

Calculating Descriptive Statistics

In the lecture series, we covered measures of central tendency and dispersion. In particular, important statistics that you will need to calculate often are the mean and the standard deviation of a continuous variable. Thankfully, there are simple functions in R that can calculate these for you. In both cases, the first argument of the function accepts a numeric vector.

In the following examples, we will calculate the mean and standard deviation for the following numeric vectors:

prime.numbers <- c(1,2,3,5,7,11,13,17,19,53)

prime.numbers.na <- c(1,NA,3,5,7,NA,17,NA,53)

Calculating the Mean

The function that calculates the mean in R is called mean(). The first argument for this function accepts a numeric vector, which is the list of numbers you want to calculate the mean of.

So to calculate the mean for the prime.numbers vector above, we can use the following code:

mean(prime.numbers)
## [1] 13.1

So, the mean for the numbers in prime.numbers is 13.10.

This seems simple enough; however, if we try the same code to compute the mean for the vector prime.numbers.na, R returns NA instead of a number. This is because prime.numbers.na has missing values, and by default the mean of a vector containing a missing value is itself treated as missing. R considers any value coded as NA as a missing value. Values can be missing for many reasons; sometimes the missingness is meaningful, but other times it can be ignored. To ignore missing values in the mean() function, we can set the na.rm argument to TRUE:

mean(prime.numbers.na, na.rm = TRUE)
## [1] 14.33333
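To see what mean() does with missing values when na.rm is left at its default, and to check how many values are missing in the first place, the following short sketch may help. It re-creates the vector so that it can be run on its own.

```r
prime.numbers.na <- c(1, NA, 3, 5, 7, NA, 17, NA, 53)

mean(prime.numbers.na)                 # returns NA: missing values are not ignored by default
sum(is.na(prime.numbers.na))           # returns 3: the number of missing values
mean(prime.numbers.na, na.rm = TRUE)   # the mean of the six values that are not missing
```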

Calculating the Standard Deviation

The Standard Deviation of a vector can be calculated in the exact same way as the mean, except we use the sd() function.

So to calculate the standard deviation for the prime.numbers vector, we use the code:

sd(prime.numbers)
## [1] 15.35108

So the standard deviation of prime.numbers is 15.35.

Similarly, the sd() function returns NA if the vector contains missing values. So to calculate the standard deviation of prime.numbers.na, we again set na.rm to TRUE:

sd(prime.numbers.na, na.rm = TRUE)
## [1] 19.74504
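For reference, sd() calculates the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n. The sketch below reproduces the result by hand for prime.numbers, so you can connect the function to the formula from the lectures. The object names n and manual.sd are just for illustration.

```r
prime.numbers <- c(1, 2, 3, 5, 7, 11, 13, 17, 19, 53)

n <- length(prime.numbers)   # number of values in the vector

# Square each deviation from the mean, add them up,
# divide by n - 1, then take the square root
manual.sd <- sqrt(sum((prime.numbers - mean(prime.numbers))^2) / (n - 1))

manual.sd                                 # the value calculated by hand
all.equal(manual.sd, sd(prime.numbers))   # TRUE: it matches sd()
```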

Other Descriptive Statistic Functions

Here are some other functions that may come in handy… why not try some of them out? There are many more functions that exist in R. If you ever need to calculate a certain descriptive statistic for a variable, there is probably a function for it, and you can discover what it is with a bit of Google-ing.

Function   Description
max()      Returns the largest value in the vector.
min()      Returns the smallest value in the vector.
sum()      Adds all the numbers in the vector together.
is.na()    Returns a logical vector telling you whether each element is missing.
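Here is a quick sketch of these functions applied to the two vectors used earlier (re-created here so the code runs on its own). Note that max(), min() and sum() handle missing values in the same way as mean(): they return NA unless na.rm is set to TRUE.

```r
prime.numbers <- c(1, 2, 3, 5, 7, 11, 13, 17, 19, 53)
prime.numbers.na <- c(1, NA, 3, 5, 7, NA, 17, NA, 53)

max(prime.numbers)   # 53
min(prime.numbers)   # 1
sum(prime.numbers)   # 131

is.na(prime.numbers.na)               # TRUE wherever a value is missing
max(prime.numbers.na)                 # NA, because of the missing values
max(prime.numbers.na, na.rm = TRUE)   # 53
```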

table() Function

Another useful function is the table() function. This function tells you how often each value appears in a vector. This can be handy if you need to quickly see the separate values that exist for a variable, or quickly count how often a value appears (e.g., if you need to work out which value is the mode).

Here is an example of using the table() function with a vector:

categories <- c("CatA","CatB","CatA","CatC","CatB","CatA")

table(categories)
## categories
## CatA CatB CatC 
##    3    2    1

So for instance, above, we can quickly see that there are three separate values in the vector categories, and the one that appears most frequently (i.e., is the mode) is CatA.
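If you would rather have R pick out the mode than read it off the table, one possible approach is sketched below using only base R functions. The object names cat.values and cat.counts are made up for this example. Be aware that which.max() only reports the first value if two or more values are tied for most frequent.

```r
cat.values <- c("CatA", "CatB", "CatA", "CatC", "CatB", "CatA")

cat.counts <- table(cat.values)       # frequency of each value

sort(cat.counts, decreasing = TRUE)   # most frequent value listed first
names(which.max(cat.counts))          # "CatA": the value with the highest count
```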

We can also use the table() function for variables in data.frames. An example of this is below, using the class data and the question about favourite Australian animals. To read the code below, note that the $ symbol is used to index (i.e., pick out) the variable we want to look at, aus.animal, from the data.frame called data. See the extra materials for more information on indexing.

table(data$aus.animal)
## 
##  echidna kangaroo    koala platypus   wombat 
##       10       13       24       18       15

So above, we see that there are five separate values (if we ignore missing data), and the most frequent response was koala.
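By default, table() leaves missing values out of the counts. If you want to know how many responses are missing, the useNA argument of table() can include them. The sketch below uses a short made-up vector (animal.votes) rather than the class data, so it can be run on its own.

```r
# A made-up vector of responses, two of which are missing
animal.votes <- c("koala", "wombat", NA, "koala", NA)

votes.default <- table(animal.votes)                   # missing values are left out
votes.with.na <- table(animal.votes, useNA = "ifany")  # missing values get their own count

votes.default
votes.with.na
```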

Tidyverse Functions

The tidyverse package provides handy functions for data cleaning and manipulation. Remember, these functions work like all other functions: they take a number of arguments, such as a data.frame and instructions, then give you an output that can be saved as an object, such as another data.frame. In this week’s demonstration, we will cover one function that you can use to calculate descriptive statistics; however, there are many more tidyverse functions. We will cover a few more next week, but for a full list, see the tidyverse cheatsheets, or view the help documentation using help().

All tidyverse functions to do with data cleaning follow the same structure:

function_name(data.frame,instructions)

where the first argument is always the data.frame you wish to perform the function on, and the remaining arguments are instructions on how you wish to perform that function.

summarise()

Functions like mean() and sd() are great when you want to calculate descriptive statistics from a vector, but more often than not, you will need to get these values from a variable in a data set. One way to do this is to use the summarise() function from the tidyverse package.

So how does the summarise() function work? This function takes multiple arguments. The first argument of the function is the data.frame that has the variables you’re interested in. The remaining arguments tell R which descriptive statistics you would like it to calculate (i.e., multiple descriptive statistics can be calculated at once). You can tell summarise() which stats you want it to calculate using the same functions we covered above (e.g., mean(), sd(), etc.). In return, summarise() will return another data.frame with the descriptive statistics you’ve asked for. Therefore, for each stat that you calculate, you will need to provide a new variable name. For each descriptive stat that you want to calculate using summarise(), you will need to provide an additional argument that takes the following form:

variable_name = function(variable)

This may sound confusing, but it helps to see it in action. Below is the code to calculate the mean and standard deviation of the variable stats.exp in the data.frame called data.

summarise(data,
          stats_exp.mean = mean(stats.exp, na.rm = TRUE),
          stats_exp.sd = sd(stats.exp, na.rm = TRUE))
## # A tibble: 1 × 2
##   stats_exp.mean stats_exp.sd
##            <dbl>        <dbl>
## 1           4.29         2.06

A couple of things to note. First, I have broken the above code across several lines. This is just to make it easier to read (remember R ignores white space inside an incomplete command). Second, it doesn’t matter what you name the new variable (but ideally you want a label that is helpful).
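If you do not have the class data to hand, the same pattern can be tried on a small made-up data.frame. In the sketch below, toy.data, score, score.mean and score.sd are all invented names, and the result is saved as an object (toy.summary) to show that the output of summarise() is itself a data.frame.

```r
library(tidyverse)

# A made-up data set with one missing value
toy.data <- tibble(score = c(2, 4, NA, 6))

toy.summary <- summarise(toy.data,
                         score.mean = mean(score, na.rm = TRUE),
                         score.sd = sd(score, na.rm = TRUE))

toy.summary   # one row: a mean of 4 and a standard deviation of 2
```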

As mentioned, you are not limited in the number of stats you can compute using the summarise() function, so by expanding the code above, we can calculate the means and standard deviations of multiple variables:

summarise(data,
          stats_exp.mean = mean(stats.exp, na.rm = TRUE),
          stats_exp.sd = sd(stats.exp, na.rm = TRUE),
          stats_anx.mean = mean(stats.anx, na.rm = TRUE),
          stats_anx.sd = sd(stats.anx, na.rm = TRUE))
## # A tibble: 1 × 4
##   stats_exp.mean stats_exp.sd stats_anx.mean stats_anx.sd
##            <dbl>        <dbl>          <dbl>        <dbl>
## 1           4.29         2.06           5.59         2.30

Note: An annoying thing about the summarise() function is that sometimes it will not show results to two decimal places (which is the requirement for APA format). To see all the decimal places, you can View() the data.frame produced by the summarise() function.
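Another option, sketched below, is to save the output of summarise() as an object, pull out the value with the $ symbol, and round it yourself using the base R function round(). The data and the names toy.data, toy.summary and score.sd are made up for the example.

```r
library(tidyverse)

# A made-up data set
toy.data <- tibble(score = c(1, 2, 2, 5))

toy.summary <- summarise(toy.data, score.sd = sd(score))

toy.summary$score.sd             # the value with more decimal places shown
round(toy.summary$score.sd, 2)   # rounded to two decimal places: 1.73
```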

Exercises

Now that you’ve completed this week’s demonstration, why not give this week’s exercises a go? You can download the interactive exercises by clicking the link below.

Click here to download this week’s exercises.