Scripts
One of the advantages of using R to clean and analyse data is the use of scripts. Scripts are where you can save code used to manipulate a data set for analysis. As a reminder, a script is different from the console; the script is where you save code for later use, while the console is where you would run the code. If you would like to run code saved in a script, you will need to copy it or “send” it to the console. In RStudio, the script is in the top left window, while the console is the bottom left window.
Having a script is advantageous because:
- Scripts reduce human error.
- Scripts keep a record of manipulations on data sets (again, reducing errors and data corruption, or reliance on your own memory).
- Scripts are easily shareable and reproducible.
- A change can be introduced in the analysis pipeline without re-doing the whole process.
A good way to structure a script is to follow the steps below. There is no obligation to follow this convention for your own scripts, but I would highly suggest sticking to it to begin with:
- Load in any packages you need to use (more on this below).
- Import the data you will be working with (more on this below).
- Prepare the data for analysis (e.g., cleaning your data ready for analysis).
- Analyse your data (e.g., conduct statistical tests, create visualisations of the data, etc. More on this in future Demonstrations).
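As a rough illustration of these four steps, here is a minimal sketch of a script. The data.frame raw and its values are made up for illustration and stand in for a file you would normally import; only base R functions are used so that it runs without installing anything.

```r
# 1. Load packages
# (none are needed for this sketch; a real script might call library(tidyverse) here)

# 2. Import data
# A small made-up data.frame stands in for an imported file.
raw <- data.frame(id = 1:5, score = c(3, 7, NA, 5, 9))

# 3. Prepare the data: keep only the rows where score is not missing
clean <- raw[!is.na(raw$score), ]

# 4. Analyse the data: mean and standard deviation of the cleaned scores
score_mean <- mean(clean$score)
score_sd <- sd(clean$score)
score_mean
score_sd
```

Keeping the steps in this order means a change to the cleaning step can be re-run from the top without redoing anything by hand.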
A new script can be created in RStudio by selecting the following in the drop-down menus:
File >> New File >> R Script
You can also save and open R scripts under the File tab. Scripts are opened in the text editor window (top left window) of RStudio.
A handy thing to know is that you can make comments on your code in a script by using the ‘#’ symbol. Anything after this symbol on a line will not be treated as code. This is handy for making notes to yourself about what your code does, or making your code more readable. Here are some examples of using comments:
#This line will not run because it is commented out.
x <- 2 #This line saves the number 2 as the object 'x'.
There are no limits to the number of comments you can have in a script, so if it helps you remember what each bit of code does, add as many comments as you want!
Packages
Base R comes with a lot of in-built functions, but one of the great things about R is that there is a large community writing functions and including them in packages for free. Packages are like add-ons that can expand the number of functions you can use in R. There are thousands of packages that different R users have created to solve many different kinds of problems. If you want to do something in R, it is highly likely there’s already a package for that.
Packages have to be downloaded and installed onto your computer before you can use them in R. You can install packages from the main R repository (CRAN) using the install.packages() function.
There is an important distinction between installing a package and loading a package. Installing a package is done using install.packages(). This is like installing an app on your smartphone: you only have to do it once and the app will remain installed until you remove it. Likewise, when you install a package, the package will be available (but not loaded) every time you open up R.
Once a package has been installed, you can load a package using the library() function. This is like launching an app on your phone. Likewise, if you were to run the code library(packagename), you will load all the functions in the package called ‘packagename’ and they will be available to you for the rest of your R session. However, the next time you start R, you will need to run the library(packagename) function again if you want access to those functions again.
A package we will use frequently is called ‘tidyverse’. ‘tidyverse’ is actually a group of packages that are useful for cleaning and organising data.frames. Install and load ‘tidyverse’ using the following commands. When installing a package, you may be asked to select a CRAN mirror. This is the server that packages will be downloaded from. Select a server in the UK. Remember, you only need to install the ‘tidyverse’ package onto your computer once, but you will need to load it every time you start a new R session.
install.packages("tidyverse") #This line installs tidyverse onto your computer
library(tidyverse) #This line loads tidyverse for this session
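If you are unsure whether a package is already installed, one possible check is sketched below. The helper name is_installed is made up for this example; requireNamespace() is a base R function. The sketch only reports whether a package is available and does not install anything itself.

```r
# Hypothetical helper: TRUE if the package can be found and loaded on this
# computer, FALSE otherwise. It does not attach the package's functions.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so this should always be available.
stats_available <- is_installed("stats")
stats_available

# For tidyverse, print a reminder instead of installing automatically.
if (!is_installed("tidyverse")) {
  message("tidyverse is not installed yet: run install.packages(\"tidyverse\") once.")
}
```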
The Working Directory
Often in R, you will need to import files saved on your computer, such as data sets. When analysing data in R, you usually want to have all of your scripts and data files in one folder of your computer. This folder is known as the working directory. It is best practice for different projects to be saved in different folders, and therefore each has a separate working directory.
To change your working directory, from the drop-down menus click on:
Session >> Set Working Directory >> Choose Directory…
Then select the desired folder.
Alternatively, you can use the setwd() function within a script. Just note that if you change computers, the file directory path will most likely be different.
Your working directory will be displayed in the bottom right window of RStudio when you click on the ‘Files’ tab.
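The working directory can also be checked and changed from code. The sketch below uses the base R functions getwd() and setwd(); the temporary folder from tempdir() is only a stand-in for your own project folder, so that the example runs on any computer.

```r
# Remember where we started so we can go back afterwards.
old_wd <- getwd()

# Stand-in for a project folder; in practice this would be a path on your computer.
demo_dir <- tempdir()
setwd(demo_dir)

# Check that the working directory really changed
# (normalizePath() tidies up the two paths before comparing them).
in_demo_dir <- normalizePath(getwd()) == normalizePath(demo_dir)
in_demo_dir

# Return to the original working directory.
setwd(old_wd)
```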
Importing Data
Data can be saved as many different file types on your computer. These file types are usually distinguished by the extension (typically three or four letters) that follows the period at the end of the file name. Here are some examples of common data file types and the functions you would use to read them in.
Extension | File Type | Package | Import Function |
---|---|---|---|
.csv | comma separated value | tidyverse | read_csv() |
.xls, .xlsx | Excel file | readxl | read_excel() |
.sav | SPSS data file | foreign | read.spss() |
In general, I would suggest using a .csv file. This is advantageous because it can be read by many programs and is not tied to a specific program (like an SPSS or Excel file), which future-proofs your data. Most programs, such as SPSS and Excel, can also read .csv files. Regardless of which file type you are importing, R is expecting the data to be in a certain format. Namely, each variable should be in a separate column, and the variable names should be in the first row. If your data is not in this exact format (e.g., if there is any blank space in your data file), then R may not read it properly. You may need to edit your data file manually to make sure it matches this format (there are ways you can do it in R, but for beginners, it is generally easier to do this bit manually). You can save a data set to your computer in R using the write_csv() function.
Note, you can also read and write .csv files using base R functions read.csv() and write.csv() respectively; however, these functions have some default settings that can cause headaches in the future, so best to avoid these. Also note, the read_csv() function will output some red text that looks like an error message, but this is just R telling you how it interpreted all the variables, and usually means everything is okay.
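To see writing and reading in action, here is a small round-trip sketch. It assumes the readr package is installed (readr is the tidyverse package that provides read_csv() and write_csv(), and is loaded by library(tidyverse)). The data.frame scores and the file name demo_scores.csv are made up for illustration, and the file is written to a temporary folder rather than your working directory.

```r
library(readr)

# Made-up data in the expected layout: one variable per column,
# variable names in the first row.
scores <- data.frame(participant = c(1, 2, 3),
                     score = c(4.5, NA, 6))

# Write the data out to a temporary folder, then read it back in.
demo_file <- file.path(tempdir(), "demo_scores.csv")
write_csv(scores, demo_file)

# show_col_types = FALSE hides the message about how each column was interpreted.
scores_in <- read_csv(demo_file, show_col_types = FALSE)
scores_in
```

The missing value survives the round trip: it is written to the file as NA and read back in as a missing value.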
Remember to download the data for this week’s demonstration and put it in your working directory before running the code below. Note that the class data set is already in the correct format.
#Remember to load the tidyverse library if you haven't already.
#The exact directory the file is in may differ depending on where the data file is on your computer.
data <- read_csv("data_2023.csv")
Once data is loaded, you can view the data.frame by either clicking on the data.frame icon in the work space, or using the View() function in the console. This will open a tab in the top left window, and clicking on this will show you the data.frame. This can be handy to make sure your data has loaded correctly, or to double-check whether the functions used to clean your data have worked as intended.
Calculating Descriptive Statistics
In the lecture series, we covered measures of central tendency and dispersion. In particular, important statistics that you will need to calculate often are the mean and the standard deviation of a continuous variable. Thankfully, there are simple functions in R that can calculate these for you. In both cases, the first argument of the function accepts a numeric vector.
In the following examples, we will calculate the mean and standard deviation for the following numeric vectors:
prime.numbers <- c(1,2,3,5,7,11,13,17,19,53)
prime.numbers.na <- c(1,NA,3,5,7,NA,17,NA,53)
Calculating the Mean
The function that calculates the mean in R is called mean(). The first argument for this function accepts a numeric vector, which is the list of numbers you want to calculate the mean of.
So to calculate the mean for the prime.numbers vector above, we can use the following code:
mean(prime.numbers)
## [1] 13.1
So, the mean for the numbers in prime.numbers is 13.10.
This seems simple enough; however, if we try the same code to compute the mean for the vector prime.numbers.na, R returns NA instead of a number. This is because prime.numbers.na has missing values. R considers any value coded as NA as a missing value, and by default the mean of a vector that contains a missing value is itself missing. Missing values can occur for many reasons; sometimes they are meaningful, but other times they can be ignored. To ignore missing values in the mean() function, we can set the na.rm argument to TRUE:
mean(prime.numbers.na, na.rm = TRUE)
## [1] 14.33333
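As a quick check of what na.rm is doing, the sketch below compares mean() with and without na.rm on the same values, and recomputes the mean by hand after dropping the missing values. The object names are made up for this example.

```r
x <- c(1, NA, 3, 5, 7, NA, 17, NA, 53)

# Without na.rm, the result is NA because some values are missing.
mean_with_na <- mean(x)
mean_with_na

# Count how many values are missing.
n_missing <- sum(is.na(x))
n_missing

# Drop the missing values by hand and compute the mean of what is left.
x_complete <- x[!is.na(x)]
mean_manual <- sum(x_complete) / length(x_complete)

# na.rm = TRUE does the same thing in one step.
mean_na_rm <- mean(x, na.rm = TRUE)
mean_na_rm
```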
Calculating the Standard Deviation
The standard deviation of a vector can be calculated in the exact same way as the mean, except we use the sd() function.
So to calculate the standard deviation for the prime.numbers vector, we use the code:
sd(prime.numbers)
## [1] 15.35108
So the standard deviation of prime.numbers is 15.35.
Similarly, the sd() function returns NA if the vector contains missing values. So to calculate the standard deviation of prime.numbers.na, we use:
sd(prime.numbers.na, na.rm = TRUE)
## [1] 19.74504
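If you are curious what sd() is doing, the sketch below recomputes the sample standard deviation by hand (the square root of the sum of squared deviations from the mean, divided by n - 1) and compares it with sd(). The object names are made up for this example.

```r
x <- c(1, 2, 3, 5, 7, 11, 13, 17, 19, 53)

n <- length(x)            # number of values
deviations <- x - mean(x) # how far each value is from the mean

# Sample standard deviation: divide the sum of squared deviations by n - 1.
sd_manual <- sqrt(sum(deviations^2) / (n - 1))
sd_builtin <- sd(x)

sd_manual
sd_builtin
```

Both values should agree, which confirms that sd() uses the n - 1 (sample) version of the formula.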
Other Descriptive Statistic Functions
Here are some other functions that may come in handy… why not try some of them out? There are many more functions that exist in R. If you ever need to calculate a certain descriptive statistic for a variable, there is probably a function for it, and you can discover what it is with a bit of Google-ing.
Function | Description |
---|---|
max() | Returns the largest value in the vector. |
min() | Returns the smallest value in the vector. |
sum() | Adds all the numbers in the vector together. |
is.na() | Returns a logical vector telling you whether an element is missing. |
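Here is a short sketch of these four functions on a made-up vector with one missing value. Like mean() and sd(), the functions max(), min() and sum() accept na.rm = TRUE.

```r
x <- c(4, 8, NA, 15, 16)

largest <- max(x, na.rm = TRUE)   # largest non-missing value
smallest <- min(x, na.rm = TRUE)  # smallest non-missing value
total <- sum(x, na.rm = TRUE)     # sum of the non-missing values
missing <- is.na(x)               # TRUE where a value is missing

largest
smallest
total
missing

# Because TRUE counts as 1, summing the logical vector counts the missing values.
n_missing <- sum(missing)
n_missing
```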
table() Function
Another useful function is the table() function. This function tells you how frequently each value appears in a vector. This can be handy if you need to quickly see the separate values that exist for a variable, or quickly count how often a value appears (e.g., if you need to work out which value is the mode).
Here is an example of using the table() function with a vector:
catagories <- c("CatA","CatB","CatA","CatC","CatB","CatA")
table(catagories)
## catagories
## CatA CatB CatC
##    3    2    1
So for instance, above, we can quickly see that there are three separate values in the vector catagories, and the one that appears most frequently (i.e., is the mode) is CatA.
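Rather than reading the mode off the table by eye, it can also be pulled out with code. The sketch below is one possible approach using the base R function which.max(); the object names are made up for this example. Note that if two values are tied for most frequent, which.max() only reports the first one.

```r
responses <- c("CatA", "CatB", "CatA", "CatC", "CatB", "CatA")

# Frequency of each value.
counts <- table(responses)

# which.max() gives the position of the largest count;
# names() turns that position back into the value itself.
modal_value <- names(counts)[which.max(counts)]
modal_count <- max(counts)

modal_value
modal_count
```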
We can also use the table() function for variables in data.frames. An example of this is below using the class data and the question about favourite Australian animals. To read the code below, note that the $ symbol indexes which variable we want to look at in the data.frame called data. See the extra materials for more information on indexing.
table(data$aus.animal)
##
##  echidna kangaroo    koala platypus   wombat
##       10       13       24       18       15
So above, we see that there are five separate values (if we ignore missing data), and the most frequent response was koala.
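By default, table() leaves missing values out of the counts. If you want to see how many responses are missing, one option is the useNA argument of table(). The sketch below uses a small made-up data.frame called demo rather than the class data.

```r
# Made-up responses, including two missing values.
demo <- data.frame(aus.animal = c("koala", "wombat", NA, "koala", "echidna", NA))

# Default behaviour: missing values are not counted.
default_counts <- table(demo$aus.animal)
default_counts

# useNA = "ifany" adds a count for NA when there are any missing values.
all_counts <- table(demo$aus.animal, useNA = "ifany")
all_counts

# The number of missing responses can also be counted directly.
n_missing <- sum(is.na(demo$aus.animal))
n_missing
```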
Tidyverse Functions
The tidyverse package provides handy functions for data cleaning and manipulation. Remember, these functions work like all other functions: they take a number of arguments, such as a data.frame and instructions, then give you an output that can be saved as an object, such as another data.frame. In this week’s demonstration, we will cover one function that you can use to calculate descriptive statistics; however, there are many more tidyverse functions. We will cover a few more next week, but for a full list, see the tidyverse cheatsheets, or view the help documentation using help().
All tidyverse functions to do with data cleaning follow the same structure:
function_name(data.frame,instructions)
where the first argument is always the data.frame you wish to perform the function on, and the remaining arguments are instructions on how you wish to perform that function.
summarise()
Functions like mean() and sd() are great when you want to calculate descriptive statistics from a vector, but more often than not, you will need to get these values from a variable in a data set. One way to do this is to use the summarise() function from the tidyverse package.
So how does the summarise() function work? This function takes multiple arguments. The first argument of the function is the data.frame that has the variables you’re interested in. The remaining arguments tell R which descriptive statistics you would like it to calculate (i.e., multiple descriptive statistics can be calculated at once). You can tell summarise() which stats you want it to calculate using the same functions we covered above (e.g., mean(), sd(), etc.). In return, summarise() gives back another data.frame with the descriptive statistics you’ve asked for. Because each statistic becomes a column in that data.frame, you need to provide a new variable name for each one. Each descriptive stat that you want to calculate using summarise() is requested with an additional argument that takes the following form:
variable_name = function(variable)
This may sound confusing, but it helps to see it in action. Below is the code to calculate the mean and standard deviation of the variable stats.exp in the data.frame called data.
summarise(data,
          stats_exp.mean = mean(stats.exp, na.rm = TRUE),
          stats_exp.sd = sd(stats.exp, na.rm = TRUE))
## # A tibble: 1 × 2
##   stats_exp.mean stats_exp.sd
##            <dbl>        <dbl>
## 1           4.29         2.06
A couple of things to note. First, I have broken the above code across several lines. This is just to make it easier to read (remember R ignores white space inside an incomplete command). Second, it doesn’t matter what you name the new variable (but ideally you want a label that is helpful).
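If you do not have the class data to hand, the sketch below runs summarise() on a small made-up data.frame called demo. It assumes the dplyr package is installed (dplyr is the tidyverse package that provides summarise(), and is loaded by library(tidyverse)). The new variable names are arbitrary, and any of the descriptive functions covered above can be used on the right-hand side.

```r
library(dplyr)

# Made-up data with one missing value in each variable.
demo <- data.frame(stats.exp = c(2, 4, NA, 5, 7),
                   stats.anx = c(6, 3, 8, NA, 5))

# Each argument after the data.frame has the form: new_name = function(variable)
result <- summarise(demo,
                    exp_mean = mean(stats.exp, na.rm = TRUE),
                    exp_missing = sum(is.na(stats.exp)),
                    anx_max = max(stats.anx, na.rm = TRUE))
result
```

The output is a data.frame with one row and one column per statistic, so it can be saved as an object and used later like any other data.frame.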
As mentioned, you are not limited in the number of stats you can compute using the summarise() function, so by expanding the code above, we can calculate the means and standard deviations of multiple variables:
summarise(data,
          stats_exp.mean = mean(stats.exp, na.rm = TRUE),
          stats_exp.sd = sd(stats.exp, na.rm = TRUE),
          stats_anx.mean = mean(stats.anx, na.rm = TRUE),
          stats_anx.sd = sd(stats.anx, na.rm = TRUE))
## # A tibble: 1 × 4
##   stats_exp.mean stats_exp.sd stats_anx.mean stats_anx.sd
##            <dbl>        <dbl>          <dbl>        <dbl>
## 1           4.29         2.06           5.59         2.30
Note: An annoying thing about the summarise() function is that sometimes it will not show results to two decimal places (which is the requirement for APA format). To see all the decimal places, you can View() the data.frame produced by the summarise() function.
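As an alternative to View(), the results can be rounded in code. The sketch below is one possible approach, assuming the dplyr package is installed; the data.frame demo is made up, and round() and sprintf() are base R functions. round() gives a number rounded to two decimal places, while sprintf() gives text that always shows two decimal places (including trailing zeros).

```r
library(dplyr)

# Made-up data standing in for the class data.
demo <- data.frame(stats.exp = c(2, 4, NA, 5, 7))

stats <- summarise(demo,
                   exp_mean = mean(stats.exp, na.rm = TRUE),
                   exp_sd = sd(stats.exp, na.rm = TRUE))

# Rounded numeric values.
sd_rounded <- round(stats$exp_sd, 2)
sd_rounded

# Text versions that always show two decimal places.
mean_text <- sprintf("%.2f", stats$exp_mean)
sd_text <- sprintf("%.2f", stats$exp_sd)
mean_text
sd_text
```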