This page contains extra R content not covered in the demonstrations and could be considered supplementary to the module. This content is useful for completing the advanced exercises from Week 6 and focuses on conducting chi-square tests in R. This includes the chi-square goodness of fit test, and the chi-square test of independence.
To conduct chi-square tests in R, we use the chisq.test() function. Unlike most of the statistical analysis functions we’ve looked at so far, this function does not accept a formula and a dataset. Instead, the main argument that this function expects is a contingency table.
To create a contingency table in R, we can use the table() function. We have [briefly talked about the table() function](https://antlee53.github.io/stirpsychstats/2data.html#table()_Function) as a way to view the frequencies for a single variable. As a quick recap, if you wanted to view the frequencies for favourite Australian animals in the class dataset, the code would be:
table(data$aus.animal)
##
##  echidna kangaroo    koala platypus   wombat
##       10       13       24       18       15
However, the table() function can also be used to create a contingency table. A two-variable contingency table will be created from a data.frame that only has two variables in it. Therefore, you can use the select() function (covered in Week 3) to create a data.frame that only includes the two variables that you’re interested in.
So for example, if we wanted to create a contingency table of videogamers vs. non-videogamers across the three programmes in the class dataset, the code would look like this:
select(data,video.games,program) %>%
table()
Alternatively, this could be done by inputting two vectors of variables from the same dataset as arguments. Here is that code:
table(data$video.games,data$program)
##
##       Conversion MSc Health MSc Other Research MSc
##   No              18         10                  4
##   Yes             18         17                 13
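If you don't have the class dataset to hand, the same contingency table can be rebuilt from the counts printed above. This is a minimal sketch: the object name `gamers` is chosen here for illustration, and the column labels are read off the printed output. `addmargins()` and `prop.table()` are base R helpers for adding totals and converting counts to proportions.

```r
# Rebuild the contingency table from the counts printed above.
# `gamers` is an illustrative name, not an object from the class dataset.
gamers <- as.table(matrix(
  c(18, 10,  4,
    18, 17, 13),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    video.games = c("No", "Yes"),
    program     = c("Conversion MSc", "Health MSc", "Other Research MSc")
  )
))

# Add row and column totals to the table
addmargins(gamers)

# Proportion of gamers and non-gamers within each programme
# (margin = 2 means each column sums to 1)
round(prop.table(gamers, margin = 2), 2)
```

Looking at proportions within each programme can make it easier to see what a later test of independence is actually comparing.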
As covered in the lecture series, the chi-square goodness of fit test is used to compare the observed distribution of a single categorical variable with an expected distribution.
The function that performs a chi-square goodness of fit test is the chisq.test() function. It requires two inputs. The first is a numeric vector of observed frequencies. The second is a vector of expected probabilities, one per category, passed through the argument named p. These probabilities must sum to 1.
For instance, if we conducted a study that counted the frequency of 100 people’s favourite colour, and observed 20 people reported “red”, 35 people reported “green”, and 45 people reported “blue”, then the first argument would be:
c(20,35,45)
## [1] 20 35 45
Note: if the variable we are interested in is a variable in a dataset, then, as described above, we can use the table() function to get the frequencies.
If we expect an equal distribution among the three colours, our expected probabilities would be represented as:
c(1/3,1/3,1/3)
## [1] 0.3333333 0.3333333 0.3333333
Altogether, to conduct the chi-square goodness of fit test, we input these vectors into the chisq.test() function:
chisq.test(c(20,35,45),p = c(1/3,1/3,1/3))
##
## Chi-squared test for given probabilities
##
## data: c(20, 35, 45)
## X-squared = 9.5, df = 2, p-value = 0.008652
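To see where these numbers come from, the statistic can also be computed by hand. The sketch below uses the standard chi-square formula (the sum of squared differences between observed and expected counts, each divided by the expected count). The variable names are illustrative.

```r
observed   <- c(20, 35, 45)
expected_p <- c(1/3, 1/3, 1/3)

# Expected counts under the null hypothesis
expected <- sum(observed) * expected_p

# Chi-square statistic: sum of (observed - expected)^2 / expected
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
df <- length(observed) - 1

# Upper-tail p-value from the chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

x2       # 9.5, matching the X-squared value above
p_value  # approximately 0.008652
```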
Following the example from the lecture series, let’s conduct a chi-square goodness of fit test for favourite Australian animals using the class dataset. We want to compare the frequencies in the class dataset with the expected proportions from a national UK poll to see if the class distribution is similar to national rates, or if there’s something different about this cohort.
To conduct this analysis, we enter the frequencies from the dataset as the first argument, and a vector with the expected probabilities as the second argument.
chisq.test(table(data$aus.animal),p = c(.0171,.2222,.4615,.1624,.1368))
##
## Chi-squared test for given probabilities
##
## data: table(data$aus.animal)
## X-squared = 63.706, df = 4, p-value = 4.82e-13
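It can be worth inspecting the expected counts behind a result like this. The object returned by chisq.test() stores them in its `expected` element. The sketch below re-enters the class frequencies by hand from the table() output shown earlier, so it runs without the dataset. The names `animal_counts`, `poll_p` and `animal_test` are illustrative. When an expected count is small (a common rule of thumb is below 5), R typically warns that the chi-squared approximation may be incorrect, so the warning is suppressed here only to keep the output tidy.

```r
# Counts copied from the table(data$aus.animal) output shown earlier
animal_counts <- c(echidna = 10, kangaroo = 13, koala = 24,
                   platypus = 18, wombat = 15)

# Expected proportions from the national poll (same order as the counts)
poll_p <- c(.0171, .2222, .4615, .1624, .1368)

# suppressWarnings() hides the small-expected-count warning for this demo
animal_test <- suppressWarnings(chisq.test(animal_counts, p = poll_p))

# Expected counts under the poll proportions
round(animal_test$expected, 2)

# Which categories have an expected count below 5?
animal_test$expected < 5
```

Here the expected count for echidna is well below 5, which is something to keep in mind when interpreting the p-value.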
To report a chi-square test, you need the following information:
* The chi-square statistic (the test statistic).
* The degrees of freedom.
* The p-value.
Once you have this information, the write-up becomes:
A chi-square goodness of fit test found a significant difference between the class distribution of favourite Australian animals and the expected values based on national rates, chi-square(4) = 63.71, p < .001.
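Rather than copying these values from the console, you can pull them out of the object that chisq.test() returns. The sketch below uses the favourite colour example so that it runs on its own. The names `res` and `report` and the format of the string are illustrative.

```r
# Store the test result instead of just printing it
res <- chisq.test(c(20, 35, 45), p = c(1/3, 1/3, 1/3))

res$statistic  # the chi-square statistic
res$parameter  # the degrees of freedom
res$p.value    # the p-value

# Assemble the statistics for a write-up (format is illustrative)
report <- sprintf("chi-square(%d) = %.2f, p = %.3f",
                  as.integer(res$parameter), res$statistic, res$p.value)
report
```

Note that sprintf() prints a leading zero on the p-value, and very small p-values are conventionally reported as p < .001, so the string may still need a small manual edit.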
The chi-square test of independence is used to determine whether the distribution of frequencies of a categorical DV differs across the levels of an IV.
The chi-square test of independence uses the same function as the chi-square goodness of fit test, but the inputs are different. The function is smart enough to know which test to conduct given which inputs it receives.
If you input a contingency table that has 2 variables, then the function knows to conduct a chi-square test of independence. As described above, contingency tables can be created using the table() function.
As such, if we were to test whether the proportion of video gamers was different across the three programmes in the class dataset, the code looks like this:
c.table <- select(data,video.games,program) %>%
table()
chisq.test(c.table)
##
## Pearson's Chi-squared test
##
## data: c.table
## X-squared = 3.5203, df = 2, p-value = 0.172
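The same result can be reproduced by hand, which shows what the test is doing. In this sketch, the counts are re-entered from the contingency table printed earlier on this page, and the expected count for each cell is (row total × column total) / grand total. Variable names other than `c.table` are illustrative.

```r
# Counts copied from the contingency table printed earlier
c.table <- matrix(
  c(18, 10,  4,
    18, 17, 13),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    video.games = c("No", "Yes"),
    program     = c("Conversion MSc", "Health MSc", "Other Research MSc")
  )
)

# Expected counts if gaming and programme were independent
expected <- outer(rowSums(c.table), colSums(c.table)) / sum(c.table)

# Chi-square statistic summed over every cell
x2 <- sum((c.table - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(c.table) - 1) * (ncol(c.table) - 1)

p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(x2, 4)       # 3.5203
round(p_value, 3)  # 0.172
```

For 2 x 2 tables, chisq.test() applies a continuity correction by default (`correct = TRUE`), so a hand calculation like this one would only match after setting `correct = FALSE`. That does not affect the 2 x 3 table used here.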
To write up a chi-square test of independence, you need the same information as above: the test statistic, the associated degrees of freedom, and the p-value. Altogether, the write-up can look something like this:
A chi-square test of independence did not find a significant difference in the proportion of videogamers across the three programmes, chi-square(2) = 3.52, p = .172.
If you would like to practise the skills on this page, weekly exercise questions on this content are available in the advanced exercises for Week 6. You can download the interactive exercises by clicking the link below.