Objectives

The aim of this week is to introduce the lavaan package for Structural Equation Modelling (e.g., path analysis) in R. By the end of this workbook, you should be familiar with:

  1. The basics of the lavaan package.
  2. Running a simple path analysis.
  3. Running a simple confirmatory factor analysis.
  4. Conducting model comparisons.

Resources

Here is a list of resources that may help you with this week's content:

lavaan tutorial
lavaan tutorial (PDF)

Class Data

Click here to download the data used in these workbooks. (You may need to right-click and select the ‘Save Linked File’ option.)

Content

We will be using the following packages in this week’s workbook. If you’ve been following along with the demonstrations, this is the first time we’ve used the lavaan and semPlot packages, so make sure to install them before loading them. The lavaan package includes the functions that allow us to conduct the analyses, while the semPlot package includes functions that allow us to visualise path diagrams.
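If you haven’t installed these two packages yet, running the following once should do it (install.packages() only needs to be run once per machine):

# Install lavaan and semPlot from CRAN (only needs to be done once)
install.packages(c("lavaan", "semPlot"))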

library(lavaan)
library(tidyverse)
library(semPlot)

Before We Continue…

We are about to start using the lavaan package, which is used for path analyses and structural equation models. This workbook is only meant to serve as an introduction to the lavaan package, so we will only discuss some of its basic functions. Structural equation models can be made far more complex than anything we cover here, and if you truly wish to master this package, you will need to go beyond the content covered in this workbook.

Mediation Analysis Through Structural Equation Modelling (SEM)

Mediation is an example of a simple SEM. We will first re-run the mediation analysis we conducted in Demonstration 8, this time using the lavaan package, as this is a handy way of demonstrating the logic of the lavaan package. As a reminder, we predict that the association between exercise attitudes and exercise behaviour is mediated by the intention to exercise.

As with all analyses, we will be following the basic procedure of:

  1. Clean the data for analysis.
  2. Run the statistical test.
  3. Plot the data.
  4. Write up the analysis.

1. Clean the data for analysis.

As always, our first step is to calculate the variables from the individual items.

The code below is identical to the code used for the mediation in Week 8, and calculates exercise attitude, intention to exercise, and exercise behaviour.

data2.clean <- data %>%
  mutate(attitude = exercise.attitude.1 + exercise.attitude.2 + exercise.attitude.3 + exercise.attitude.4 + exercise.attitude.5,
         intention = exercise.intention.1 + exercise.intention.2 + exercise.intention.3 + exercise.intention.4 + exercise.intention.5,
         behaviour = exercise.behaviour.1 + exercise.behaviour.2 + exercise.behaviour.3 + exercise.behaviour.4 + exercise.behaviour.5) %>%
  dplyr::select(student.no,attitude,intention,behaviour) %>%
  filter(!is.na(attitude)) %>%
  filter(!is.na(intention)) %>%
  filter(!is.na(behaviour))
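As an aside, the three filter(!is.na(...)) lines could be replaced by a single call to drop_na() from the tidyr package (loaded as part of the tidyverse). Applied to the already-cleaned data it changes nothing, but it shows the more compact form:

# drop_na() removes any row with a missing value in the listed columns
# (equivalent to the three filter(!is.na(...)) calls above; tidyr is loaded with the tidyverse)
data2.clean <- data2.clean %>%
  drop_na(attitude, intention, behaviour)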

2. Run the statistical test.

Previously, when specifying a regression formula in R, we have used the following form:

y ~ x1 + x2 + x3 + …

An SEM formula is an extension of this, where multiple regression equations (or similar relationships) are estimated at once. As with a regression, we use the ~ to specify a relationship between an outcome (on the left) and predictors (on the right).

In the table below are some other relationships that you may use when constructing an SEM.

operator   relationship type   example              explanation
=~         Latent Variable     y =~ x1 + x2 + x3…   y is measured by x1, x2, x3, etc.
~~         Covariance          x1 ~~ x2             x1 is correlated with x2
:=         Define Parameter    effect := a*b        Estimate ‘effect’, which is ‘a’ times ‘b’

Note: you can also label parameters by placing a label and the * symbol before a variable. If you give two parameters the same label, their estimates will be constrained to be equal (this comes in handy sometimes, but it is not something we will use in this workbook). Note also that the label you give cannot have the same name as a variable in the data.frame (otherwise lavaan will think you are referring to that variable and multiply the two together!).
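As a small, hypothetical illustration of labelling (the variables y, x1, and x2 below are made up and are not in our data):

# Hypothetical example of labelling parameters (y, x1 and x2 are made-up variable names)
label.formula <- '
  y ~ b1*x1 + b2*x2    # label the two regression paths b1 and b2
  diff := b1 - b2      # define a new parameter: the difference between the two paths
'

# Giving both paths the SAME label would instead constrain their estimates to be equal:
# y ~ b*x1 + b*x2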

Below is the formula to conduct a mediation analysis identical to the one we conducted in Workbook 8. See if you can understand each line and what relationship it is specifying before reading the breakdown. Remember, each line of code can be read as relationships between variables in the model.

sem.formula <- '
behaviour ~ c*attitude
behaviour ~ b*intention
intention ~ a*attitude

indirect := a*b
direct := c
total := direct + indirect
'

Let’s break this down:

line                         explanation
behaviour ~ c*attitude       Behaviour is predicted from attitude; this path is labelled ‘c’
behaviour ~ b*intention      Behaviour is predicted from intention; this path is labelled ‘b’
intention ~ a*attitude       Intention is predicted from attitude; this path is labelled ‘a’
indirect := a*b              Define a parameter called ‘indirect’, which is ‘a’ times ‘b’
direct := c                  Define a parameter called ‘direct’, which is equal to ‘c’
total := direct + indirect   Define the total effect (the direct and indirect effects combined)

Essentially, the first three lines of the code are identical to running Model 1, Model 2, and Model 3 from when we tested whether the conditions of mediation were met, while the last three lines are like running the mediate() function to get estimates of the indirect and direct effects.

To run the SEM, we use the sem() function. As with all analyses done previously, we need to specify two things: the formula (which we saved as the object sem.formula) and the data.frame. When printing the results with summary(), we also set a few other arguments: ‘standardized’ to TRUE so that we get standardised estimates, ‘fit.measures’ to TRUE so that we get model fit indices, and ‘rsquare’ to TRUE to get R-square statistics for the model.

med.sem.model <- sem(sem.formula,data2.clean)
summary(med.sem.model,standardized = TRUE, fit.measures = TRUE, rsquare = TRUE)
## lavaan 0.6-18 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                            80
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                                20.891
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -520.013
##   Loglikelihood unrestricted model (H1)       -520.013
##                                                       
##   Akaike (AIC)                                1050.026
##   Bayesian (BIC)                              1061.936
##   Sample-size adjusted Bayesian (SABIC)       1046.170
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value H_0: RMSEA <= 0.050                       NA
##   P-value H_0: RMSEA >= 0.080                       NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   behaviour ~                                                           
##     attitude   (c)    0.591    0.165    3.572    0.000    0.591    0.369
##     intention  (b)    0.177    0.114    1.557    0.119    0.177    0.161
##   intention ~                                                           
##     attitude   (a)    0.327    0.158    2.064    0.039    0.327    0.225
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .behaviour        39.633    6.267    6.325    0.000   39.633    0.811
##    .intention        38.279    6.052    6.325    0.000   38.279    0.949
## 
## R-Square:
##                    Estimate
##     behaviour         0.189
##     intention         0.051
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     indirect          0.058    0.047    1.243    0.214    0.058    0.036
##     direct            0.591    0.165    3.572    0.000    0.591    0.369
##     total             0.649    0.164    3.965    0.000    0.649    0.405

If we compare the results here to the mediation analysis from Week 8, we should see very similar, if not identical, results.
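One difference is that the mediate() function reported bootstrapped confidence intervals. If you want something comparable from lavaan, one option (not required for this workbook) is to re-fit the model with bootstrapped standard errors and then extract the estimates with parameterEstimates():

# Re-fit the mediation model with bootstrapped standard errors (1000 resamples; this may take a moment)
med.sem.boot <- sem(sem.formula, data2.clean, se = "bootstrap", bootstrap = 1000)

# Percentile bootstrap confidence intervals for every parameter, including the 'indirect' effect
parameterEstimates(med.sem.boot, boot.ci.type = "perc")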

3. Plot the model.

The easiest way to plot a structural equation model is with a path diagram. Unfortunately, this is not easy to do in ggplot2. However, we can use the semPaths() function from the semPlot package. This function has the advantage of being easy to use - all you need to do is pass in your model estimated in lavaan and specify which paths you want to see (or hide). The disadvantage is that editing the layout or customising the look of your path diagram can be difficult. Therefore, in order to create publication-worthy graphics, you may want to recreate the diagram in another program.

semPaths(med.sem.model,whatLabel = "std",intercept = FALSE,residuals = FALSE)
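If the default layout is hard to read, semPaths() does have a few arguments you can experiment with; these are optional tweaks, not something you need for this workbook:

# Rotate the diagram, use full variable names, and enlarge the path labels
semPaths(med.sem.model, whatLabels = "std", intercepts = FALSE, residuals = FALSE,
         rotation = 2,        # put predictors on the left and outcomes on the right
         nCharNodes = 0,      # do not abbreviate variable names
         edge.label.cex = 1)  # make the coefficient labels a little larger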

A More Complicated Path Analysis

We will now have a go at conducting a more complicated SEM. We will continue investigating the relationship between attitudes to exercise, intentions to exercise, and exercise behaviour, but also now investigate norms surrounding fitness and perceived behavioural control. The model we will fit is based on the Theory of Planned Behaviour, and is depicted below.

Theory of Planned Behaviour

In words, we predict that attitudes towards exercise, norms about exercise, and perceived behavioural control regarding exercise will each predict intention to exercise. In turn, intention to exercise predicts exercise behaviour. We also predict that perceived behavioural control regarding exercise will directly predict exercise behaviour, and in the model below we additionally allow attitudes to directly predict exercise behaviour.

1. Clean the data for analysis.

First, we must calculate the variables from each of the items, something we have done many times by now.

data2.clean <- data %>%
  mutate(attitude = exercise.attitude.1 + exercise.attitude.2 + exercise.attitude.3 + exercise.attitude.4 + exercise.attitude.5,
         intention = exercise.intention.1 + exercise.intention.2 + exercise.intention.3 + exercise.intention.4 + exercise.intention.5,
         control = exercise.control.1 + exercise.control.2 + exercise.control.3,
         norms = exercise.norms.1 + exercise.norms.2 + exercise.norms.3 + exercise.norms.4 + exercise.norms.5,
         behaviour = exercise.behaviour.1 + exercise.behaviour.2 + exercise.behaviour.3 + exercise.behaviour.4 + exercise.behaviour.5) %>%
  dplyr::select(student.no,attitude,intention,control,norms,behaviour) %>%
  filter(!is.na(attitude)) %>%
  filter(!is.na(intention)) %>%
  filter(!is.na(control)) %>%
  filter(!is.na(norms)) %>%
  filter(!is.na(behaviour))

2. Run the statistical test.

First, we need to specify the model. Below, this model is specified as a formula that lavaan can read:

model <- '
intention ~ attitude + norms + control

norms ~~ attitude
attitude ~~ control
control ~~ norms

behaviour ~ intention + control + attitude
'

To break down the code above:

  Line 1: Attitude, norms, and control predict intention.
  Lines 2 to 4: The covariances between the predictors are specified explicitly.
  Line 5: Intention, control, and attitude predict behaviour.

We then run the SEM model using the sem() function:

sem.model <- sem(model,data = data2.clean)
summary(sem.model,standardized = TRUE,fit = TRUE, rsquare = TRUE)
## lavaan 0.6-18 ended normally after 46 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        14
## 
##   Number of observations                            80
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 2.777
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.096
## 
## Model Test Baseline Model:
## 
##   Test statistic                                76.303
##   Degrees of freedom                                10
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.973
##   Tucker-Lewis Index (TLI)                       0.732
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1218.923
##   Loglikelihood unrestricted model (H1)      -1217.534
##                                                       
##   Akaike (AIC)                                2465.845
##   Bayesian (BIC)                              2499.193
##   Sample-size adjusted Bayesian (SABIC)       2455.046
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.149
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.370
##   P-value H_0: RMSEA <= 0.050                    0.129
##   P-value H_0: RMSEA >= 0.080                    0.821
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.035
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   intention ~                                                           
##     attitude          0.062    0.152    0.410    0.682    0.062    0.043
##     norms             0.436    0.083    5.256    0.000    0.436    0.495
##     control           0.427    0.169    2.531    0.011    0.427    0.262
##   behaviour ~                                                           
##     intention         0.125    0.111    1.130    0.258    0.125    0.114
##     control           0.518    0.195    2.652    0.008    0.518    0.289
##     attitude          0.407    0.173    2.352    0.019    0.407    0.254
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   attitude ~~                                                           
##     norms             4.336    3.549    1.222    0.222    4.336    0.138
##     control           7.390    2.074    3.563    0.000    7.390    0.434
##   norms ~~                                                              
##     control          -1.594    3.143   -0.507    0.612   -1.594   -0.057
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .intention        27.569    4.359    6.325    0.000   27.569    0.684
##    .behaviour        36.430    5.760    6.325    0.000   36.430    0.746
##     attitude         19.069    3.015    6.325    0.000   19.069    1.000
##     norms            51.865    8.201    6.325    0.000   51.865    1.000
##     control          15.185    2.401    6.325    0.000   15.185    1.000
## 
## R-Square:
##                    Estimate
##     intention         0.316
##     behaviour         0.254

Notice that in the summary() function above, we have included the additional arguments to get extra statistics that are helpful when it comes to reporting the analysis.
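If you would rather extract these numbers as data frames for your write-up (for example, to build a results table), lavaan has helper functions that may be useful; this is optional:

# All parameter estimates (including standardised estimates) as a data.frame
parameterEstimates(sem.model, standardized = TRUE)

# Just the fit statistics you are most likely to report
fitMeasures(sem.model, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))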

3. Plot the model.

Again, we use semPaths() to plot our SEM model. Here, the limitation of not being able to easily organise the layout of the plot becomes apparent, though visualising your results may still help your understanding.

semPaths(sem.model,whatLabel = "std",intercept = FALSE,residuals = FALSE)

Model Comparisons

Model comparisons can be used to help decide which model best explains your data with the fewest parameters. As covered in the lecture series, models can only be compared in this way if they are nested (i.e., one model must be contained within the other model). For SEM, model comparisons are particularly useful for determining the significance of individual paths.

In the example below, we will compare the model above with a simpler model, in which the direct paths from control and attitude to behaviour are removed (leaving intention as the only predictor of behaviour). The formula and code for this slightly simpler model are as follows:

model2 <- '
intention ~ attitude + norms + control

norms ~~ attitude
attitude ~~ control
control ~~ norms

behaviour ~ intention
'

sem.model2 <- sem(model2,data = data2.clean)

#summary(sem.model2,standardized = TRUE,fit = TRUE, rsquare = TRUE)

Note: we have not run the summary() function above to save space, but feel free to have a look at it yourself.

Now that we have the two models we want to compare, the function we use is anova(), not to be confused with aov(), which runs ANOVA analyses.

For the anova() function, we simply have to specify the two models we are comparing as the two arguments. This will conduct a chi-square difference test.

anova(sem.model,sem.model2)
## 
## Chi-Squared Difference Test
## 
##            Df    AIC    BIC   Chisq Chisq diff   RMSEA Df diff Pr(>Chisq)    
## sem.model   1 2465.8 2499.2  2.7772                                          
## sem.model2  3 2480.4 2509.0 21.3544     18.577 0.32188       2  9.247e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Above, we find that there is a significant difference between the two models. Therefore, we can conclude that dropping the direct paths from ‘control’ and ‘attitude’ to ‘behaviour’ significantly reduces model fit. As such, we should retain these paths in the model.

While we used the anova() function above to test the difference between nested SEM models, the same function can also be used to compare nested models fitted with other functions, such as lm().
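For example, the same nested-model logic applies to regressions fitted with lm() (a quick illustration using variables we already have):

# anova() can also compare nested regression models fitted with lm()
lm.simple <- lm(behaviour ~ intention, data = data2.clean)
lm.full   <- lm(behaviour ~ intention + control, data = data2.clean)
anova(lm.simple, lm.full)   # F-test for the improvement gained by adding 'control'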

Introducing Latent Variables

Up to now, to calculate variables from individual items, we have simply been adding the items up. However, we could instead model latent variables, which are the underlying variables that each item is meant to tap into. Estimating a latent variable from its items is essentially conducting a Confirmatory Factor Analysis.

For example, we could estimate the latent variable ‘attitude’ from each of the individual attitude items.

attitude.formula <- '
  attitude =~ exercise.attitude.1 + exercise.attitude.2 + exercise.attitude.3 + exercise.attitude.4 + exercise.attitude.5
'

attitude.model <- sem(attitude.formula,data = data)
summary(attitude.model,standardized = TRUE)
## lavaan 0.6-18 ended normally after 28 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        10
## 
##   Number of observations                            80
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 7.767
##   Degrees of freedom                                 5
##   P-value (Chi-square)                           0.170
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   attitude =~                                                           
##     exercis.tttd.1    1.000                               0.741    0.638
##     exercis.tttd.2    1.526    0.286    5.330    0.000    1.131    0.868
##     exercis.tttd.3   -0.206    0.352   -0.586    0.558   -0.153   -0.072
##     exercis.tttd.4    0.889    0.193    4.607    0.000    0.659    0.627
##     exercis.tttd.5    1.331    0.273    4.875    0.000    0.987    0.675
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .exercis.tttd.1    0.800    0.149    5.365    0.000    0.800    0.593
##    .exercis.tttd.2    0.418    0.173    2.425    0.015    0.418    0.246
##    .exercis.tttd.3    4.536    0.718    6.318    0.000    4.536    0.995
##    .exercis.tttd.4    0.671    0.124    5.429    0.000    0.671    0.607
##    .exercis.tttd.5    1.166    0.228    5.118    0.000    1.166    0.545
##     attitude          0.550    0.189    2.906    0.004    1.000    1.000
semPaths(attitude.model,whatLabel = "std",intercept = FALSE,residuals = FALSE)
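As an aside, lavaan also provides a cfa() function, a convenience wrapper with sensible defaults for confirmatory factor analysis; for a simple measurement model like this one, it should give the same estimates as sem():

# cfa() is lavaan's wrapper for confirmatory factor analysis - same model, same results here
attitude.model.cfa <- cfa(attitude.formula, data = data)
summary(attitude.model.cfa, standardized = TRUE)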

How about we compute latent variables for all the variables in the TPB model above? Doing something like this may not make a lot of sense: we will end up with a gangly model, and we may not have enough observations to get sensible results, but we can do it anyway! For demonstration purposes, of course… The code to run this analysis is provided below, but it is not executed here, as the output is quite long and unwieldy.

Specify the Model

model <- '
  attitude =~ exercise.attitude.1 + exercise.attitude.2 + exercise.attitude.3 + exercise.attitude.4 + exercise.attitude.5
  intention =~ exercise.intention.1 + exercise.intention.2 + exercise.intention.3 + exercise.intention.4 + exercise.intention.5
  control =~ exercise.control.1 + exercise.control.2 + exercise.control.3
  norms =~ exercise.norms.1 + exercise.norms.2 + exercise.norms.3 + exercise.norms.4 + exercise.norms.5
  behaviour =~ exercise.behaviour.1 + exercise.behaviour.2 + exercise.behaviour.3 + exercise.behaviour.4 + exercise.behaviour.5

  intention ~ attitude + norms + control
  behaviour ~ intention + control
'

We won’t run the summary() just now because it is very unwieldy, but we have covered the interpretation of the output above.

sem.model3 <- sem(model,data = data)
summary(sem.model3,standardized = TRUE,fit = TRUE, rsquare = TRUE)

Plot the Model

Again, we can plot the SEM using semPaths(), but this gets pretty messy, given the number of items that make up each latent variable.

semPaths(sem.model3,whatLabel = "std",intercept = FALSE,residuals = FALSE)

Exercises

Now that you’ve completed this week’s workbook, why not give this week’s exercises a go? You can download the interactive exercises by clicking the link below.

Click here to download this week’s exercises.