plssem : A Stata Package for Structural Equation Modeling with Partial Least Squares

We provide a package called plssem that ﬁts partial least squares structural equation models, which is often considered an alternative to the commonly known covariance-based structural equation modeling. plssem is developed in line with the algorithm provided by Wold (1975) and Lohmöller (1989). To demonstrate its features, we present an empirical application on the relationship between perception of self-attractiveness and two speciﬁc types of motivations for working out using a real-life data set. In the paper we also show that, in line with other software performing structural equation modeling, plssem can be used for putting in relation single-item observed variables too and not only for latent variable modeling


Introduction
The traditional statistical techniques (e.g., linear regression, logistic regression, multilevel regression, etc.) are used to estimate models representing the relationship between one or more than one independent variable and a single dependent variable.The independent and dependent variables in these models are all measured using single items such as income, height, weight, length of education and so on.Following this reasoning, we can refer to these traditional statistical approaches as single-equation techniques containing single-item variables both on the left-hand side (dependent) and right-hand side (independent) of the equation.Typically, these methods are employed in the social sciences to explain and predict quantities of interest.
Structural equation modeling (SEM) too can be used for explanation and prediction purposes in the social sciences.The difference, and accordingly the advantage of SEM over single-equation techniques, is that SEM allows for estimating the relationship between a number of independent variables and more than one dependent variable at the same time.Furthermore, while the traditional techniques such as regression analysis lets one only use single-item variables, SEM allows for use of multi-item independent and dependent variables.
As such, in a broader sense, we can refer to SEM as a simultaneous multiple-equation technique estimating models including single or/and multi-item variables on both sides of the equations.This broader definition reflects also the reason why in the course of the past four decades SEM has become probably the most popular statistical estimation technique in the social sciences.The approach to incorporating the multi-item variables in SEM has basically led to the development of two different methods: covariance-based structural equation modeling (COV-SEM) introduced by Jöreskog (1969), and variance-based structural equation modeling (VAR-SEM) proposed by Wold (1975).While in COV-SEM the paths between common factors are examined, in VAR-SEM the paths between weighted composites (replacing the common factors) are estimated.This implies that in COV-SEM, multi-item variables are incorporated into the model using the factor analytic technique whereas in VAR-SEM weighted composites are generated from multi-item variables.
In a nutshell, we can view COV-SEM as the factor-based and VAR-SEM as the componentbased structural equation modeling methods (Chin 1995).COV-SEM and VAR-SEM are commonly referred to in the literature respectively as maximum likelihood SEM (ML-SEM; see for example Bollen 1989; Kline 2016), which is typically associated with software packages such as LISREL (Jöreskog and Sörbom 2015), EQS (Bentler 2008), AMOS (Arbuckle 2014) or Mplus (Muthén and Muthén 2017), and partial least squares (PLS-SEM or PLS-PM; see for example Esposito Vinzi, Trinchera, and Amato 2010), which are instead associated with the software packages SmartPLS (Ringle, Wende, and Becker 2015) or XLSTAT (Addinsoft 2007).
Although there is an ongoing debate as to the strengths and weaknesses of COV-SEM and PLS-SEM in the literature (see for example Rönkkö and Evermann 2013;Henseler, Dijkstra, Sarstedt, Ringle, Diamantopoulos, Straub, Ketchen, Hair, Hult, and Calantone 2014), there still appears to be a general consensus that these two approaches should be considered complementary rather than alternatives to each other.In line with this observation, Hair, Hult, Ringle, and Sarstedt (2017, p. 23) suggest PLS-SEM be used when: • the goal is predicting key target constructs; • formatively measured constructs are part of the structural model; • the structural model is complex including many indicators/constructs; • the sample size is small; • the plan is to use latent variable scores in further analyses.
For more details on the pros and cons of the PLS-SEM approach versus COV-SEM we refer the reader to Hair et al. (2017).
In this paper we present the plssem package for Stata (StataCorp 2017).The aim of the package is to provide an open-source implementation of the PLS-SEM methodology for Stata.
To the best of our knowledge, the only package currently available for fitting PLS-SEM models in Stata is the user-contributed pls package developed by Rönkkö (2016).However, as indicated in the package's documentation, pls is provided for educational purposes since it only calculates composite variables and it does not produce any other output as well as it does not allow for any further postestimation analysis of a PLS-SEM fitted model.Essentially, we started from the pls code as a basis for the development of our plssem package, but then we fully redesigned and enhanced it with numerous additional output and tools for postestimation.
At the time of writing, PLS-SEM is not supported by any of the most popular commercial statistical software packages like SAS (SAS Institute Inc. 2013) or SPSS (IBM Corporation 2017), which only support PLS regression1 .However, many open-source and commercial software packages have been developed independently over the years for fitting PLS-SEM models.Currently, the most widespread open-source implementations of the PLS-SEM methodology are all for the R software (R Core Team 2019), in particular matrixpls (Rönkkö 2017), plspm (Sanchez, Trinchera, and Russolillo 2017) and semPLS (Monecke and Leisch 2012).For what regards the commercial packages, the most popular ones are SmartPLS and XLSTAT-PLSPM 2 .We now provide a brief overview for each of them.A further now dated comparison of PLS path modeling software is also available in Temme, Kreis, and Hildebrandt (2010).
matrixpls in R: The matrixpls package implements a collection of PLS techniques as well as the more recent generalized structured component analysis (GSCA) introduced by Hwang and Takane ( 2004) (for a detailed presentation see Hwang and Takane 2014) and consistent partial least squares (PLSc) techniques as discussed in Dijkstra and Henseler (2015a,b).The variance of the PLS results is estimated using the bootstrap approach (Davison and Hinkley 1997) through the matrixpls.boot()function, which provides the integration with the boot package (Canty and Ripley 2017).matrixpls is the most recent addition in the set of R packages for PLS-SEM and, in contrast to all the other software packages for the same purpose which work with raw data, it calculates the indicator weights and model estimates from data covariance matrices.
The main function, matrixpls(), requires that the model specification is performed by providing the list of user-defined adjacency matrices specifying the association between the different variables.Additional functions for postestimation (predictions, residual analysis, model quality indices) are also provided.No method is provided in the package to deal with observed heterogeneity such as multigroup analysis (MGA).
plspm in R: This is an R package developed by Sanchez et al. (2017) and dedicated to PLS-SEM analysis.The package comes with a number of functions to perform a series of different types of analysis including bootstrapping.The main function has the same name as the package, plspm(), which is designed for running a full PLS-SEM analysis.
The package includes also some accessory functions for plotting and displaying results.Additionally, the function plspm.groups()allows to compare two groups (multigroup analysis) and it offers two options for doing the comparison, a bootstrap t test and a nonparametric permutation test.Finally, the package also includes a set of functions for the detection of latent classes by using the REBUS-PLS approach for uncovering unobserved heterogeneity in PLS-SEM models (Trinchera 2007;Esposito Vinzi, Trinchera, Squillacciotti, and Tenenhaus 2008).As for matrixpls, model specification occurs through user-defined adjacency matrices.A book-length description of the package is provided in Sanchez (2013).
semPLS in R: This is a further package developed by Monecke and Leisch (2012) for structural equation modeling with partial least squares in R. The plsm() function is used to create a valid model specification (a so called 'plsm' object), while sempls() fits the model.Models can be specified by providing the user-defined adjacency matrices.
Bootstrapping is available too by leveraging the boot package (Canty and Ripley 2017) and the calculation of quality indices (R 2 , Q 2 , Dillon-Goldstein's ρ, etc.) is performed via specific methods.However, no method is provided in the package for dealing with observed (e.g., MGA) or unobserved heterogeneity (e.g., REBUS-PLS).A distinctive feature of semPLS is that it is possible to export 'plsm' objects for use with the popular sem package (Fox, Nie, and Byrnes 2017).Similarly, it is possible to import model specification created with SmartPLS with the function read.splsm().
SmartPLS: Now in its third official release (http://www.smartpls.com),it is a stand-alone commercial software supported by a community of scholars centered at the University of Hamburg (Germany), School of Business (Hair et al. 2017), which currently represents by far the most popular and comprehensive software implementation of the PLS-SEM methodology.Model specification is performed by drawing the structural model for the latent variables and by assigning the indicators to the latent variables through an easy to use GUI.SmartPLS provides state of the art PLS techniques for fitting PLS-SEM models including bootstrapping and nonlinear relationships.Both observed and unobserved heterogeneity can be accounted for using several approaches such as MGA and finite mixture (FIMIX) segmentation (Hahn, Johnson, Herrmann, and Huber 2002;Sarstedt, Becker, Ringle, and Schwaiger 2011).Finally, mediation and moderation (interaction effects) analysis are also available, as well as hierarchical component models (second-order models) for fitting more complex structural models.
XLSTAT-PLSPM: XLSTAT is a complete statistical add-in for Microsoft Excel developed by Addinsoft (http://www.xlstat.com).It is structured in modules that provide specialized suites of commands to analyze data in different fields (biomedical sciences, ecology, marketing, psychology, quality control and sensory analysis).XLSTAT-PLSPM is the module that provides the estimation of PLS path models.The package includes all the recent methodological features of the PLS-SEM approach.In particular, it provides bootstrapping but also MGA and REBUS-PLS for dealing with observed and unobserved heterogeneity respectively.
The plssem package for Stata presented in this manuscript includes the following features: • Model specification using an equation-like style.
• Standard and bootstrap based estimation of PLS-SEM models.
• Mediation analysis through estimation and inference (including bootstrap) for up to five indirect effects.
• Moderation analysis through the inclusion of interactions among latent variables in the structural model specification; this provides an implementation of the so called product indicator approach (Sanchez 2013, Section 7.3).
• Possibility to fit models that include equations with binary dependent variables.To the best of our knowledge, none of the existing PLS-SEM software facilitates binary dependent variable estimation using maximum likelihood.
• Multigroup analysis of outer loadings and path coefficients for dealing with observed heterogeneity; in particular, it allows the comparison of an arbitrary number of groups using either normal-based, bootstrap or permutation tests.
• Potential to estimate higher-order construct models (sometimes also, maybe inappropriately, called hierarchical models; see for example Lohmöller 1989, Section 3.5).
• A range of graphical and postestimation commands for representing and inspecting the results of a fitted PLS-SEM model.
The plssem package is available through the Statistical Software Components (SSC) archive3 , often called the Boston College Archive, and it allows to fit various PLS-SEM models.To install the package one needs to execute the command .ssc install plssem which will download and copy all the ado, help and data files for the commands discussed here4 .
The rest of the paper is organized as follows: In Section 2 we present the main technical aspects of the PLS-SEM approach as well as the most common indicators discussed in the literature for assessing the quality of a fitted model.Section 3 provides an introduction to the plssem package.In particular, after discussing the general syntax, we provide a full description of the available options.Moreover, we present the different postestimation commands one can run after fitting a model and the objects that are saved during the estimation.These objects can clearly be used in subsequent analyses.In Section 4 we show some empirical applications of the PLS-SEM approach with the plssem package using two different data sets.Finally, Section 5 provides some closing thoughts and our plans for the future releases of the package.
In the rest of the paper we adopt the same mathematical notation provided by Monecke and Leisch (2012, pp. 9-13).

The PLS-SEM methodology
As PLS-SEM resembles ML-SEM in many ways, it can be explained and illustrated using a slightly adjusted version of the LISREL terminology (Jöreskog, Olsson, and Wallentin 2016) and graphical notation used originally for ML-SEM.As depicted in Figure 1, a typical PLS-SEM model will consist of two parts: the measurement (or outer) and the structural (or inner) models.
x The measurement model provides the relationships between latent variables (or constructs5 ) and the indicators they are defined by.The measurement part is represented in Figure 1 by all arrows apart from those included in the dashed box.The example includes two reflective (i.e., y 1 and y 3 ) and one formative (i.e., y 2 ) construct.The association between the reflective constructs and the corresponding indicators (that is the arrows pointing from the constructs to the indicators) is indicated in the picture by λ 11 , λ 12 , λ 13 , λ 37 , λ 38 and λ 39 , which are also called outer loadings.The relationship between the formative construct and the corresponding indicators (i.e., the arrows pointing from indicators to constructs) are denoted with w 24 , w 25 and w 26 and are also referred to as outer weights.All indicators are congeneric in that none of them loads on more than one construct decided a priori (Brown 2015).The measurement model can be described by an adjacency matrix M whose entries m kj take the value one if indicator x k belongs to the block that defines latent variable y j , and zero otherwise, with k = 1, . . ., K and j = 1, . . ., J. The adjacency matrix of the measurement model for the example shown in Figure 1 is provided in Table 1.Note that the matrix M does not convey any information about whether a construct is measured in a reflective or formative way.
The structural model shows the relationships between latent variables themselves.For the example shown in Figure 1 the structural model is represented by the arrows included in the dashed-line box.Latent variables in the structural model that are used as predictors are called exogenous, while those denoted as outcome variables are called endogenous.In our Table 1: Measurement model adjacency matrix M for the example shown in Figure 1.The elements m kj of the matrix are set to one if indicator x k belongs to the block that defines latent variable y j , and zero otherwise.
Table 2: Structural model adjacency matrix S for the example shown in Figure 1.The elements s ij of the matrix are set to one if the latent variable y i is a predecessor of the latent variable y j in the model, and zero otherwise.
example there are two exogenous (y 1 and y 2 ) and one endogenous (y 3 ) latent variable.The relationships among the latent variables are labeled using the corresponding path coefficients (β 13 and β 23 ).The structural model can also be summarized by an adjacency matrix S whose entries s ij take the value one if the latent variable y i is a predecessor of the latent variable y j in the model, and zero otherwise, with i, j = 1, . . ., J. The adjacency matrix of the structural model for the example shown in Figure 1 is reported in Table 2.Note that matrix S allows to recover the information about whether a latent variable is exogenous or endogenous.More specifically, if the column corresponding to the latent variable y j contains only zeros, that indicates that y j is exogenous.In other words, contrary to the matrix M for the measurement model, S accounts for the directionality of the relationships among the latent variables.
To sum up the description of a PLS-SEM model, the structural part is similar to a regression model, while the measurement part resembles a factor or a principal component analysis.As such, PLS-SEM can be viewed as an advanced multivariate technique facilitating these two analyses at one go.

The PLS-SEM estimation algorithm
The algorithm used to estimate a PLS-SEM model consists basically of three sequential stages6 (Lohmöller 1989).In the first stage, latent variable scores are iteratively estimated for each case.Using these scores, in the second stage measurement model parameters (weights/loadings) are estimated.In the same manner, in the third stage structural model parameters (path coefficients) are finally estimated.The first stage is what makes PLS-SEM a novel method in that the second and third stages, as it will be shown below, are about conducting a series of regression analysis using the ordinary least squares method.To help the reader grasp the whole process, we summarize the procedure for PLS-SEM estimation in Algorithm 1.We provide now more details on each stage.

Stage I -Iterative estimation of latent variable scores
The first stage is an iterative process consisting of the following steps, which are carried out to estimate the latent variable scores: Step 0: Initialization of the latent variable scores.
Step 1: Estimation of the inner weights.
Step 2: Inner approximation of the latent variable scores.
Step 3: Estimation of the outer loadings/weights.
Step 4: Outer approximation of the latent variable scores.
We now give a brief description of these steps.We will denote the data matrix including all the indicators as X, and the block of indicators measuring the jth latent variable y j as X j .
Similarly, we indicate with Y the whole set of latent variables.As it is common in the SEM and PLS-SEM literature, we assume that prior to starting the entire process all indicators are standardized to have a mean zero and unit variance.Additionally, after each step the latent variables are scaled likewise.
Step 0: Initialization of the latent variable scores.In general, we estimate latent variable scores as a weighted sum of the indicators in the corresponding block.In the first step, each latent variable is initialized setting all weights equal to one.In other terms, initially we compute the scores as where M is the measurement model adjacency matrix presented in Section 2, such as that reported in Table 1.
Step 1: Estimation of the inner weights.Inner weights are calculated for each latent variable to reflect how strongly the other latent variables are connected to it.The three most common schemes used for computing the inner model weights are the centroid scheme originally proposed by Wold (1982), and the factorial and path schemes introduced by Lohmöller (1989).We provide below a brief description of these schemes assuming that we collect all the inner weights in a matrix denoted as E.
Algorithm 1 The PLS-SEM estimation algorithm.4: Scale the latent variables scores to have zero mean and unit variance.5: Set the iteration counter to zero (i ← 0) and the maximum relative difference of the outer weights δ to 1 (δ ← 1).6: while δ ≥ tol and i < i max do 7: Estimate the inner weights using either Equations 1, 2 or 3 forming matrix E.

8:
Compute the inner approximation of the latent variable scores as Y = Y E.

9:
Scale the latent variables scores to have zero mean and unit variance.Compute the outer weights as else if y j is in the set of mode B latent variables then 14: Compute the outer weights as Compute the outer approximation of the latent variable scores as where W is a diagonal matrix collecting the estimated weights w j , for j = 1, . . ., J.

18:
Scale the latent variables scores to have zero mean and unit variance.Increase the iteration counter (i ← i + 1).21: end while 22: for j ← 1, J do 23: if y j is in the set of mode A latent variables then 24: Compute the cross loadings as λ cross j = COR(X, y j ).

25:
Compute the outer loadings as else if y j is in the set of mode B latent variables then 27: Compute the outer weights as X j y j .

29:
Compute the structural model parameters (i.e., the path coefficients) as 30: end for Centroid scheme: This scheme produces weights e ij based on the sign of r ij = COR(y i , y j ), the empirical linear correlation coefficient between the latent variables y i and y j resulting from the outer approximation (Step 4 below7 ), assuming they are neighbors.In particular, if y i and y j are adjacent, the weight e ij is set to +1 if the correlation is positive and to −1 if the correlation is negative.If y i and y j are nonadjacent, e ij is set to 0. More formally, for i, j = 1, . . ., J, where c ij is the (i, j)th element of the matrix C = S + S , with S the adjacency matrix of the structural model introduced in Section 2. Thus, C is a symmetric matrix whose element c ij takes value one if the latent variables y i and y j are neighbors in the structural model, and zero otherwise.
Note that, as implied by Equation 1, correlations very close to zero may cause the weights to take a non-zero value during the iterative process, which may lead to instability.Thus, the centroid scheme should be used when the indicators of a block (latent variable) are strongly correlated to each other, otherwise the factorial scheme is usually recommended (Esposito Vinzi et al. 2010).
Factorial scheme: In this scheme the correlation value between each pair of latent variables is directly used as the weight, that is with the same interpretation of the notation as above.
Path scheme: In this scheme two types of weight values are produced depending on the relationship between the latent variables.When a latent variable, say y i , is "causing" another latent variable y j (the so called successor), the weight value corresponds to the linear correlation coefficient r ij = COR(y i , y j ).If instead the latent variable y i is "caused" by another latent variable y j (so called predecessor), the weight is determined using a multiple regression model.In particular, the estimated linear regression coefficient on the predecessor will then be used as the weight.More formally, according to the path scheme the weights are computed as follows where y pred i indicates the set of predecessors of y i and y succ i represents the corresponding set of successors.The coefficient γj corresponds to the estimate of the y j coefficient in the linear regression model assuming y j belongs to the predecessor set of y i .
Step 2: Inner approximation of the latent variable scores.Here, we update the latent variable scores y 1 , . . ., y J obtained in the previous iteration with new ones, y 1 , . . ., y J , which are computed as a weighted sum of their respective adjacent latent variables.More specifically, the inner approximation of the latent variable scores are computed as where the matrix E contains the inner weights as obtained from Step 1.
Step 3: Estimation of the outer loadings/weights.So far we did not make any distinction between reflective and formative measures.On the contrary, we now need to take this difference into account to properly estimate the weights/loadings of the measurement model.
That is, we need to recalculate the latent variables scores obtained from Step 2 using yet another weighting update.The new weights are called loadings when the latent variables are modeled as reflective and just weights when the latent variables are modeled as formative.In the classical algorithm, there are two possible choices for updating the outer weights, which are usually referred to as mode A and mode B. In the marketing literature, mode A refers to a reflective, while mode B refers to a formative measure.
In mode A, we regress each of the indicators onto the corresponding latent variable scores (i.e., using the latent variables included in the indicator block as the predictors of the regression model).Since the latent variables from Step 2 are standardized, the regression coefficients do correspond to linear correlation coefficients, that is In mode B, we regress each latent variable against the indicators in its block.The weights will then correspond to the partial coefficients, that is Step 4: Outer approximation of the latent variable scores.In this step, we estimate the latent variable scores using the weights w j obtained from Step 3 above by computing where W is the matrix that collects all the weights w j , that is .
Step 5: Convergence checking.The process from Step 1 through Step 4 is then repeated until the maximum relative difference between the outer weights from one iteration to the next falls below a given tolerance value chosen by the analyst (e.g., 10

Stage II -Estimation of measurement model parameters
Having estimated the latent variable scores, in the second stage of the PLS-SEM algorithm the loadings for reflective constructs and weights for formative constructs are calculated.These are actually those weights (Equations 5 and 6) at the final iteration.Alternatively, we can use the final latent variables scores ( Y ) predicted after the PLS-SEM estimation to directly compute the loadings, as well as the cross loadings, as the linear correlation between X and Y , and the weights by regressing Y on X.

Stage III -Estimation of structural model parameters
In this stage, using the final latent variable scores, we estimate the structural model parameters (i.e., the path coefficients) for each endogenous latent variable using ordinary least squares according to the PLS-SEM model specified by the researcher.In particular, for each latent variable ( y j ) in the model, the path coefficients are obtained as the regression coefficients on its set of predecessors, denoted below as y pred j , that is COR y pred j , y j .

Bootstrap-based inference
Since PLS-SEM is a distribution-free method, it is not possible in general to get p values and confidence intervals for the model's parameters.For this reason, inference in PLS-SEM is usually conducted by relying on the (nonparametric) bootstrap (Davison and Hinkley 1997).
In the literature on PLS-SEM (see for example Hair et al. 2017), the bootstrap is typically used to estimate the standard errors of the estimated parameters.For example, if one needs to test the null hypothesis that a certain outer weight w is equal to zero in the population versus a two-sided alternative, it is possible to calculate the corresponding test statistic by dividing the weight estimate w based on the original full sample by its standard error estimated using the bootstrap.The test statistic value is then compared with the appropriate t distribution percentile to decide upon the rejection of the null hypothesis.
Bootstrap confidence intervals can be computed as well.Among the many approaches available for finding these intervals, it is usually suggested to use bias-corrected and accelerated bootstrap confidence intervals which adjust for biases and skewness in the bootstrap distribution (Henseler, Dijkstra,

Communality, redundancy, goodness-of-fit, reliability coefficients
Assessment of the model goodness for a PLS-SEM model is rather complicated and not yet properly defined.However, many criteria have been proposed; some of them will be briefly presented below.
In addition to R 2 values, the quality of a PLS-SEM model can be assessed using the redundancy and goodness-of-fit (GoF) indices (Tenenhaus, Esposito Vinzi, Chatelin, and Lauro 2005, pp. 172-173).
To compute the average redundancy, we first need to estimate the average communality, which measures the quality of the measurement model for each latent variable y j , with j = 1, . . ., J.
The communality for block j is computed as the average of the squared correlations between the indicators in the block and the corresponding latent variable, where p j denotes the number of indicators in the jth block and x hj is the hth indicator in the jth block8 .
The average communality is the average of all COR(x hj , y j ) 2 , that is For each endogenous latent variable, redundancy measures the amount of variance in the indicators measuring the variable that is explained by the exogenous latent variables that predict the endogenous variable.For an endogenous block j, it is computed as If more than one endogenous variable is available in the model, then one can also calculate the average redundancy indices for all of the endogenous variables.
Finally, the goodness-of-fit (GoF) index, which takes into account both the measurement and structural model performance, is used for judging the quality of a PLS-SEM model as a whole.
GoF is calculated as the geometric mean of the average communality and the average R 2 : where the average R 2 is computed using only the endogenous latent variables in the model.
A well known theoretical deficiency of PLS-SEM is that it lacks an overall optimization criterion (such as for example the sum of squared residuals in linear regression or the likelihood function in COV-SEM).Therefore, no index for the assessment of the global validation of the model is available.The GoF index represents an operational solution to this problem which is very often used in the practical application of PLS-SEM9 .
In PLS-SEM it is assumed that the block of indicators for a reflective measure is unidimensional in the same sense of factor analysis.To check the unidimensionality of a reflective block, some reliability indexes are typically computed, the most popular ones being the Cronbach's alpha (α j ) and the Dillon-Goldstein's coefficient (ρ j ).When standardized indicators and latent variables are used, these indices are defined respectively10 as and where λ hj is the outer loading for indicator h in block j.Since Cronbach's alpha tends to underestimate the internal consistency reliability, the Dillon-Goldstein's coefficient is often preferred in practice (Chin 1998, p. 320).
For more details on the assessment of PLS-SEM results and rules of thumb for evaluating the quality of a fitted model, we refer the reader to the literature (e.g., an updated and comprehensive survey is available in Hair et al. 2017).

Syntax
The syntax of plssem reflects the measurement and structural part of a PLS-SEM model, and accordingly requires the user to specify both of these parts simultaneously.Since a full PLS-SEM model would include a structural model, i.e., the relationship between latent variables (LV), we need to have at least two latent variables specified in the measurement part.Each latent variable will be defined by a block of indicators (say, indblock).For example, if we have two latent variables in our PLS-SEM model, the plssem syntax requires to specify the measurement part by typing Clearly, one can specify as many LVs as it is needed in the model.The specification of reflective measures in the measurement model require to use the greater-than sign between a latent variable and its associated indicators (e.g., LV1 > indblock1), while the less-than sign needs to be provided when one needs to include latent variables measured in a formative way (e.g., LV1 < indblock1).
One can specify further structural relationships following the same approach.For example, suppose one has two further latent variables in the model, LV3 and LV4, still measured in a reflective way, with LV4 endogenous and LV3 exogenous.Then, the syntax for the structural part should be In addition, in line with most of the Stata commands, we can fit a full PLS-PM model by sub-setting the data directly in the syntax using the if and in qualifiers.
More generally, the syntax for the plssem command is provided by12 [by groupvar :] plssem where square brackets distinguish optional qualifiers and options from required ones, groupvar denotes a variable name in the data set, exp denotes an algebraic expression, range denotes an observation range, and options denotes a list of available options.The optional by prefix causes Stata to repeat a command for each subset of the data for which the values of the groupvar variable are equal.In other words, when prefixed with by, the result of the command will be the same as if one had formed separate data sets for each group of observations, saved them, and then gave the command on each data set separately.The list of available options for the plssem command are illustrated in the next section.

Options
The options allowed by the plssem command are detailed below: wscheme(weighting_scheme) provides the choice of the weighting scheme.The default is path for the path scheme as given in Equation 3. Alternative choices are factorial or centroid for the corresponding scheme.
binary(LV) indicates the latent variables that are defined by a single binary variable.This allows essentially for estimating a model with a binary dependent variable using a logistic regression model.The latent variable LV needs to be specified in the measurement part of the syntax at the same time (e.g., LV > binaryvar) 13 .
boot(#) sets the number of bootstrap replications.
seed(#) sets the seed number for the bootstrap calculations.This option may be useful if reproducibility is one of the analyst's concerns.
tol(#) sets the tolerance value used for checking convergence attainment (see Step 5 in Stage I described in Section 2.1).The default tolerance value is 1e-7.
maxiter(#) indicates the maximum number iterations the algorithm runs.The default is 100 iterations.Note that usually the algorithm requires a very limited number of iterations to reach convergence, typically less than 10.
missing(imputation_method) provides the choice of the imputation method for the indicator missing values.Possible choices are mean (i.e., the mean of the available indicators) or knn (i.e., the kth nearest neighbor method).
k(#) sets the number of nearest neighbors to use with missing(knn).The default number of nearest neighbors is 5.
init(init_method) lets the user choose between two options for initialization.These are indsum 14 (default) and eigen 15 .The eigen option is required if the user wants to estimate only the measurement part of the model 16 .
digits(#) sets the number of decimals to display the model estimates.The default is 3.
13 This is in fact showing how we can work with single indicators using the plssem command.We can include both continuous and dichotomous single indicators in the model by linking them to latent variables in the measurement part of the syntax.Unless any of these latent variables is specified as binary using the binary() option, the structural part will apply linear (regress), otherwise logistic (logit) regression will be used.However, we stress that the same algorithm is used for the measurement part regardless of the nature of the indicators.
14 The initial values in this option are 1s for all of the indicators. 15The initial values (i.e., the weights) in this option are the values associated with the first eigenvector in factor extraction's iterative process.
16 What this initialization does is essentially running separate factor analyses with principal component extraction method (factor, pcf) for each latent variable in Stata.Thus, plssem command can conveniently be used as an alternative to the factor, pcf command as plssem would provide the user with some further estimations (i.e., reliability coefficients and discriminant validity assessment).
noheader suppresses the output header.
nodiscrimtable suppresses the discriminant validity assessment section of the output.
nomeastable suppresses the measurement model section of the output.
nostructtable suppresses the structural model section of the output.loadpval shows the table of loadings' p values.
stats displays some summary statistics (mean, standard deviations, etc.) for each indicator.
group(grouping variable, [suboptions]) provides both the structural and the measurement part of the estimation results for each category of the grouping variable as well as the comparison between the categories based on normal theory (default).As an alternative to normal-based theory estimations, the user can choose between two resampling techniques.More specifically, by adding the suboptions method(permutation) or method(bootstrap) one can get the results based on permutation or bootstrap resampling.The default number of replications for both permutation and bootstrap is 100.However, this can be changed by adding the suboption reps(#).Further, with the suboption groupseed(#) one can also set a certain seed number to be able reproduce the bootstrap or permutation results.Finally, by using the suboption plot one can get a graphical output showing the estimates differences between the groups based on alpha level of 0.05 (default).The significance level can also be changed by adding the suboption alpha(#).
correlate(mv lv cross, cutoff()) lets the user ask for correlations among the indicators or manifest variables (mv), latent variables (lv) as well as cross-loadings (cross) between the indicators and latent variables 17 .When doing so, the user can also set a certain cut-off value for the correlations to be displayed by using the suboption cutoff().For instance, cutoff(0.3)will display the correlations above 0.3 in absolute terms.
rawsum uses the sum of the raw indicators and the resulting aggregated scores (also called summated scales) are used directly for estimating the structural part.In this sense, rawsum is an alternative procedure to the PLS-algorithm for estimating the latent variable scores.
noscale if chosen, the manifest variables are not standardized before running the algorithm.The default is relative.
17 These correlations are computed using the original indicators and estimated latent variable scores.

Postestimation commands
The following are the postestimation commands that can be used after fitting a PLS-SEM model with the plssem command.These commands can basically be categorized under two rubrics, estat and plssemplot.
estat indirect, effects(dep med ind, ...) estimates the specified (standardized) indirect effects and tests the significance of these effects using either the Sobel's z statistic (default) as well as the bootstrap approach18 (Sobel 1982;Baron and Kenny 1986;VanderWeele 2015).The command can estimate up to five different indirect effects at a time.Each of these should be specified by sequentially typing the dependent (dep), mediator (med) and independent (ind) variable from any PLS-SEM model.By adding the suboption boot(#), you can obtain the results based on the bootstrap.To facilitate the reproducibility of results, the suboption seed(#) can further be added to set the seed for the bootstrap calculations.Confidence intervals for the estimated indirect effects are also provided.The default confidence level is 95%, but one can change it by adding the suboption level(#).To change the number of decimals used to display the estimates, one can change the default (3 digits) to another value by adding the suboption digits(#).
estat total produces the decomposition of the total effects in (standardized) direct and indirect effects19 .Adding the suboption plot will generate a bar plot of the effect decomposition.You can here too change the decimals by making use of the suboption digits(#).
estat vif computes the variance inflation factors (VIFs) for the independent variables specified in the structural part of a PLS-SEM model.With the digit(#) suboption, one can change the decimal display.
estat unobshet assesses the presence of unobserved heterogeneity.Currently, the command implements only the REBUS-PLS approach proposed by Trinchera (2007) and Esposito Vinzi et al. (2008).
plssemplot, loadings provides a bar plot of the loadings of indicators for their respective latent variables.
plssemplot, crossloadings provides bar plots of the loadings of indicators for not only their respective but all the other latent variables (i.e., the cross loadings; see line 24 of Algorithm 1).
plssemplot, scores provides the scatterplot matrix of the scores for the latent variables defined in the PLS-SEM model.
plssemplot, stats(LV) provides the scatterplot matrix for the indicators in the block defining the latent variable LV.
plssemplot, innermodel produces a graphical representation of the structural (inner) part of the PLS-SEM model.This command requires the installation of the nwcommands suite20 .
plssemplot, outermodel produces a visualized version of the measurement (outer) part of the PLS-SEM model.This feature is still under development, but will be available soon.
predict, xb residuals creates new variables containing linear predictions (option xb, the default) and residuals (option residuals).These quantities are provided only for reflective blocks of manifest variables in the measurement/outer model and for endogenous latent variables in the structural/inner model.

Stored results
Since plssem is built as a Stata estimation command, many of the results are stored after fitting a model.These objects might be used for further analyses after a model has been fitted.In particular, plssem stores the following objects accessible through the Stata's e() function: • The stored scalar objects are given by: e(N): number of observations.e(reps): number of bootstrap replications.e(properties): choices of initialization, weighting scheme, imputation method, whether the bootstrap has been used, whether the model has a structural part, whether the rawsum option has been used, and whether the manifest variables have been scaled or not.
• The matrix objects saved for later use are: e(assessment): vector of model quality indices, that is the average R 2 , the average communality, the average redundancy and the goodness-of-fit as discussed in Section 2.3.
e(reldiff): vector containing the history of weights' relative differences.
e(imputed_data): matrix of imputed indicators; available only if the missing option has been used.
• Finally, plssem saves a function returning an indicator that marks the observations used for fitting the model; this function is accessible through: e(sample): marks the estimation sample.
Together with the above objects, the plssem command also saves the latent variable scores as new columns in the active data set.These columns are labeled as the latent variables specified in the model syntax.

Additional features
The plssem command is also able to deal with binary latent variables, even when these are used as endogenous in the structural part of the model.This can be achieved by specifying the binary latent variables with the binary() option.In this case, plssem uses the logit command for fitting the logistic regression models having the binary latents as the dependent variable.Even if the corresponding path coefficients cannot be directly compared with those obtained using a linear regression model, for completeness we decided to collect and report all the coefficients in a single table.
The package also has the potential to estimate higher-order construct models entailing higherorder structure (usually second-order) that contains several layers of constructs (Lohmöller 1989, Section 3.5).In particular, one can use the so called repeated indicators approach (Sanchez 2013, Chapter 8) according to which one simply uses the estimated latent variable scores added to the current data set as indicators for the higher-order latent variables.This approach can be easily accomplished with the plssem package.
As a final note, we mention that the current release of the package provides two different approaches to deal with missing values imputation, that is mean and k-nearest neighbors imputation, through the missing() option.Clearly, as most Stata statistical commands do, if the missing() option is not specified, plssem treats missing values by simply disregarding observations with one or more missing values.This trivial approach to missing values is generally known as listwise deletion.We remind that listwise deletion provides unbiased estimates of means, variances and regression coefficients only under the restrictive assumption that the data are missing completely at random (see for example Van Buuren 2012).

Empirical application
In this section we illustrate the use of the plssem package through an example taken from our research agenda.More specifically, we use a real-life data set collected from members of a training/fitness center in 2014 in a medium-sized city in Norway.The members were asked to indicate how well having an attractive face and being sexy described them as a person using an ordinal scale (1 = very badly to 6 = very well).Using a similar scale (1 = not at all important to 6 = very important), the members were also asked to indicate how important each of the following measures was for working out: • to have a good body; • to improve my appearance; • to look more attractive; • to develop my muscles; • to increase my endurance; • to lose weight; • to burn calories; • to control my weight.
Table 3 reports the list of indicators, the variable name in the data set and the corresponding latent construct they measure.

Specification of the PLS-SEM model
Based on relevant evolutionary psychology literature (see for example Markland andIngledew 1997 andKirsner, Figueredo, andJacobs 2003), we propose the following hypotheses:

H1:
The more attractive one perceives herself/himself, the more the person wants to work out to improve her/his physical appearance (i.e., Attractive → Appearance).

H2:
The more the person wants to work out to improve her/his physical appearance, the more s/he wants to work out to build up muscles (i.e., Appearance → Muscle).

H3:
The more the person wants to work out to improve her/his physical appearance, the more s/he wants to work out to lose weight (i.e., Appearance → Weight).

H4:
The more attractive one perceives herself/himself will indirectly influence the person to work out more to build up muscles (i.e., Attractive → Appearance → Muscle).

H5:
The more attractive one perceives herself/himself will indirectly influence the person to work out more to lose weight (i.e., Attractive → Appearance → Weight).
It is usual in SEM-based publications to represent these hypotheses using a path diagram to ease the understanding of the relationships (see for example Jöreskog et al. 2016;Kline 2016).
We do this for our set of hypotheses in Figure 2.

Stored results
plssem stores the following objects in e() after estimating our proposed model.

Multigroup analysis
To demonstrate a further feature of the plssem package, in this section we perform a multigroup analysis based on the model depicted in Figure 2.More specifically, we now check whether the model estimates (path coefficients and loadings) differ between male and female respondents in our sample.As described earlier in the paper, plssem offers two approaches for comparing model estimates across groups: permutation and bootstrap (as well as the standard one based on normal theory).Here, we show the results using the bootstrap option with 200 replications.For reproducibility purposes, we set an arbitrary seed.We also set a significance level (alpha) of 0.1 to display significant path coefficients or loadings in the resulting plot.The grouping variable, women, is a dummy-coded variable in which men are coded as 0.
bootstrapped t test is run based on these estimates.We can conclude that the effect of Appearance on Weight is larger among women than men.This difference is significant at the 0.1 significance level though.The remaining path coefficients are not significantly different between the two groups.Figure 4 reports the plot produced by the above code showing the magnitudes of the differences among the path coefficients.The plot also shows the path coefficients (if any) that are significantly different between groups according to the chosen alpha level by marking them with a star.
The same command also provides the comparison of the model's loadings between men and women represented in Figure 5. None of the loadings is significantly different between men and women.Equality of loadings is indeed an important condition that must be met for establishing measurement invariance before comparing path coefficients of different models.Thus, ideally and as done in real-life research practice, the comparison of the measurement model parameters should precede the comparison of the structural model parameters.

Conclusion
In this article, we introduced the plssem package for estimation of partial least squares structural equation modeling (PLS-SEM).We demonstrated the capabilities of the package using a common and multi-featured empirical application.plssem can as easily be used to estimate more complex PLS-SEM models such as higher-order latent variable models.Future releases of the command will include further more advanced features, in particular we plan to add capabilities for multilevel modeling, more options for missing values imputation and more elaborate approaches for dealing with observed and unobserved heterogeneity.

1 :
Let data X on indicators, measurement and structural model adjacency matrices M and S be given.Choose the latent variables measured in reflective (mode A) and formative (mode B) way.Set the outer weights initial values W old to the zero matrix.Fix the tolerance tol and the maximum number of iterations i max .2: Scale the indicators to have zero mean and unit variance.3: Set the scores initial value to Y = XM .
is in the set of mode A latent variables then 12: convcrit(convergence_criterion) the convergence criterion to use.Alternative choices are relative or square.The former corresponds to max e(iterations): number of iterations to reach convergence.e(tolerance): chosen tolerance value.e(maxiter): maximum number of iterations allowed.e(converged): scalar equal to 1 if convergence is achieved; 0 otherwise.• The stored macros are: e(cmd): this is just the command name, i.e., plssem.e(cmdline): the command as typed.e(estat_cmd): the name of the program used to implement estat.e(predict): program used to implement predict.e(title): title in estimation output.e(mvs): list of manifest variables (indicators) used.e(lvs): list of latent variables used.e(binarylvs): sublist of binary latent variables only.e(datasignaturevars): variables used in calculation of checksum.e(datasignature): the checksum.e(reflective): list of latent variables measured in a reflective way.e(formative): list of latent variables measured in a formative way.e(struct_eqs): equations defining the structural model.

Figure 2 :
Figure 2: The hypothesized PLS-SEM model according to the hypotheses described in Section 4. Attractive, Appearance, Muscle and Weight are the latent variables defined in the model, with Attractive being the only exogenous variable.All the latent variables are measured in a reflective way.

Figure 3 :
Figure 3: Bar chart reporting the outer loadings by blocks.Colors denote different indicator blocks.The dashed line provides a value (i.e., 0.7) frequently used in the literature to assess the quality of the fit.

Figure 4 :
Figure 4: Comparison of path coefficients using multigroup analysis.The statistically significant differences at the given alpha level are highlighted with (*).

Table 3 :
List of indicators collected and latent variables they measure for the empirical application described in Section 4.