Non-Metric Partial Least Squares

: In this paper I review covariance-based Partial Least Squares (PLS) methods, focusing on common features of their respective algorithms and optimization criteria. I then show how these algorithms can be adjusted for use as optimal scaling tools. Three new PLS-type algorithms are proposed for the analysis of one, two or several blocks of variables: the Non-Metric NIPALS, the Non-Metric PLS Regression and the Non-Metric PLS Path Modeling, respectively. These algorithms extend the applicability of PLS methods to data measured on diﬀerent measurement scales, as well as to variables linked by non-linear relationships.


Introduction
Partial Least Squares (PLS) methods encompass a suite of data analysis techniques based on algorithms belonging to the PLS family.These algorithms consist of various extensions of the Nonlinear estimation by Iterative PArtial Least Squares (NIPALS) algorithm, which was proposed by Herman Wold [35] as an alternative algorithm for implementing a Principal Component Analysis (PCA).Wold also proposed using NIPALS to analyze relationships between several blocks of variables [36,38]: this is the PLS approach to Structural Equation Modeling, later called PLS Path Modeling (PLS-PM) [17,29,4].Svante Wold, Herman's son, noticed that the PLS approach could be used in order to implement a component-based regularized regression, called PLS-Regression (PLS-R) [39].
PLS techniques were originally devised to handle data sets forming metric spaces, where variables embedded in the analysis are observed along interval or ratio scales.In this article, variables measured at ratio or interval scale level will be referred to as numeric or metric variables.Unfortunately, in many fields where PLS methods are applied (e.g.genomics, sensory analysis, consumer analysis, marketing) researchers are also interested in analyzing sets of variables measured on a non-metric scale, i.e. ordinal and nominal variables.This article focuses on new methodological proposals to enable PLS techniques able to handle both metric and non-metric variables at once.The next section begins by presenting NIPALS, PLS-R and PLS-PM, before integrating their algorithmic flows and correspondent criteria into a unitary framework.Section 3 explains why PLS methods cannot handle non-metric variables and reviews the remedies for this drawback proposed in the literature to date.Subsequent sections show that PLS iteration can indeed be modified to give PLS-type algorithms, called Non-Metric PLS (NM-PLS), which also work as optimal scaling methods.

A brief overview of PLS algorithms
NIPALS, PLS Regression and PLS Path Modeling (when Mode A option in the estimation process is used, see section 2.3) algorithms share one important feature.Their iterative loops converge to criteria strongly related to the covariance.Given one set of variables, NIPALS loop maximizes the covariance of a weighted aggregate of the variables (called component, or latent variable) with itself (i.e. its variance); given two sets of variables, PLS-R loop maximizes the covariance of a component in a set with another component belonging to the other set; finally, given several sets of variables, PLS-PM loop maximizes the covariance between components belonging to the different sets.The choice to focus on non-metric extensions of PLS algorithms oriented to the maximization of covariance between components, in contrast to PLS algorithms that work on correlation-based criteria, relies on two main reasons.First of all, non-metric extension of a method necessarily requires a parametric expansion that may affect the variability of the estimates.This drawback is greatly reduced by working on covariance-based criteria, which yield more robust estimates.The second reason, as will be shown in the rest of the paper, is that in these frameworks optimal scaling parameter estimates are clearly interpretable.
NIPALS, PLS Regression and PLS Path Modeling algorithms are described in detail below.Since PLS algorithms work on centered variables, variables are assumed to be centered throughout the rest of this section.

PLS approach to PCA: the NIPALS algorithm
The NIPALS algorithm performs a PCA on a set X = [x 1 . . .x p . . .x P ] of variables observed on N individuals.The peculiarity of this algorithm is that it calculates principal components by means of iterated simple ordinary least squares (OLS) regressions.This feature overcomes computational problems caused by missing data and landscape data matrices, i.e. matrices with more columns than rows.
The NIPALS algorithm starts choosing an arbitrary centered vector t (0) (1) , that is a first approximation of the first principal component.Next, a loop which at the s-th iteration (s = 0, 1, . . . ) alternately computes a weight vector w (s) (1) and a score vector t (s+1) (1) is repeated.The loop converges [19] to the first principal component t (1) and to vector w (1) of weights (or loadings), which maximizes the PCA criterion arg max ||w (1) ||=1 var(Xw (1) ). ( It is well known that the eigenvector associated to the dominant eigenvalue of the covariance matrix (1/N )X ′ X is the solution to this problem.Since the dominant eigenvalue of this matrix equals the sum of the squared covariances between the variables and the first principal component, when variables are standardized, criterion (1) can be rewritten as: The second component t (2) is extracted by iterating the same loop on matrix E 1 , which is obtained by deflating X.The deflation step consists of calculating E 1 as the residual matrix of the regression of X-variables on t (1) .In general, for each h (1 ≤ h ≤ H ≤ rank(X)), the NIPALS algorithm maximizes the criterion arg max by implementing the loop described in Step 1 of Algorithm 1 on E (h−1) (with E (0) = X).

PLS approach to multivariate regression: the PLS Regression
Generally speaking, PLS-R involves a set X 2 = [x 12 . . .x p2 . . .x P22 ] of response variables and a set X 1 = [x 11 . . .x p1 . . .x P11 ] of predictors.PLS-R extracts a suite of orthogonal components in the predictor space intended both to explain predictors and to predict response variables.
In the PLS-R algorithm a NIPALS-type loop is iterated to calculate model parameters for each component.Two slightly different algorithms implementing the PLS-R are found in the literature, described in two works by Tenenhaus [28] and Höskuldsson [8] respectively; here Höskuldsson's approach is considered.To compute the first component, the following procedure is repeated upon convergence: proxies of X 2 -scores (t 2(1) ), X 2 -weights (w 2(1) ), X 1 -scores (t 1(1) ) and X 1 -weights (w 1(1) ) are sequentially calculated each one being a function of the previous proxy.The sequences of proxies w (s) 1(1) and w (s) 2 (1) obtained at the s-th iteration converge to the dominant eigenvectors of matrices [8].Therefore, when convergence is reached, vectors w 1(1) and w 2(1) satisfy the criterion arg max The dominant eigenvectors of (1 incide respectively with the dominant right and left singular vectors of matrix (1/N )X ′ 2 X 1 .Therefore, vectors w 1(1) and w 2(1) are also solutions of the criterion arg max This criterion is common to a number of PLS approaches to two-block crosscovariance analysis, such as SIMPLS [9], PLS canonical analysis [38,28] and various approaches, cited in the literature under various names [26,21,34,32], which may be traced back to Inter-Battery Analysis [33].However, these approaches differ between themselves and from the PLS-R in their successive components, which are obtained using different types of deflation.
In PLS-R, both X 1 and X 2 are deflated as a function of t 1(1) , as shown in detail in Steps 2-5 of Algorithm 2, in order to use the second component to explain the portion of variability in responses that is not explained by the first component.Each generic successive component of order h > 1 (1 ≤ h ≤ H ≤ rank(X 1 )) is then calculated by iterating the same loop (Steps 1 of Algorithm 2) on the deflated matrices E (h−1) and F (h−1) respectively (with After extracting a suitable number of components, chosen by cross-validation, a standard regression equation can be calculated in order to explain the responses as a linear combination of the original predictor variables.

PLS approach to Structural Equation Models: the PLS Path Modeling
Structural Equation Models (SEMs) [2] describe and estimate conceptual structures where Q latent variables (LVs) ξ 1 , . . ., ξ q , . . ., ξ Q , linked to each other by a network of linear relationships (usually highlighted by means of a path diagram), are measured by Q sets of manifest variables (MVs) X 1 , . . ., X q , . . ., X Q observed on N observations.Each SEM has two levels of relationships.The first level concerns relationships between LVs (structural model); these relations can be stored in an Q × Q lower triangular binary matrix L, whose generic element l qq ′ is one if ξ q depends on ξ q ′ and zero otherwise.The other level of relations concerns the links between each LV and its own block of MVs (measurement model).According to Lohmöller [17], the PLS approach to SEM is a predictive Path Modeling procedure oriented towards reconstruction of the data matrix, in contrast to the covariance-based approach to SEM by Jöreskog [15], which models relationships between the variables.The PLS-PM estimation procedure is described in detail in Algorithm 3. First, a PLS loop is iterated to obtain LV scores through a system of multiple and simple linear regressions (Steps 1-2).Next, structural (or path) coefficients are estimated through OLS multiple/simple regressions on the LV estimates ξq , according to the path diagram structure (Step 3).In the iterative part of the algorithm (Step 1) outer weights w pq (p = {1, . . ., P q }), linking each MV x pq to the corresponding LV, are estimated by a procedure in which the final estimates of the latent variables are obtained through the alternation of their outer and inner estimations.In the outer estimation phase, a proxy t q for each LV is computed as a linear combination of its own MVs (Step 1.3-1.4).In the inner estimation phase, a proxy z q for each LV is computes as a linear combination of the connected LVs, according to the path diagram structure (Steps 1.1-1.2).In Wold's original algorithm (the one presented in Algorithm 3), inner estimate z (s) ′ < q and is a function of t (s) q ′ if q ′ > q, where q ′ indexes LVs connected to ξ q in the path diagram.Wold's original algorithm [38,6] was simplified by Lohmöller [17].In Lohmöller's algorithm, z q is always a function of t (s) q ′ .This procedure, used in most commercial softwares, is easier to implement, but it takes longer to converge.
The PLS-PM algorithm is extremely flexible.Various options, providing slightly different results, can be used to compute weights for inner and outer estimates of the LVs.There are three ways to calculate the inner weights (e qq ′ ): 1. the centroid scheme (Wold's original scheme), where e qq ′ is equal to the sign of the correlation between t q and t q ′ ; 2. the factorial scheme (the Lohmöller scheme), where e qq ′ is equal to the correlation between t q and t q ′ ; 3. the path weighting scheme, where, for each ξ q : if ξ q ′ is a latent predictor of ξ q , e qq ′ is equal to the coefficient of t q ′ in the regression of t q on the inner estimates of its latent predictors; if ξ q ′ is a latent response variable of ξ q , e qq ′ is equal to the correlation between t q and t q ′ .Two main methods exist to estimate weights in the outer estimation process: Mode A (for outwards directed or reflective blocks) and Mode B (for inwards directed or formative blocks).In Mode B a multiple regression model defines the relation between the latent and manifest variables.The outer weights are thus the regression coefficients of a multiple regression model of the inner estimate of each LV on its own MVs.In Mode A each inner estimate is modeled as a predictor of corresponding MVs.Hence, the outer weights are the coefficients of the simple regressions of each MV on the corresponding LV inner estimate.Therefore, in Mode A PLS-PM, the generic outer weight w pq is a measure of the linear relationship between each x pq and z q , which is a weighted sum of the outer estimates t q ′ of the LVs linked to ξ q .The rest of this discussion focuses on Mode A PLS-PM, i.e. on the case where Algorithm 3 Wold's PLS Path Modeling algorithm.
Step 0: Initialize tq = t (0) q for all q = 1, . . ., Q Step 1: Iteration repeat for all q = 1, . . ., Q do Step 1.1: Computation of the inner weights if q < q ′ then e (s) Step 1.2: Inner estimation of the LVs Step 1.3: Computation of the outer weights Step 2: Computation of the LVs for all q = 1, . . ., Q do ξq = (N 1/2 Xqwq)/ Xqwq end for Step 3: Computation of the Path Coefficients for all j = 1, . . ., J do { Ξ′ →j is the matrix of the latent predictors of the endogenous LV ξj } end for Mode A is used for all the blocks.Mode A PLS-PM does not appear to optimize any criterion: Krämer [10] showed that Wold's Mode A algorithm is not based on stationary equations related to the optimization of a twice differentiable function.However, Tenenhaus and Tenenhaus [31] have recently proposed a new framework, called Regularized Generalized Canonical Component Analysis (RGCCA), where Mode B and a slightly adjusted Mode A, called new Mode A, are unified towards a regularization parameter τ q (0 ≤ τ q ≤ 1), which provides a connection between the two.The new Mode A differs from the original Mode A in normalization constraints: outer estimates are not constrained to unitary variance, while outer weights are constrained to unitary norm (see Algorithm 3).Tenenhaus and Tenenhaus [31] also demonstrated that new Mode A PLS-PM, when centroid or factorial schemes are used, monotonically converges, for each Table 1 Iterative steps of the NIPALS algorithm, PLS-R algorithm (first component) and PLS-PM algorithm with new Mode A option.
normalize w q(1) normalize wq t (1) = Xw (1) t q(1) = Xqw q(1) tq = Xqwq q, to the criterion arg max under the constraint ||w q || = 1.Above, c qq ′ represents the generic element of the binary matrix C = L ′ + L defining the path diagram, and function g(•) depends on the scheme used for the inner estimation of the LVs: it represents the absolute value function for the centroid scheme and the square function for the factorial scheme.Criterion ( 9) can be rewritten as arg max ∀wq q cov(X q w q , z q ) (7) where, for each q, ||w q || = 1 [31].As the next section will show, this criterion can be used to define a unifying framework which includes NIPALS, PLS-R and new Mode A PLS-PM loops, and the corresponding convergence criteria.

A bridge between the three PLS approaches
The algorithmic core of NIPALS, PLS-R and PLS-PM is the iterative process in which weight and score vectors are alternately estimated.Each iteration of a PLS algorithm can be summarized in three steps: a weight estimation step, in which weights are updated; a score estimation step, in which they are linearly combined to build the score vector(s); a normalization step, where the score vector(s) or the weight vector(s) (depending on the algorithm used) are normalized.In Mode A PLS-PM the normalization constraint concerns the score vectors, whereas in NIPALS, PLS-R and new Mode A PLS-PM loops it concerns the weight vectors.
The strong similarity between these last three loops can be observed in Table 1.
In PLS-PM and PLS-R the steps are repeated for each q and q ′ in 1, . . ., Q (in PLS-R Q = 2) with q = q ′ .Moreover, in PLS-PM c qq ′ e qq ′ is null if ξ q ′ is not connected to ξ q ; otherwise, it represents the corresponding inner weight.
In order to optimize criteria (1), ( 5) and ( 7), weights are always worked out in a way that maximizes the squared correlation of the corresponding variables with a latent construct.From now onwards I will refer to this latent construct as Latent Criterion (LC), in contrast to what Hayashi [7] called the outside criterion in his seminal paper on quantification methods.The LC is an unknown vector of order N , centered by construction.For each PLS method different LCs are considered: • In NIPALS, the relevant LC is the first principal component t (1) = Xw (1) .
• In PLS-R the relevant LCs are vector scores t 1(1) = X 1 w 1(1) for the predictor block X 1 and the vector score t 2(1) = X 2 w 2(1) for the response block X 2 .• In (new) Mode A PLS-PM framework, there is one relevant LC for each block of manifest variables X q , i.e. the corresponding inner estimate z q = q ′ c qq ′ e qq ′ X q ′ w q ′ .All of these LCs are then expressed using the generic notation γ = f (w).Through the concept of LC, criteria expressed by equations ( 1), ( 5) and ( 7) can be summed up in one general criterion expressed as a function of γ: under the constraint that ||w q || = 1.Optimization of criterion ( 8) yields the onecomponent NIPALS solution when q ∈ {1}, the one-component PLS-R solution when q ∈ {1, 2} and the new Mode A PLS-PM solution when q ∈ {1, 2, . . ., Q}.

Handling non-metric data in PLS: the statement of the problem
In order to satisfy criterion (8), in NIPALS, PLS-R and new Mode A PLS-PM, when working on standardized variables, optimal weights are calculated as Pearson's product-moment correlation coefficients between each variable and the LC.This leads to two basic assumptions underlying PLS models: • Each variable is measured on a interval (or ratio) scale.
• Relationships between variables and latent constructs are linear and, consequently, monotonic.
Therefore, standard PLS methods cannot handle data measured on a scale which does not have metric properties.
There is a simple way to overcome this problem: replacement each non-metric variable by the corresponding indicator matrix.Most of the softwares currently used to perform PLS analyses use this type of coding in order to handle categorical variables, but it is not a valid solution to the problem considered here, for three reasons: • First of all, a complete disjunctive coding conflicts with the idea of the variable as a whole, because it considers categories as if they were variables in themselves.Any PLS analysis performed on dummy variables yields a weight for each category, rather then for the whole categorical variable.Such weights measure the impact of each individual category on the latent construct, while PLS weights should measure the intensity of the relationship between the original variables and latent constructs.Binary coding makes it impossible to evaluate the importance of the whole variable in the model, or to compare the weight of one variable with the weights of the other variables.
• Secondly, the binary coding inflates the dimensionality of the data matrix, since each categorical variable is transformed into as many binary variables as the number of its categories.As a result, the measurement of a variable's overall impact on the LC is influenced by the number of its categories.Moreover, if the number of categories is very large, binary coding generates sparse matrices.• Finally, the weight of a dummy variable representing a category mainly associated with central values of the corresponding LC score distribution is systematically underestimated, because such a binary variable is linked to the LC by a non-monotonic relationship.
To overcome the drawbacks of binary coding, a new scaling approach has been proposed in a regression framework: the PLS algorithm for CAtegorical Predictors (PLS-CAP) [24,25].This approach was suggested for handling nominal predictors, as well as non-linearity, in a PLS Regression.In a Path Modeling framework, Jakobowicz and Darquenne [14] proposed a modified PLS-PM algorithm, called Partial Maximum Likelihood (PML), that they recommend for cases where some block are made up of both numeric and nominal MVs.Yet no comprehensive approach seems to exist for analysis of variables measured on a variety of measurement scales in a more general PLS framework.

Non-Metric PLS approach
The Non-Metric approach for handling measurement heterogeneity in PLS framework (NM-PLS) is based on the concept of Optimal Scaling (OS).OS has been extensively implemented in multivariate analysis by iterative algorithms belonging to the Alternating Least Squares (ALS) family [16,5]: for this reason, these algorithms are also called ALSOS (Alternating Least Squares approach to Optimal Scaling) [41].The OS principle sees observations as categorical, and represents each observation category by a scaling parameter.This parameter is subject to constraints deriving from the measurement characteristics of the variables.In the ALS approach parameters are divided into two subsets: the parameters of the model and the parameters of the data (or scaling parameters).A loss function is then optimized by alternately optimizing with respect to each subset, keeping the other fixed.NM-PLS algorithms exploit a PLS-type iteration to implement an OS procedure.This leads to a new class of PLS algorithms that generalize standard PLS methods to the treatment of non-metric variables.They are called Non-Metric PLS (NM-PLS) methods [23], because they are able to provide data with a new metric structure, that does not depend on the metric properties of the original data.In other words, NM-PLS methods provides non-metric data with a metric, and provide metric data with a new metric, making relationships between variables and latent constructs linear, as required in standard PLS models.These methods could also be named non-linear PLS methods as well, since they discard the linearity assumption intrinsic to the standard PLS methods.However, the adjective Non-Metric is preferred because it highlights their suitability for data without a metric structure.
In the rest of this article, I will refer to a variable x * which has been observed for N units on a given measurement scale and needs to be provided with a metric as a raw variable.In the OS process a scaling (numeric) value is assigned to each category (or distinct value) φ k (k = 1, . . ., K ≤ N ) of x * , such that • it is coherent with the chosen scaling level; • it optimizes the model criterion.
Each raw variable is transformed as x ∝ Xφ, where φ ′ = (φ 1 , . . ., φ K ) is the vector of optimal scaling parameters and the matrix X defines a space in which constraints imposed by the scaling level are respected.The symbol ∝ means that the left side of the equation corresponds to the right side normalized to unitary variance.
In order to optimize NM-PLS model criteria, for any raw variable x * in the model, the correspondent scaling vector must satisfy the following criterion: Criterion ( 9) is optimized by means of the ordinary least squares regression coefficients of the LC on X, i.e. by projecting γ x * on the space defined by the columns of X.The resulting projection, normalized to unitary variance, is the geometric representation of the scaled variable x.Three levels of scaling are considered here: nominal, ordinal and polynomial.Each level of scaling has a corresponding ad hoc scaling function Q, which is the projection operator of the LC in a suitable space spanned by X-columns.While nominal and ordinal scaling involve the quantification of numerals (i.e.numeric labels with no quantitative meaning), polynomial scaling exclusively addresses non-linearity, as it involves the transformation of a metric raw variable.The following explanations refer to Q as either a quantification or a transformation function, depending on the type of scaling.It is noteworthy that other smoothing procedures involving an orthogonal projection on a predefined basis (a spline basis, for instance) could be used as well.
In nominal scaling, a variable is quantified as the orthogonal projection of the LC γ x * linked to x * on the space spanned by the columns of the indicator matrix Xn ("n" standing for nominal) generated by the K categories of x * : The scaling function Q( Xn , γ x * ) maximizes ( 9) under the grouping constraint that, for each pair of observations i and i ′ , where the symbol ∼ indicates membership of the same category.The resulting scaling values for the categories of x * are the K least squares regression coefficients of Xn on γ x * , which correspond to the averages of γ x * conditioned to categories of x * .The scaled variable contains the LC values predicted by the regression of γ x * on Xn .Ordinal scaling may involve ordinal or metric variables, since it retains the order property of x * .In ordinal scaling the following scaling function [40] is used: where X o ("o" standing for ordinal) is constructed by Kruskal's secondary least squares monotonic transformation [11] of x * .Kruskal's up-and-down block algorithm, also known as the pool-adjacent-violators algorithm, may be implemented in order to obtain Xo [12].The vector of the regression coefficient ( Xo ′ Xo ) −1 Xo ′ γ x * contains the optimal scaling values which preserve the order of the categories of x * , as required by the conditions where the symbol ≺ indicates empirical order.Nominal and ordinal quantification functions provide easily and clearly interpretable scalings, due to the fact that in a PLS framework the weight of a variable is a function of its correlation with the corresponding LC.When function Q( Xn , γ x * ) is used, the relationship between γ x * and x * in terms of linear correlation can be expressed as the Pearson's correlation ratio η γ x * |x * : Since 0 ≤ η ≤ 1, this correlation will never be negative.This implies that the relation between a variable generated by a nominal scaling function and the LC can be interpreted in terms of intensity, but not in terms of sign.This makes sense, as it is a conceptual error to expect a sign in the covariation between a numerical variable (the LC) and a nominal variable, even after providing the nominal variable with a metric: by definition, this type of variable neither increases nor decreases.When using Kruskal's secondary transformation, instead, the weight of the scaled variable indicates the extent to which relationship between x * and the LC approaches monotonicity, as it is closely related to Kruskal's badness-of-fit STRESS index [11] by the following equations: The STRESS index is bounded between zero and one.It expresses the departure of the relationship between x * and γ x * from the assumption of monotonicity.The sign of cor(γ x * , x) depends on the type of monotonic transformation is applied to x * , which can be increasing or decreasing.Therefore, if cor(γ x * , x) = 1 a perfect increasing monotonic relationship exists between γ x * and x * , while if cor(γ x * , x) = −1 this relationships is perfectly decreasing and monotonic.
Advance knowledge of the degree D of a polynomial relationship between a raw numerical variable and the LC makes it possible to use the polynomial scaling.According to Young [41], optimal parameters for the polynomial transformation are found by projecting γ x * on the space spanned by the columns of matrix Xp ("p" standing for polynomial).Matrix Xp is built with one row for each observation and with D + 1 columns, each column being an integer power of the vector x * : Assuming that the raw variable and the LC are linked by a linear relationship, all that is required is to set D = 1.If this is the case for all of the variables, NM-PLS methods provide the same results as standard PLS methods applied to standardized data.It is noteworthy that scaling functions ( 10), ( 12) and ( 15) cannot be directly applied to raw variables, as the LC is unknown by definition.NM-PLS algorithms use PLS iteration in order to overcome this issue: model and scaling parameters are alternately estimated in a modified PLS-type loop with an added quantification step.In standard PLS steps the model parameters are estimated for given scaling parameters, while in the quantification step scaling parameters are estimated for given model parameters and raw variables are suitably transformed into interval variables normalized to unitary variance.
The Non-Metric approach to NIPALS, PLS-R and (new) Mode A PLS-PM algorithms are described in detail in the following sections.

A PLS algorithm for non-metric PCA: Non-Metric NIPALS
Non-Metric NIPALS (NM-NIPALS) performs a non-metric PCA on a matrix X * representing a set of P raw variables observed on N units.In particular, it yields a matrix X = [x 1 . . .xP ] of standardized scaled variables xp ∝ Xp φ p satisfying criterion (2), to which scaling parameters are added.From a mathematical point of view, the NM-NIPALS criterion can be expressed as arg max ∀φ p ,w (1) p cor 2 ( Xp φ p , Xw (1) ) (16) under the constraints w (1) = 1 and var(x p ) = 1.A further restriction is given by matrices Xp , defined in advance for each variable according to the chosen scaling level.For fixed t (1) = Xw (1) , criterion ( 16) is separable with respect to the optimally scaled data for each variable and can be considered as a sum of criteria each of which is a function of the vector φ p only.Therefore, the scaling parameters can be identified by separately maximizing ∀p ∈ {1, . . ., P } arg max This criterion is optimized by the scaling functions Q( Xp , t (1) ) discussed in section 4.
NM-PLSR scalings are optimal in the sense that they optimize the single component PLS-R criterion arg max under the constraints w 1(1) = w 2(1) = var(x p1 ) = var(x p2 ) = 1.It is noteworthy that, since this criterion involves only the first component, it can also be applied to the other PLS approaches to two-block cross-covariance analysis cited in section 2.2.Criterion ( 18) depends on two sets of parameters.The first set consists of the model parameters, constrained to unitary norm.The other set consists of scaling parameters, which must respect restrictions due to the scaling level chosen for each variable and to the standardization constraint applicable to the scaled variables.
For fixed scaling parameters, the NM-PLSR optimization problem becomes arg max Optimal w 2(1) and w 1(1) are respectively dominant right and left singular vectors of matrix (1/N ) X′ 2 X1 (see section 2.2).Therefore, in order to optimize criterion (19), conditions and must be respected.When the other parameters remain fixed, the response scaling parameters must optimize the following criterion: Since var(x p1 ) = 1 and var(t 2(1) ) is fixed with respect to the sum, criterion (22) can be rewritten as arg max This function is separable with respect to each φ p1 , and it can be considered as a sum of P 1 components, each of which is a function of the scaling parameters of one variable only: Since optimizing each criterion ( 24) is equivalent to optimize criterion (22), scaling parameters can be independently calculated as OLS coefficients of the regressions of t 2(1) on each matrix Xp1 .Hence, each optimally scaled predictor xp1 can be calculated as Specular reasoning can be used to find the optimal quantifications for response variables, leading, for each p ∈ {1 . . .P 2 }, to the criterion arg max which is satisfied by This specular treatment is due to the fact that the single component PLS-R model is symmetric.The PLS-R model becomes asymmetric only when successive latent dimensions are computed by deflating both predictors and responses with respect to the components in the predictor space.
In the NM-PLSR algorithm conditions ( 20), ( 21), ( 25) and ( 27) are integrated into a modified PLS-R loop which starts by quantifying each predictor through the quantification functions Q( Xp1 , t 2(1) ) (Step 1.1.1 of Algorithm 5), so as to get a first approximation of the matrix of the quantified predictors X1 .Once this is done, w 1(1) is calculated as a function of X1 and t 2(1) (Step 1.1.2).Then, t 1(1) is computed as a function of X1 and w 1(1) (Step 1.1.3),and X 2variables are quantified by means of the quantification functions Q( Xp2 , t 1(1) ) (Step 1.1.4).The weights w 2(1) are then computed as a function of X2 and t 1(1) (Step 1.1.5).Finally, the vector t 2(1) is computed as a function of X2 and w 2(1) (Step 1.1.6).This loop is repeated until convergence is reached.In each step of the loop, criterion ( 18) is maximized with respect to one of the parameters, keeping the others constant.
Since in each loop of the iterative process the criterion increases, convergence is reached when a maximum is encountered.The MN-PLSR algorithm continues

Non-Metric PLS Path Modeling
The basic principles of the NM-PLSR algorithm can be extended to the PLS approach to predictive Path Modeling, as the single component PLS-R algorithm is the same as the two-block PLS-PM algorithm, except for normalization constraints.In the PLS-PM vocabulary, t 2(1) can be interpreted as the outer estimate of the LV associated with block X 2 (t 2(1) = X 2 w 2(1) ), as well as the inner estimate of the LV underlying X 1 by means of which the outer weights (w 1(1) = X ′ t 2(1) / X ′ t 2(1) ) are calculated.Symmetrically, t 1(1) can be considered simultaneously as the outer estimate of the LV underlying X 1 and the inner estimate of the LV underlying X 2 .These double functions are justified by the inner linear relationship t 2(1) = αt 1(1) .This is a hidden step in the PLS-R algorithm, which, in PLS-PM terms, can be interpreted as: the outer estimate of the LV in one block is used the as inner estimate of the LV in the other block.
In view of these considerations, it can been stated that in the NM-PLSR model, interpreted as a two-block path model, any quantified variable is computed as a function of the inner estimate of the corresponding LV.Therefore, the non-metric extension of the PLS-R algorithm can also be applied to PLS-PM algorithm by adding to the PLS-PM loop a quantification step in which any scaled MV is computed as a function of the inner estimate of the corresponding LV.
The NM-PLSPM algorithm is described in Algorithm 6.It starts by initializing LV outer estimates by means of centered vectors t (0) q .A first approximation of the inner weights (Step 1.1) and of the LV inner estimates z (0) q (Step 1.2) are then computed as functions of t (0) q .The LV inner estimates are in turn used for computing a first quantification x(0) pq of the MVs by means of the functions Q( Xpq , z (0) q ) (Step 1.3).Parameter Xpq of the scaling functions depends on the scaling level defined in advance for each MV.Afterwards, a first estimate of the outer weights is computed as a function of z (0) q and X(0) q (Step 1.4).Finally, a new proxy of LV outer estimates t q is computed by Mode A estimation mode (Step 1.5).This loop is repeated upon convergence.It yields final outer weights, which are used to calculate the final LV scores (Step 2).Path coefficients are then calculated as in the standard PLS-PM algorithm.

The optimizing criterion of Non-Metric PLS Path Modeling
Unfortunately, Mode A NM-PLSPM suffers from the same drawbacks as Mode A PLS-PM.That is, since the criterion to which the Mode A PLS-PM algorithm converges is unknown, it is not possible to state that scalings provided by the Mode A NM-PLSPM algorithm are mathematically optimal with respect to the model.However, this problem disappears if the new Mode A scheme is used, i.e. by implementing a RGCCA with the regularization parameter τ q = 1 for all the blocks [31].
As seen in section 2.3, new Mode A PLS-PM has recently been shown to optimize criterion (7) when a centroid or factorial scheme is used.In the non- under the constraints ||w q || = 1 and var(x pq ) = 1.For fixed z q , criterion (28) consists of a sum of p× q criteria, each of which is a function of a scaled variable.Hence, it can be optimized by separately maximizing ∀p, q arg max by means of the scaling functions Q( Xpq , z q ).This implies that when new Mode A is used, the quantification step 1.3 in Algorithm 6 optimizes model criterion with respect to scaling parameters for fixed model parameters.Therefore, a procedure that alternates the PLS-PM iterative estimation (for fixed scaling parameters) and the quantification step (for fixed model parameters) is expected to converge to a maximum for criterion (7), as in each of the two steps criterion (7) cannot decrease.However, in order to avoid unnecessary computations, in the NM-PLSPM algorithm the quantification step is directly inserted into the PLS loop, exactly as in Algorithm 6.

Missing data
Standard OS algorithms apply different solutions for handling missing data.Gifi [5] distinguishes among three options.In the first option (missing data passive) the row of X is a zero row if the corresponding observation is missing for the variable x * ; this option corresponds to assigning that category a zero scaling value.The second option is called missing data multiple category; it implies that a new category is added, to which all missing terms belong.The third option is the missing data single category; it means that for each missing term a new category is added.All these solutions require advance imputation of missing data, and they can also be implemented in an NM-PLS framework.However, in the PLS framework a fourth option is possible: data pre-handling can be avoided, since PLS algorithms need no advance imputation of missing data.In PLS procedures, each score, as well as each weight, is calculated by means of an iterated sequence of dot products.In the case of missing data, dot products are obtained by pairwise deletion, that is by summing products calculated on available pairs.This characteristic of standard PLS algorithms is retained in NM-PLS quantification procedures, where missing values are left missing and available data are quantified.

An application to macroeconomic data
The data-set for this example consists of 9 macroeconomic variables observed for 47 countries.Three variables measure the inequality of land distribution: • gini: Gini's index of concentration; • farm: complement of the percentage of farmers that own half of the lands, starting with the smallest ones.Thus if farm is 90%, then 10% of the farmers own half of the lands; • rent: percentage of farm households that rent all their land.Two variables measure industrial development: • gnpr: gross national product pro capite (in U.S. dollars) in 1955; • labo: the percentage of labor force employed in agriculture.
Four variables measure political instability: • inst: an index, bounded from 0 (stability) to 17 (instability), calculated as a function of the number of the chiefs of the executive and of the number of years of independence of the country during the period 1946-1961; • ecks: the Eckstein's index, which measures the number of violent internal war incidents during the same period; • death: number of people killed as a result of violent manifestations during the period 1950-1962; • demo: a categorical variable that classifies countries in three groups: stable democracy, unstable democracy and dictatorship.
These data were collected and analyzed by Russett [22] in order to assess the hypothesis that the political instability of a country depends on the inequality of land distribution and the industrial development.Years later, the same data-set was analyzed by Gifi [5], Tenenhaus [28] and Tenenhaus and Tenenhaus [31].Gifi used an optimal scaling technique to address the high degree of non-linearity of data, scaled in such a way as to maximize the canonical correlation between the variables reflecting the inequality of land distribution and the industrial development and the variables reflecting political instability.Indeed, Gifi himself noticed that partitioning data in three sets of variables (agricultural inequality, industrial development and political instability) would have been a more rational approach.Starting from this idea, Tenenhaus [28] analyzed the data-set in a multi-block framework; he addressed non-linearity by approximating Gifi's scalings by means of a priori monotone functional transformations and handled the variable demo by splitting it into three binary variables.However, Tenenhaus [28] pinpointed that approximating Gifi's transformations is not the best choice, while transformations optimized for PLS-PM would be preferable.
Here two approaches are compared.In the first one the standard PLS-PM algorithm is run on the original data1 , and the nominal variable demo is split into  three binary variables.In the second one the NM-PLSPM algorithm is run, in order to obtain quantifications coherent with the path model.The quantitative MVs are transformed at the ordinal scaling level to discard the hard assumption of linearity in favor of a milder assumption of monotonicity, and the categorical MV demo is analyzed at the nominal scaling level.In both the cases, according to Tenenhaus' previous works, Agricultural Inequality and Industrial Development have been modeled as predictors of Political Instability, as shown in Figure 1; the option centroid and both Mode A and new Mode A estimation procedures are used.Since the two estimation procedures provide outputs identical up to the second decimal digit, the outputs presented in Table 2 can be referred to both, and only the differences between the standard and the non-metric PLS models will be commented.The interpretations of the two inner models are similar: industrial development is the main driver for political instability, on which it has a negative impact.As one can expect, agricultural inequality positively impacts on political instability.However, the criterion optimized by the PLS iteration increases from 2.29 to 3.42.The improvement is due to the monotone transformations of the quantitative MVs (shown in Figure 2), and it is reflected in both the inner and the outer models.The LVs Agricultural Inequality and Industrial Development are more strongly related to Political Instability in NM-PLS model than in the standard PLS model, providing a greater explained variance (the R 2 index increases from 0.60 to 0.76).Moreover, the LVs are on average more correlated to the corresponding MVs; the manifest variables rent, inst, ecks and deat show sensibly increased loadings, due to their marked nonlinearity.
The comparison of the two models (Table 2) suggests two further remarks: • In this example the introduction of the scaling parameters did not affect the stability of the estimates and allowed to discover a statistically significant loading for variable inst; • In non-metric approach the loading of the MV demo clearly points out a strong relation with the LV Political Instability.On the contrary, in the standard PLS model the three dummy variables labeled as stab, unstab and dict show quite different loadings, which do not make easy the interpretation of the role of the variable demo as a whole.Indeed, there is a strong relation between Political Instability and all of the binary variables representing the categories of MV demo.However, while relations between binary variables dict and stab and Political Instability are pretty monotone (and so they can be easily detected in a linear framework), the binary variable unstab is non-monotonically related to political instability and consequently its importance is underestimated.This drawback of the binary coding has been already discussed in section 3.

Discussion
Ten years after Herman Wold proposed the Nonlinear Iterative LEast Squares (NILES, later renamed NIPALS) procedure [35], an iterative algorithm based on an Alternating Least Squares (ALS) procedure [16] was proposed by Jan de Leeuw, Yoshio Takane and Forrest W. Young for implementing optimal scaling in additive structure analysis.The authors themselves noticed, in respect to ALS, that "this type of procedure is philosophically much like the NILES/NIPALS procedure developed by Wold and his associates with the distinction that Wold is usually concerned with optimizing only model parameters" [16].Over the last 35 years these two procedures have been developed independently for completely different aims.NIPALS-based algorithms have been developed for implementing a number of well-known two-and multi-block analysis methods (see [30] for a review), as well as two new approaches: a componentbased regularized regression (the PLS-R) and a predictive Path Modeling in a soft modeling framework (the PLS-PM).Meanwhile, ALS has become the most widely used procedure for optimal scaling in joint non-parametric multivariate analysis of non-metric and metric data.A whole system of non-linear multivariate analysis, based on ALS principles, has been developed by the data theory group of Leiden University [5].This article shows that PLS algorithms, correctly adjusted, can also work as optimal scaling algorithms.This new and previously totally unexplored feature of PLS make it possible to devise one, two and multiblock PLS methods with optimal scaling features.NM-PLS methods preserve all the features of standard PLS algorithms, since • they work on landscape matrices; • they handle missing data without the need for advance data imputations; • they do not require hard assumptions about population distribution.
Moreover, with NM-PLS methods it is possible to: • introduce into the models variables observed on different measurement scales; • investigate non linearity; • discard the hard assumption of linearity in favor of a milder assumption of monotonicity.
NM-PLS scalings have been proved to be suitable, coherent and optimal.They are suitable because they respect the relevant constraints depending on which properties of the original measurement scale are to be preserved.They are coherent because the statistical relationship of a latent construct with the associated raw variable is reflected into its statistical relationship with the corresponding quantified variable.Finally, they are optimal because they optimize the criterion of the model in which they are involved.However, NM-PLS quantifications are optimal with respect to the first component, while the scaling parameters optimality in multiple component models is not assured.This is not a relevant drawback in PLS-PM framework, where multidimensional latent variables are rarely investigated, but it represents a restriction in Non-Metric PLS approaches to regression and Principal Component Analysis (PCA), which require multicomponent models in most of the applications.The impossibility to search for quantifications optimal with respect to multi-component models is due to the fact that the PLS algorithms extract components sequentially, by alternating an estimation step and a deflation step.As a consequence, the extraction of each successive component is conditioned to the knowledge of the previous ones.This procedure is efficient in standard PLS algorithms because PLS solutions are nested, but it prevents from ensuring the optimality of the parameters in multi-component non-metric models, as NM-PLS solutions are not nested.NM-NIPALS, NM-PLSR and NM-PLSPM algorithms are implemented in R routines [20] which are available from the author upon request.

Fig 1 .
Fig 1.The path model.Drawing conventions: outward directed arrows joining LVs (in ellipses) to MVs (in boxes) indicate that (New) Mode A estimation mode is used in the algorithm; the arrows linking LVs indicate the dependence flow which specifies the inner model.

Fig 2 .
Fig 2. Variable original values plotted versus corresponding optimal scaling values.

Table 2
[3]ameters of the inner model (path coefficients β) and of the outer model (loadings λ) in standard and Non-Metric PLS path models, with corresponding 95% confidence intervals (C.I.) built by means of 2000 bootstrap samples[3]