Robustifying principal component analysis with spatial sign vectors

In this paper, we apply orthogonally equivariant spatial sign covariance matrices as well as their aﬃne equivariant counterparts in principal component analysis. The inﬂuence functions and asymptotic covariance matrices of eigenvectors based on robust covariance estimators are derived to compare the robustness and eﬃciency properties. We show that especially the estimators that use pairwise diﬀerences of the observed data have very good eﬃciency properties providing practical robust alternatives to classical sample covariance matrix based methods.


Introduction
Principal component analysis (PCA) is widely used for finding lower dimensional structure in data, and is commonly applied to high-dimensional data. PCA represents data by a smaller number of components that account for the variability in the data. This dimension reduction step can be followed by other multivariate methods, such as regression, discriminant analysis, cluster analysis, etc.
In classical PCA the sample mean and the sample covariance matrix are used to derive the principal components. These two estimators are highly sen-sitive to outlying observations, and classical PCA becomes unreliable, when outliers are encountered. Several robust competitors to classical PCA estimators have been proposed in the literature. A natural way to robustify PCA is to use robust location and scatter estimators instead of the sample mean and sample covariance matrix when estimating the eigenvalues and eigenvectors of the population covariance matrix. See e.g. Devlin et al. (1981). Locantore et al. (1999) and Marden (1999) proposed a simple robust alternative to PCA. In spherical PCA, the data are first transformed to multivariate spatial sign vectors. These are simply defined to be the vectors of unit length that point into the direction of original data points. The sample covariance matrix of spatial sign vectors, that is, the spatial sign covariance matrix (Visuri et al., 2000), is then used to obtain the eigenvectors of interest.
As spherical PCA uses only information on directions of the data points, its efficiency properties are poor especially under the multinormal model. To obtain a more efficient estimator, the symmetrised spatial sign covariance matrix (Visuri et al., 2000) can be used. This estimator is defined as the spatial sign covariance matrix based on pairwise differences of the observed data. The use of symmetrised spatial sign covariance matrix for PCA was discussed in Visuri et al. (2000), but its theoretical properties have not yet been studied.
In this paper, we compare different spatial sign covariance matrices as well as their affine equivariant counterparts, namely, the Tyler's shape matrix (Tyler, 1987) and the Dümbgen's shape matrix (Dümbgen, 1998) in a PCA framework. We assume that the number of observations in data is greater than the number of variables, so n > p. As shown by Tyler (2010), the estimation of robust affine equivariant scatter matrices in case p > n is meaningless as resulting scatter matrices will be proportional to the sample covariance matrix. Notice that the spatial sign covariance matrices are not affine equivariant, and can be easily computed for data with more variables than observations. Applications are given for example in Locantore et al. (1999) and Gervini (2008). The paper is organised as follows. In Section 2, PCA based on different spatial sign covariance matrices is considered. In Section 3, influence functions and asymptotic efficiencies are studied. We show that symmetrised estimators, in particular, have very good efficiency properties thereby providing practical and robust alternatives to classical sample covariance matrix based methods. The paper concludes with some final remarks in Section 4. 2 2. PCA based on sign covariance matrices

Notation
Let x be a random vector from a p-variate distribution with symmetry center µ and positive definite and symmetric p × p (P DS(p)) covariance matrix Σ. The spectral decomposition of the population covariance matrix will be denoted by Σ = ΓΛΓ T , where Γ is an orthogonal matrix containing the eigenvectors γ 1 , . . . , γ p of Σ as columns and Λ is a diagonal matrix with corresponding eigenvalues λ 1 > · · · > λ p > 0 as diagonal values. We may also write where Det(Σ) is the Wilks' generalized variance, and Λ Σ is a diagonal matrix of standardized eigenvalues The geometric mean of the standardized eigenvalues thus equals one. The parameters [Det(Σ)] 1/p , Λ Σ and Γ define the scale, shape and orientation of the data cloud. See for example Paindaveine (2008) and Frahm (2009).
Let F x be the cumulative distribution function of x. Then C(F x ) is a scatter matrix functional if it is P DS(p) and affine equivariant in the sense that where A is any nonsingular p × p matrix and b is any p-vector. We then write for the spectral decomposition of C(F x ), where P C = P C (F x ) is an orthogonal matrix that contains the eigenvectors p C,1 , . . . , p C,p of C(F x ) as columns, and Λ C = Λ C (F x ) is a diagonal matrix of the corresponding eigenvalues arranged in decreasing order λ C,1 > · · · > λ C,p . We call V (F x ) a shape matrix functional if it is P DS(p) with Det(V ) = 1 and affine equivariant in the sense that 3 The spectral decomposition of V (F x ) can be written in a similar fashion to that of C(F x ). For a detailed discussion of scatter matrix and shape matrix functionals, see Oja (2010) and references therein. Finally, notice that in robustness and efficiency studies we assume the elliptical model F , that is, the density function of a random vector x ∼ F is of the form where f 0 (z ′ ) = exp{−ρ(||z ′ ||)} and function ρ defines the distribution of the standardised variable z ′ = Σ −1/2 (x−µ) (in this paper, the square root matrices Σ −1/2 and Σ 1/2 are taken to be symmetric). To fix Σ, we assume that ρ is such that E[||z ′ || 2 ] = p. Notice also that under the elliptical model, different shape matrices estimate the same population quantity V = [det(Σ)] −1/p Σ. The orthogonal transformation z = Γ T (x − µ) defines variables from a centered distribution with covariance matrix Λ. Throughout the paper this distribution is denoted by F Λ .
2.2. PCA based on spatial sign covariance matrices Locantore et al. (1999) and Marden (1999) introduced a simple robust alternative to classical principal component analysis based on spatial sign vectors. Write the spatial sign function as where ||x|| denotes the Euclidean norm of x. The spatial sign function thus projects data points onto the unit sphere making resulting estimators highly robust. The population spatial sign covariance matrix is defined as where µ(F x ) is the population spatial median that solves E[S(x − µ(F x ))] = 0. Recall that the spatial median is a higly robust rotation equivariant estimator of symmetry center µ. It has a 50% breakdown point and a bounded influence function. A fast algorithm for computing spatial medians is given in Vardi and Zhang (2001).
In spherical principal component analysis, the matrix of eigenvectors, P S = P S (F x ), and the diagonal matrix Λ S = Λ S (F x ) of the eigenvalues of SCov(F x ) are computed as in (3). The robust principal component scores are then given by z = P S T (x − µ(F x )). Croux et al. (2002) examined efficiency properties of the eigenvectors based on the spatial sign covariance matrix, and showed that especially for multivariate normal models, the efficiencies are very low. This is due to the fact that the spatial sign covariance matrix only uses information on the directions of the data points. Motivated by these results, in the following we consider principal component analysis based on more efficient covariance estimators.
Define the population symmetrised spatial sign covariance matrix as (Visuri et al., 2000). Since SSCov(F x ) is based on pairwise differences, it is computed without the need for the location functional. Via pairwise differences one also uses information on distances of the data points from the center of data. This feature makes the symmetrised spatial sign covariance matrix more efficient than the spatial sign covariance matrix. The eigenvectors and eigenvalues are now obtained from the spectral decomposition of SSCov(F x ) and denoted by P SS = P SS (F x ) and Λ SS = Λ SS (F x ). The principal component scores can be computed as before using e.g. spatial median. The next theorem summarizes the relationship between the eigenvectors and eigenvalues of the population covariance matrices (6) and (7) with those of Σ. A proof is given in the Appendix.
Theorem 1. Let F be an elliptical distribution with location center µ and covariance matrix Σ = ΓΛΓ T . Then u = (u 1 , . . . , u p ) T is uniformly distributed on the unit sphere, and λ Σ,i s are the standardised eigenvalues of Σ as defined in (1).
Under the elliptical model, the eigenvectors of the population spatial sign covariance matrices therefore coincide with those of Σ and the eigenvalues are related via (9). The relationship between the eigenvalues implies that it is possible to obtain an algorithm for computing estimates for Λ Σ starting from Λ S : Let Λ (0) Σ be some initial value for Λ Σ , e.g. Λ (0) Σ = I p . Then the iteration step suggested by (8) is The result is scaled so that det(Λ (k+1) Σ ) = 1. In practice, an estimate for Λ Σ is obtained by replacing the above expectation by the sample average.
In Visuri et al. (2000) statistical properties of spatial sign covariance matrices were studied. Corresponding matrices are not genuine scatter matrices as they satisfy (2) only when A is an orthogonal p×p matrix. However, in the context of PCA, this orthogonal equivariance property suffices, since it yields principal component scores that are invariant under orthogonal and location transformations. The main advantage of spatial sign covariance matrices is that they are very easy and fast to compute in practice.

PCA based on affine equivariant spatial sign covariance matrices
Hettmansperger and Randles (2002) proposed affine equivariant counterparts of spatial median and spatial sign covariance matrix. In their approach, the original data points are first transformed to y = V (F x ) −1/2 (x − µ(F x )), where µ(F x ) is a p-vector and V (F x ) is a shape matrix. Then the affine equivariant counterparts of the spatial median and spatial sign covariance matrix are those µ(F x ) and V (F x ) that are found as the solutions of the following M-estimation equations E[S(y)] = 0 and p E[S(y)S(y) T ] = I p .
The resulting functional V (F x ) is known as Tyler's (1987) shape matrix and the location functional µ(F x ) is the transformation retransformation spatial median using Tyler's shape matrix (Hettmansperger and Randles, 2002). In the following we will denote V (F x ) = SCov 2 (F x ). The affine equivariant counterpart of the symmetrised spatial sign covariance matrix is obtained similarly to the above: this time we use pairwise differences of transformed observations, that is, 6 The resulting functional V (F x ) is known as Dümbgen's (1998) shape matrix, which we denote by SSCov 2 (F x ). Note that, since Dümbgen's estimator uses differences of transformed observations, it avoids estimation of the location parameter. As in (3), the spectral decomposition of V (F x ) can be used to compute the eigenvectors and eigenvalues of different shape matrices, and the robust principal component scores. It then follows from (4) that under the elliptical model, SCov 2 (F x ) = ΓΛ Σ Γ T and similarly for SSCov 2 (F x ). The eigenvectors and eigenvalues of SCov 2 (F x ) and SSCov 2 (F x ) therefore coincide with the eigenvectors and standardised eigenvalues of Σ as defined in (1). Robustness and asymptotic properties of affine equivariant spatial sign covariance matrices were studied in Tyler (1987), Dümbgen (1998), Dümbgen and Tyler (2005) and Sirkiä et al. (2007) among others. Corresponding estimates can be computed using simple iterative algorithms based on estimating equations (Hettmansperger and Randles, 2002;Oja and Randles, 2004). In the next section we compare different spatial covariance matrices in a PCA framework.

Influence functions
The influence function measures the robustness of a functional T against a single outlier x (Hampel et al., 1986). Let F ǫ = (1−ǫ)F +ǫ∆ x denote a contaminated distribution, where ∆ x is the cdf of a distribution with probability one at point x. The influence function of T is defined by

A continuous and bounded influence function indicates good local robustness properties of an estimator.
Write C(F x ) for a scatter matrix functional decomposed as in (3). Croux and Haesbroeck (2000) showed that under an elliptical distribution F with distinct eigenvalues of the covariance matrix Σ, the influence function of the eigenvector functional p C,j is where γ k is the k-th eigenvector of Σ as defined in Section 2.1. Influence functions in case of multiple multiple roots are studied in Tanaka (1988).
To derive influence functions for eigenvectors based on orthogonally equivariant SCov(F x ) and SSCov(F x ), we proceed as in the proof of Theorem 1 and use the orthogonal transformation z = Γ T (x − µ).
Theorem 2. Let x be a random vector from an elliptical distribution F with parameters µ and Σ. Then the influence function of j-th eigenvector functional p S,j based on SCov(F x ) is given by where z, z 1 and z 2 are independent observations from F Λ and λ S,j s are defined in (9).
To compare the influence functions of the theorem, we plot the norm of the influence function of the first eigenvector for different spatial covariance matrices and for the usual sample covariance matrix using a simple bivariate normal model with covariance matrix Σ = diag(2, 1). The influence function of the first eigenvector based on the sample covariance matrix is computed using (14) below. Figure 1 shows that the eigenvector based on SSCov(F x ) (right plot) has some very desirable properties: At the centre of the distribution, the shape of the influence function is similar to that of the sample covariance matrix (left plot), but points far away from the centre have no influence on the estimator. In comparison, SCov(F x ) (middle plot) downweights the outliers, but at the centre of the distribution the shape of  Figure 1: Norm of the influence function of the first eigenvector for the sample covariance matrix (left), the spatial sign covariance matrix (middle) and the symmetrised spatial sign covariance matrix (right) with F = N (0, diag(2, 1)).
the influence function is different from that of the sample covariance matrix indicating poor efficiency properties.
We next derive the influence functions of eigenvector estimators based on SCov 2 (F x ) and SSCov 2 (F x ). Due to affine equivariance, the corresponding influence functions are easily obtained using the transformation z ′ = Λ −1/2 Γ T (x − µ) = ru, where r = ||z ′ || and u = r −1 z ′ , and r and u are independent. At elliptical F , the influence functions of all shape matrices V (F x ) are of the form where α V (r) is a real-valued weight function that depends on V (F x ) and the underlying spherical distribution (Taskinen et al., 2006). Combining (12) and (13), the influence function of the eigenvector functional p V,j based on any affine equivariant estimator V (F x ) is thus and comparisons between different estimators may be based on α V (r) functions only. For SCov 2 (F x ) and SSCov 2 (F x ) these functions are derived in Tyler (1987) and Sirkiä et al. (2007). We denote them as α S 2 (r) = p + 2 and α SS 2 (r) = 2(p + 2)(1 − pg(r)), where g(r) = E z ′ 1 (z ′ 2 12 /||z ′ 1 − re 1 || 2 ), with e 1 = (1, 0, . . . , 0) T . Note that α S 2 (r) is constant in r, thus making the estimates based on SCov 2 (F x ) highly robust, but not very efficient. As shown in Sirkiä et al. (2007), α SS 2 (r) is continuous and bounded, whereas for the sample covariance matrix, α Cov (r) = r 2 , which makes the corresponding estimates highly sensitive to outliers.

Asymptotic distributions and efficiencies
Let X = (x 1 , . . . , x n ) T be n × p data matrix. We denote byĈ the estimator associated with the functional C(F ), that is,Ĉ = C(F n ), where F n is the empirical distribution function based on X. The asymptotic behaviour of the eigenvector estimators based on different sign covariance matrices can be derived as in Anderson (1984) using the asymptotic normality of the corresponding covariance matrix estimators. As before, in the following theorem we assume that Σ has no multiple roots. For the limiting distribution of a Σ with multiple roots, see Anderson (1963). The vec-operator below vectorizes a matrix by stacking the columns of the matrix on top of each other. is that of a weighted sum of independent chi-square random variables each with one degree of freedom.
Since SCov and SSCov are asymptotically normal under an elliptical F (see eg. Visuri et al. (2000)), the limiting normality of the corresponding eigenvector estimators follows from Theorem 3.
Corollary 1. Let F be an elliptical distribution with parameters µ and Σ. Then √ n(p S,j − γ j ) and √ n(p SS,j − γ j ) have limiting normal distributions with zero mean and asymptotic variances (ASV) where z 1 and z 2 are independent observations from F Λ , u = (u 1 , . . . , u p ) T is uniformly distributed on the unit sphere and the λ S,j 's are defined in (9).
The asymptotic distributions of SCov 2 and SSCov 2 under an elliptical F are derived for example by Tyler (1987) and Dümbgen (1998). The limiting normality of the corresponding eigenvector estimators thus follows.
Corollary 2. Let F be an elliptical distribution with parameters µ and Σ, and writep V,j for the eigenvector estimate based on any shape matrix esti-matorV . Then √ n(p V,j − γ j ) has a limiting normal distribution with zero mean and asymptotic variance The α V (r) functions for SCov 2 (F x ) and SSCov 2 (F x ) are given in (15).
To compare efficiency properties of estimators, we proceed as in Croux et al., (2002). The asymptotic relative efficiencies (ARE) of eigenvector estimators based on the scatter (or shape) estimatorĈ 1 relative to the eigenvector estimators based onĈ 2 are then computed using In Table 1, we list the asymptotic relative efficiencies for the first eigenvector estimates based on different spatial sign covariance matrices with respect to estimates based on sample covariance matrix. The efficiencies for SCov and SSCov are computed at different p-variate t-distribution cases with Σ = diag(p, p − 1, . . . , 1) and selected values of degrees of freedom ν with ν = ∞ referring to the multivariate normal case.
Notice that when efficiencies of eigenvector estimators based on two affine equivariant shape estimates,V 1 andV 2 , are compared, (16) reduces to where functions α V i , given in (13), depend only on corresponding shape matrices as well as the underlying spherical distribution. The efficiencies of eigenvector estimators are therefore given by the efficiencies of corresponding shape matrices. When computing the efficiencies for SCov 2 and SSCov 2 , we thus use spherical p-variate t-distributions with Σ = I p . The results are listed in Table 1.
As seen in the table, the eigenvectors based on the symmetrised spatial sign covariance matrix SSCov and Dümbgen's SSCov 2 are almost as efficient as those based on the classical sample covariance matrix in the multivariate normal case. For heavy-tailed distributions, the efficiencies are much higher. The eigenvectors based on the spatial sign covariance matrix SCov and Tyler's shape estimate SCov 2 have much lower efficiencies than their symmetrised counterparts, but for heavy-tailed distributions, these estimators outperform the classical ones.

Conclusions
In this paper, orthogonally equivariant spatial sign covariance matrices as well as their affine equivariant counterparts were compared in a PCA framework. Robustness and efficiency studies showed that the eigenvectors based on different covariance matrix estimators are highly resistant to outliers, and, in particular, the eigenvectors based on symmetrised estimators have exellent asymptotic efficiencies. Estimates based on spatial signs are easy and fast to compute even for high-dimensional data thereby providing practical robust alternatives to classical sample covariance matrix based methods. Programs for computing estimators based on spatial sign vectors are available in the R-packages SpatialNP and MNM (Nordhausen and Oja, 2011). Table 1: AREs of first eigenvector estimates based on (a) SSCov and SCov (between parenthesis) and (b) SSCov 2 and SCov 2 (between parenthesis) as compared to eigenvector estimates based on sample covariance matrix. The underlying distribution is the multivariate t-distribution with (a) Σ = diag(p, p − 1, . . . , 1) and (b) Σ = I p , and selected values of p and ν.