On the computation of symmetrized M-estimators of scatter

This paper focuses on the computational aspects of symmetrized M-estimators of scatter, i.e. the multivariate M-estimators of scatter computed on the pairwise differences of the data. Such estimators do not require a location estimate, and more importantly, they possess the important block and joint independence properties. These properties are needed, for example, when solving the independent component analysis problem. Classical and recently developed algorithms for computing the M-estimators and the symmetrized M-estimators are discussed. The effect of parallelization is considered as well as new computational approach based on using only a subset of pairwise differences. Efﬁciencies and computation time comparisons are made using simulation studies under multivariate elliptically symmetric models and under independent component models.


Introduction
Almost all of the classical multivariate methods, including principal component analysis, multivariate regression, canonical correlation analysis, etc., are dependent on the use of the sample covariance matrix. It is well known that under the assumption of multivariate normality, the methods based on this estimator are optimal. However, if the normality assumption is not satisfied, e.g., if the data are con-taminated with outlying observations or have heavier tails than that of the normal distribution, then methods based on the sample covariance matrix perform poorly.
A widely used approach for robustifying classical multivariate methods is the socalled plug-in approach. In this approach, the sample covariance matrix is replaced by a robust scatter matrix. As a consequence a vast variety of robust alternatives for the sample covariance matrix have been proposed in the literature. Some widely used robust estimators include M-estimators (Maronna, 1976;Huber, 1981), MCDestimators (Rousseeuw, 1985) and S-estimators (Davies, 1987;Lopuhaä, 1989), among others. For an overview of robust multivariate methods, see Maronna et al (2006).
When robust plug-in methods are proposed, one important issue is often ignored, namely that a multivariate method may not be valid unless the robust scatter matrix satisfies certain crucial properties that hold for the sample covariance matrix. In Nordhausen and Tyler (2015) a thorough discussion of such properties is given. Focusing on the so-called joint and block independence properties (defined in the next section), Nordhausen and Tyler (2015) give several examples of plug-in multivariate methods, which are not valid unless the scatter matrix possesses these properties. Examples include independent component analysis, observational regression, and graphical modeling. For the role of scatter matrices in independent component analysis, see also Oja et al (2006), Nordhausen et al (2008), and Tyler et al (2009), among others.
In Oja et al (2006) it is shown that by computing any scatter matrix using pairwise differences rather than the observations themselves produces an estimator with the joint independence property. Sirkiä et al (2007) discuss general symmetrized M-estimators, and give as examples the symmetrized Huber estimators, and Dümbgen's (1998) estimator, which is a symmetrized version of Tyler's (1987) M-estimator. Croux et al (1994) and Roelant et al (2009) propose using symmetrized S-estimators in univariate and multivariate regression settings, respectively, with their main focus being on improving efficiency at the normal model.
As symmetrized estimators are defined using pairwise differences, the computations become intensive with increasing sample size. In this paper we focus on the computational aspects and consider a few practical ways to handle this problem, especially in the context of M-estimates. The paper is organized as follows. In Section 2 we recall the definitions of scatter matrix and block and joint independence, and in Section 3 the definition and main properties of symmetrized M-estimators of scatter. Section 4 provides some new approaches for computing symmetrized estimators. In Sections 5 and 6 simulation studies are given to compare efficiencies and computation times of different approaches, respectively. The paper is concluded with some discussion in Section 7.

Scatter matrices and block independence
Recall first the definition of a scatter matrix functional.
Definition 1. Let x be a p-variate random vector with cumulative distribution function F x . Then a p × p matrix valued functional V = V(F x ) is a scatter matrix functional if it is symmetric, positive semi-definite and affine equivariant in the sense that for any full rank p × p matrix A and any p-vector b.
A scatter matrix is then naturally defined asV = V(F n ), where F n is the empirical cdf. Most robust counterparts of covariance matrix satisfy (1). However, they usually do not satisfy the so-called joint and block independence properties, which are characteristic of the covariance matrix, and are defined as follows.
ii) If k = p, which means that x has independent components, and V(F x ) is a diagonal matrix, then it is said to have the joint independence property.
Note that the block independence property implies the joint independence property, but not vice versa. In Nordhausen and Tyler (2015) several examples of multivariate methods are given for which it is necessary for a scatter matrix to possess the joint or block independence property. Most scatter functionals do not posses the joint or block independence property. A common conjecture here is that only scatter matrices which can be expressed as functions of pairwise differences have this property. For example COV( , where x 1 and x 2 denote independent copies of x, can be written in such a way. What about scatter matrices which cannot be expressed using pairwise differences? A quite simple but ingenious approach is to apply a scatter functional to the pairwise differences, which is known as symmetrization. Theorem 1 in Oja et al (2006) shows that when a scatter matrix functional is applied to the pairwise differences of the observations, then the resulting functional possesses the joint independence property. In Nordhausen and Oja (2011) and Nordhausen and Tyler (2015) it is shown that symmetrization yields to a more general block independence property.
A formal definition of symmetrization is given as follows.
Definition 3. Let V(F x ) be any scatter functional. Then the corresponding symmetrized scatter functional is defined as where x 1 and x 2 are two independent copies of x.
In this paper we are mainly interested in computational aspects of symmetrized scatter matrices. Although the computational issues discussed herein apply to any symmetrized scatter matrix, this paper focuses on symmetrized M-estimators of scatter (Sirkiä et al, 2007), which, as to be seen, can be made computationally feasible for fairly large sample sizes. The next section reviews the definition and basic properties of symmetrized M-estimators of scatter.

Symmetrized M-estimators of scatter
Write again x for a p-variate random vector with cumulative distribution function F x . In this paper we focus on elliptically symmetric distributions. Such a distribution family is often used in robustness studies as it includes distributions with heavy tails (e.g. elliptical Cauchy distribution) as well as distributions which can be used to generate atypical observations (e.g. contaminated normal distribution).
An elliptically symmetric distribution is obtained as an affine transformation of a spherical distribution. Recall that a p-variate random vector z is spherically symmetric around the origin if Uz ∼ z for all orthogonal p× p matrices U. Then x = Ω Ω Ω z+ µ µ µ, where Ω Ω Ω is a full rank p × p matrix and µ µ µ a p-vector, has an elliptically symmetric distribution with density of the form where g(z) = exp(−ρ(||z||)) represents the density of z, with ρ(·) being a nonnegative function, and Σ Σ Σ = Ω Ω Ω Ω Ω Ω t . Without loss of generality, Σ Σ Σ 1/2 is taken to be the symmetric positive definite square-root of Σ Σ Σ . Note that the density of z depends only on the value of its radius ||z||, and the function ρ(·) does not depend on the parameters µ µ µ and Σ Σ Σ .
The parameter µ µ µ is the location center of the distribution and the scatter matrix Σ Σ Σ is proportional to the regular covariance matrix (if it exists). Examples of function g(·) include g(z) = (2π) −p/2 exp(−z t z/2), which corresponds to the p-variate normal distribution, and which corresponds to the p-variate t-distribution on ν degrees of freedom. Within the class of elliptical distribution, i.e. for unknown g, only the location µ µ µ and the "shape" of Σ Σ Σ , i.e. the value of Σ Σ Σ up to proportionality, is well defined, whereas the constant of proportionality is confounded with the function g.
Next, we recall the definition of the symmetrized M-functional as given in Sirkiä et al (2007).

Definition 4.
Assume that x is a p-variate random vector with cdf F x , and let x 1 and x 2 be two independent copies of x.
] 1/2 , and w 1 and w 2 are some real-valued functions on [0, ∞). Sirkiä et al (2007) observe that the assumptions on the weight functions and on the distribution of the pairwise differences needed for the existence and uniqueness of symmetrized M-functionals follow from Huber's (1981) results for (nonsymmetrized) M-functionals. When x has an elliptical distribution, V s ∝ Σ Σ Σ , with the constant of proportionality being dependent on the weight functions w 1 and w 2 and the density g, but not on the parameters µ µ µ or Σ Σ Σ .
An estimator corresponding to a scatter matrix functional V s is obtained when F x 1 −x 2 in Definition 4 is replaced with the empirical distribution function of the pairwise differences. A symmetrized M-estimator of scatter,V s , then solves ( n 2 where w 1 and w 2 are real-valued functions on [0, ∞). Notice that choices w 1 (r) = ρ ′ (r)/r and w 2 (r) = 2 yield the maximum likelihood estimator under a specific elliptical distribution. The robustness properties and limiting distributions of general symmetrized M-estimators were discussed in Sirkiä et al (2007).
In this paper we consider the following symmetrized M-estimators: • The sample covariance matrix, which corresponds to w 1 (r) = 1 and w 2 (r) = 2, or equivalently to w 1 (r) = 1/2 and w 2 (r) = 1 • The symmetrized Cauchy M-estimator, which has weight functions corresponding to those of the maximum likelihood estimator for the elliptical Cauchy distribution, i.e. w 1 (r) = (1 + p)/(1 + r 2 ) and w 2 (r) = 1. It is worth noting that this is not the same as the maximum likelihood estimator based on the pairwise differences from a random sample of an elliptical Cauchy distribution.
• The symmetrized Huber estimators, which have weight functions w 2 (r) = 1 and where c is a tuning constant defined so that q = Pr(χ 2 p ≤ c 2 /2) for a chosen q. The scaling factor σ is chosen so that which makes the estimator Fisher-consistent for Σ Σ Σ at the multivariate normal model.
Dümbgen's estimator is only define up to proportionality, i.e. bothV s,1 andV s,2 satisfy the corresponding estimating equations, if and only ifV s,1 ∝V s,2 . Furthermore, as noted previously, under sampling from an elliptical distribution, the symmetrized Cauchy M-estimator is Fisher-consistent for the parameter Σ Σ Σ only up to proportionality. This is also true of the sample covariance matrix and the symmetrized Huber M-estimator at elliptical models other than the multivariate normal. These factors, though, are not important to the efficiency comparisons given in Section 5 since only the shape of the scatter matrices are considered in these comparisons.

Computation of symmetrized M-estimators
Hereafter, we consider only the case w 2 (·) = 1, which agrees with the original definition of the M-estimators given in Maronna (1976). Note that this case holds for the three M-estimators discussed in the previous section, as well as for the maximum likelihood estimators of scatter under an elliptical family of distributions, i.e. with a fixed g. A general recent overview of the M-estimators of scatter for the case w 2 (·) = 1 can be found in Dümbgen et al (2015b). They point out that the most commonly used method to compute such M-estimates is via a simple fixed point algorithm, which is known to converge under very general conditions to a unique solution, regardless of the initial value, as shown in Kent and Tyler (1991). Assume in the following that we have a sample of n vectors X = (x 1 , . . . , x n ) and the goal is to compute the symmetrized M-scatter matrix V s of interest. The most naive approach would be to apply the fixed point algorithm for the unsymmetrized scatter of interest to all n(n − 1) pairwise differences x i − x j with i ̸ = j. Notice that now the location center does not need to be estimated as for the symmetrized vectors the location center is naturally the origin. Nevertheless, even for a moderate sample size n, the computational burden can be tremendous and so new approaches are needed to to deal with this. In the following we will consider a few practical ways to reduce the computational burden and memory demand.
The number of pairwise differences can be halved since only the n(n − 1)/2 pairwise differences x i − x j with i < j are needed to compute the symmetrized scatter matrix. Hence the most basic algorithm is the fixed point algorithm with updating step at iteration k: where r k i j is based on the current scatter estimate V k s . Recently, Nordhausen and Tyler (2015) suggested rewriting the above algorithm as The computation of S k+1 i (X) then can be naturally divided into several threads. We refer to this algorithm as the the parallel algorithm and study, in Section 6, how much this approach speeds up computations. Dümbgen et al (2015a) have recently argued that using a fixed point algorithm for computing M-estimates of scatter can be less than optimal. They consider several alternative algorithms and recommended a partial Newton (PN) algorithm, which, in most cases, is considerably faster. The basic idea behind the PN algorithm is to first perform a few fixed point steps and to then evaluate whether shifting to a Newton-Raphson step with an approximated Hessian is better. We refer to the reader to the aforementioned paper for more details regarding the algorithm. A restriction of the PN algorithm is that the weight functions must be smooth, which excludes, for example, Huber's weight functions. Two versions of the PN algorithm were introduced in Dümbgen et al (2015a), with one version requiring all pairwise differences x i − x j with i < j being in the memory, and the other version being a sequential algorithm which avoids storing all pairwise differences. The sequential algorithm seems to be, in most cases, faster than the one that stores all pairwise differences.
We have, thus, several algorithms available so far for the computation of symmetrized M-estimators of scatter. However, these are all computationally intensive as they either store all pairwise differences x i − x j with i < j in the memory or compute sequentially all quantities of interest. This computational burden is demonstrated later in Section 6.
A possible way to ease this computational problem can be motivated by noting the resemblance of the symmetrized scatter matrix to a U-statistic of order two. Recall that for a sample X = (x 1 , . . . , x n ) a U-statistic for a parameter θ based on a symmetric kernel h(x (1) , . . . , x (K) ) of order K is defined as and the kernel is computed for all possible subsamples of size K denoted by x (1) , . . . , x (K) . A simple example of a U-statistic is the sample covariance matrix, which has a kernel of order two and can be expressed as In general, though, not all symmetrized scatter matrices can be expressed as Ustatistics, since they typically have only an implicit rather than an explicit representation in terms of pairwise differences.
In the context of U-statistics, Blom (1976) noted that it is possible to use less than N terms without losing much information when estimating θ , and he called such estimates incomplete U-statistics. Such estimates have also been referred to as weighted U-statistics, with weights 0 or 1, or as reduced U-statistics. In Blom (1976) and Brown and Kildea (1978) the statistical properties of incomplete U-statistics are derived.
Following the idea of incomplete U-statistics, many ways to choose the terms used in computations are possible. The most basic one is independent subsampling, where m sets out all N sets are chosen at random. This can, however, give different weight to different observations in the data. Another convenient choice for kernels of order K = 2, which gives each observation equal weight, is what we refer to as a "running average of length m". For this purpose, we treat the ordering of the data as cyclic and define an extended data matrix X * = (x * 1 , . . . , x * n+m ) = (x 1 , . . . , x n , x 1 , . . . , x m ). Our incomplete symmetrized M-estimator of length m,V I , then solves In the following we explore the idea of computing symmetrized scatter matrices by using running averages of different lengths m, and compare the loss in efficiency to the gain in computation time. From a practical point of view, the observation order should be randomly permuted in order to avoid the effect of how the data was recorded. Using permutations in the simulations, though, are not needed since the simulated data set follows the same model as any permutation of it.

Efficiency comparisons
In this section, we compare the efficiencies of the incomplete (using only mn pairwise differences) symmetrized estimators to that of the corresponding complete symmetrized estimator. We include the symmetrized Huber estimator with q = 0.90, Dümbgen's estimator and the symmetrized Cauchy M-estimator in the comparisons. Since, as previously discussed, the estimators are only comparable up to proportionality, we standardize all estimators so that their traces are equal to p.
To compare the finite sample efficiencies, we first carried out a simulation study with samples of size n =1000, 2000 and 4000, dimensions p = 3 and p = 8, and under the normal distribution (N), the contaminated normal distribution (cN) and the t-distribution on 5 degrees of freedom (t 5 ). The cumulative distribution function of the contaminated normal distribution is Φ ε,c (x) = (1 − ε)Φ(x) + εΦ(x/c), where ε, c > 0 and Φ denotes the cumulative distribution function of the standard multivariate normal distribution. In our simulation settings, we used ε = 0.1 and c = 3.
The asymptotic efficiencies of the three standardized robust scatter estimators relative to the standardized sample covariance matrix are listed in Table 1. The asymptotic relative efficiencies were computed using the results in Sirkiä et al (2007), wherein they observed that the symmetrized Huber estimator and the Dümbgen's estimator are highly efficient not only at heavy tailed distributions but also at the multivariate normal distribution. The symmetrized Cauchy M-estimator, though, suffers from some efficiency loss at the multivariate normal distribution case. Table 1 Asymptotic efficiencies of the symmetrized Huber M-estimator, Dümbgen's estimator and the symmetrized Cauchy M-estimator relative to the sample covariance matrix. The asymptotic relative efficiencies are evaluated at the normal (N), the contaminated normal (cN) and t-distribution on 5 degrees of freedom (t 5 ). To compare the finite sample efficiencies, the mean squared errors of the offdiagonal elements of the standardized scatter matrices, that is, were computed using N = 2000 samples. The efficiencies were then defined by taking the ratios of the corresponding MSEs. The results are listed in Tables 2-4. For all of the estimators, there is some loss, but somewhat surprising not a large loss, in efficiency when m = 10, and when m = 20, the efficiency loss is always less than 5%. The loss in efficiency is slightly worst for the Dümbgen's estimator than for the other estimators.
Among the symmetrized scatter matrices considered in this paper, only the sample covariance matrix is a U-statistic. Nevertheless, the simulations indicate that all scatter matrices computed using running averages of length m seem to behave in a similar fashion. These empirical results suggest that theoretical results obtained for the incomplete sample covariance matrix may give us insight into the behavior of other incomplete symmetrized estimates of scatter.
In particular, results from Brown and Kildea (1978) for incomplete U-statistics, allow us to compute the asymptotic relative efficiency of the incomplete sample covariance estimator with respect to the complete sample covariance matrix. For a spherically symmetric distribution with COV(z) = I p , the efficiency of the incomplete symmetrized sample covariance matrix relative to the complete one is where κ = E[z 2 i z 2 j ]/(2(E[z 2 i z 2 j ] + 1)), and z i and z j are different components of z. In Figure 1 we plot the asymptotic relative efficiency of the symmetrized incomplete sample covariance matrix as a function of m for the 3-variate normal, contaminated normal and t 5 -distribution, for which κ = 1/4, 25/68, and 3/8 respectively. We also  simulated the finite sample efficiencies, which correspond to the dash lines in the figures, computed as ratios of MSEs using n = 1000 and N = 2000 repetitions. It can be seen that the efficiencies increase rapidly as a function of m, with a limit of one as m → ∞. Interestingly, the efficiency at m = 1 is notably higher in the case of heavy tailed distributions than in the case of the normal distribution. The choice m = 20 is sufficient to produce an estimator with very high efficiency. All simulations so far have focused on data coming from an elliptical model, among which the only distribution with independent marginals is the multivariate normal distribution. However, there are many areas of applications, such as independent components analysis, for which independent marginals outside of the multivariate normal distribution are of interest. Consequently, we also simulated data for different sample sizes and dimensions from a model with mutually independent components where each component has a standard exponential distribution. Here, if the scatter functional possesses the joint independence property, then the offdiagonal values of the scatter matrix are equal to zero. For this setting, we compare the symmetrized M-estimators (Dümbgen's estimator and the Cauchy M-estimator) based on all pairwise differences to the corresponding estimators using running averages of length 20. Figure 2 gives the mean squared errors of the off-diagonal values based on 1000 repetitions. The figure shows that in this case the incomplete estimators with m = 20 behave similarly to the regular symmetrized estimators. For comparison, we also compute the corresponding non-symmetrized versions of the scatter matrices (Tyler's estimator and the Cauchy M-estimator, which corresponds to the MLE for the Cauchy distribution, respectively). These estimators do not possess the desired joint independence property, and so the corresponding functionals of these non-symmetrized versions do not have zero off-diagonal elements even though the variables are independent. Consequently, as seen in Figure 2, their MSEs do not go to zero as n increases.
As this section demonstrates, using running average sets of pairwise differences with small to moderate values of m results in only a small loss of efficiency relative to their complete version. In the next section we will see how this small loss in efficiency pays off in computation time.

Computation times
In comparing the computation times of the different algorithms presented in Section 4, we again chose the two dimensions p = 3 and p = 8, and use five sample sizes n =1000, 2000, 4000, 8000, 16000. For each combination of p and n, 50 samples from the multivariate t-distribution with 5 degrees of freedom are generated, and the computation times of the different algorithms are measured. The scatter matrices under consideration are the same as those used in previous sections. For the symmetrized Cauchy M-estimator and for the Dümbgen's estimator, both the fixed point algorithms and the partial Newton algorithms can be found in the R-packages ICSNP (Nordhausen et al, 2012) and fastM (Dümbgen et al, 2014), respectively. The symmetrized Huber estimator can also be computed using the R-package IC-SNP. Currently, there are plans to implement the running average versions of the estimators in these package.
Our main interest in the following comparisons is two-fold. First, we are interested in when parallelization is beneficial, and second, in how fast the running average versions are relative to the standard implementations. In the simulations we chose m = 20 for the incomplete estimators as this was in all cases considered to yield highly efficient estimators. We also used the partial Newton algorithm from the fastM package that uses sequential computations as this seems to be faster than having all pairwise differences in the memory (Dümbgen et al, 2015a). All functions are mainly written in C or C++ with an R interface and should be therefore comparable (but have sometimes slightly different convergence criteria). The comparisons were done using R 3.1.1 (R Core Team, 2014) on a Intel(R) Core(TM) i7-3770 CPU with 3.40 GHz, 32 GB of memory using 64-bit Red Hat Linux. As expected, the regular fixed point algorithm utilizing all pairwise differences is the slowest while the incomplete estimator is the fastest to compute. The ratio of their computation times is approximately the ratio of the number of pairs, which is 0.5(n − 1)/m. With large sample sizes, using two cores gains approximately 50% in computation time and using four cores approximately 75% relative to using only one core. We compared also the computation times when using six cores, but the computation times did not differ significantly different the version using four cores.
Notice that the gain percentage of the parallel computation grows with the sample size and the dimension until it reaches a limiting level. Parallelization becomes beneficial somewhere around n =2000 when p = 3 and around n =1000 when p = 8.
As already pointed out in Dümbgen et al (2015a), the partial Newton algorithm is not considerably faster than the fixed point algorithm when computing Dümbgen's estimator. However, the PN algorithm computes Dümbgen's estimator and the symmetrized Cauchy M-estimator equally fast, whereas all the other algorithms compute Dümbgen's estimator much faster than the symmetrized Cauchy M-estimator; for p = 3 and p = 8, approximately 5 and 18 times faster, respectively. Hence, the PN algorithm is superior to parallelized fixed point algorithm using four cores for the symmetrized Cauchy M-estimator, and vice versa for Dümbgen's estimator. Recall that the PN algorithm cannot be applied to the Huber estimator since the weight functions are not smooth. The computation time of the symmetrized Huber estimator is approximately the same as that of the Dümbgen's estimator when p = 3, but twice as long when p = 8.

Discussion
The relevance of symmetrized scatter matrices has only recently been recognized within the statistics literature. The benefit of using such scatter matrices is twofold: (i) they do not require a location estimate, and (ii) they possess the joint and the block independence properties, which are necessary properties for many multivari- ate methods. These benefits, however, come at a cost, namely that symmetrized scatter matrices tend to be more computationally intensive and are slightly less robust than their unsymmetrized counterparts. In this paper, the computational aspects of symmetrized M-estimators have been considered. In particular, it is shown that parallelization of the fixed-point algorithm is possible for these M-estimators and that this provides a considerable gain when, for example, four cores are used. Paralleization of the fixed point algorithm alone, though, is not as computationally efficient as more recently proposed partial Newton algorithms. Another computational alternative, proposed within the paper, is motivated by results on incomplete U-statistics, namely to reduce the number of pairwise differences used in computations. Such an approach proves to be promising. A huge gain in computation time is achieved with only a small loss in efficiency. Finally, we note that while the parallelization approach is specific for M-estimators, the subsampling of pairwise differences can be applied to any symmetrized scatter matrix.