Tests of multinormality based on location vectors and scatter matrices

Classical univariate measures of asymmetry such as Pearson's (mean − median)/σ or (mean − mode)/σ often measure the standardized distance between two separate location parameters and have been widely used in assessing univariate normality. Similarly, measures of univariate kurtosis are often just ratios of two scale measures; the classical standardized fourth moment and the ratio of the mean deviation to the standard deviation serve as examples. In this paper we consider tests of multinormality which are based on the Mahalanobis distance between two multivariate location vector estimates or on the (matrix) distance between two scatter matrix estimates, respectively. Asymptotic theory is developed to provide approximate null distributions as well as to consider asymptotic efficiencies. Limiting Pitman efficiencies for contiguous sequences of contaminated normal distributions are calculated and the efficiencies are compared to those of the classical tests by Mardia. Simulations are used to compare finite sample efficiencies. The theory is also illustrated by an example.


Introduction
Classical multivariate analysis is mostly based on the assumption that the data come from a multivariate normal distribution. Tests of multinormality have therefore received much attention, and several tests for assessing multinormality have been proposed in the literature; see e.g. Gnanadesikan (1977), Cox and Small (1978), Mardia (1980) and Romeu and Ozturk (1993) for reviews. If the hypothesis of multinormality is rejected, alternative robust and nonparametric multivariate procedures can be used.
In the univariate case, standardized third and fourth moments are often used to indicate skewness and kurtosis. The standardization is then made using the sample mean and sample variance. In this spirit, if X = (x_1, . . . , x_n) is a random sample from a k-variate normal distribution N(µ, Σ), the first step in assessing multivariate normality is to standardize the data vectors using the sample mean vector x̄ and sample covariance matrix S, the regular estimates of µ and Σ, respectively. Let then z_i = S^{−1/2}(x_i − x̄) be the scaled residual, and let r_i = ||z_i|| and u_i = ||z_i||^{−1} z_i be its radius and direction vector, i = 1, . . . , n. Note that r_i^2 is the squared Mahalanobis distance between x_i and x̄. Moreover, r_ij = z_i^T z_j is known as the cosine of the Mahalanobis angle between the vectors x_i − x̄ and x_j − x̄. Mardia (1970) then defined multivariate measures of skewness and kurtosis by b_1 = ave_{i,j}{r_ij^3} and b_2 = ave_i{r_i^4}.
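The standardization step and Mardia's two statistics can be sketched directly (a minimal Python illustration assuming an n × k data matrix; the function name is ours):

```python
import numpy as np

def mardia_statistics(X):
    """Mardia's multivariate skewness b1 and kurtosis b2 for an (n, k) sample.

    The data are standardized with the sample mean vector and the ML sample
    covariance matrix; R[i, j] = z_i^T z_j collects the r_ij, and the
    diagonal of R holds the squared radii r_i^2.
    """
    n, k = X.shape
    D = X - X.mean(axis=0)
    S = D.T @ D / n                      # ML covariance estimate
    R = D @ np.linalg.inv(S) @ D.T       # Gram matrix of scaled residuals
    b1 = np.mean(R ** 3)                 # ave_{i,j} r_ij^3
    b2 = np.mean(np.diag(R) ** 2)        # ave_i r_i^4
    return b1, b2
```

Both statistics are affine invariant, so transforming the sample by any nonsingular matrix and shift vector leaves them unchanged.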
Under the hypothesis of multinormality, the limiting distribution of nb_1/6 is a chi-square distribution with k(k + 1)(k + 2)/6 degrees of freedom, and the limiting distribution of (b_2 − k(k + 2))/(8k(k + 2)/n)^{1/2} is N(0, 1). For different definitions of multivariate skewness and kurtosis, see Malkovich and Afifi (1973) and Koziol (1986, 1987). Bera and John (1983) used certain third and fourth moments of the scaled residuals, T_r = ave_i{z^3_ri} and T_rs = ave_i{z^2_ri z^2_si}, r, s = 1, . . . , k, to construct four different test statistics for multinormality and compared them with Mardia's skewness and kurtosis tests under the multivariate Pearson family of distributions. Also Koziol (1982, 1983, 1986, 1987, 1993) used third and fourth moments to construct tests for multinormality. He noted that the smooth tests may be decomposed into individual components, which are distributed as independent χ^2 random variables, and compared those components under different kinds of alternatives. Classical univariate measures of asymmetry, like Pearson's β_1, are of the form (mean − median)/σ or (mean − mode)/σ and measure the distance between two location parameters (Pearson, 1895). These measures have been widely used in assessing univariate normality; see MacGillivray (1986) for a survey of work on skewness. On the other hand, measures of univariate kurtosis are often just ratios of two scale measures. Examples include the classical kurtosis measure (the standardized fourth moment) and the ratio of the mean deviation to the standard deviation (Geary, 1935), among others. For measures of univariate skewness and kurtosis, see also Oja (1981).
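These limiting null distributions give approximate p-values using only the standard library; `chi2_sf` below (our helper) implements the textbook recursion Q_{ν+2}(x) = Q_ν(x) + (x/2)^{ν/2} e^{−x/2}/Γ(ν/2 + 1) for integer degrees of freedom:

```python
import math

def chi2_sf(x, df):
    """Survival function P(chi2_df > x) for integer df, via the recursion
    Q_{v+2}(x) = Q_v(x) + (x/2)^(v/2) e^(-x/2) / Gamma(v/2 + 1)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        q, v = math.exp(-x / 2), 2                 # Q_2(x) = e^{-x/2}
    else:
        q, v = math.erfc(math.sqrt(x / 2)), 1      # Q_1(x) = erfc(sqrt(x/2))
    while v < df:
        q += (x / 2) ** (v / 2) * math.exp(-x / 2) / math.gamma(v / 2 + 1)
        v += 2
    return q

def mardia_pvalues(b1, b2, n, k):
    """Approximate p-values: n*b1/6 compared with a chi-square distribution,
    b2 standardized and compared (two-sided) with N(0, 1)."""
    df = k * (k + 1) * (k + 2) // 6
    p_skew = chi2_sf(n * b1 / 6, df)
    z = (b2 - k * (k + 2)) / math.sqrt(8 * k * (k + 2) / n)
    p_kurt = math.erfc(abs(z) / math.sqrt(2))      # 2 * (1 - Phi(|z|))
    return p_skew, p_kurt
```

The integer-df restriction is harmless here, since k(k + 1)(k + 2)/6 is always an integer.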
In this paper we construct new test statistics for multinormality which are based on multivariate location vectors and scatter matrices. Our plan is as follows. In Section 2, location vectors and scatter matrices are introduced. We also give the general form of their influence functions and asymptotic variances and present some affine equivariant location and scatter matrix estimators. In Section 3, the new statistics for testing multinormality are presented and their large sample properties are studied. Asymptotic efficiencies are given in Section 4, and in Section 5 simulations are used to compare the finite sample powers of the tests. In Section 6, the theory is illustrated by an example, and the paper is concluded with some comments in Section 7.

Location vectors and scatter matrices
Let X = (x_1, . . . , x_n) be a random sample from a k-variate distribution with cdf F. As usual, write F_n for the empirical cdf. Our tests of multinormality are based on two separate location vector or scatter matrix estimates. A k-vector valued statistic T = T(X) is a location vector if it is affine equivariant, that is, T(AX + b1_n^T) = AT(X) + b for all nonsingular k × k matrices A and all k-vectors b. We assume that it is possible to define (in a 'large' family of distributions) a corresponding statistical functional T(F) so that T(F_n) = T(X). A k × k matrix valued statistic C = C(X) is a scatter matrix if it is affine equivariant, which now means that C(AX + b1_n^T) = AC(X)A^T. Assume again that a corresponding statistical functional C(F) can be defined so that C(F_n) = C(X).
It is easy to see that, if F is elliptic with mean vector µ and covariance matrix Σ (and they exist), then T(F) = µ and C(F) = c_F Σ, where the constant c_F > 0 depends on the distribution F and on the scatter functional C. If multinormality is assumed, then c_F is known for each C, and a correction factor can be used to guarantee consistency to the regular covariance matrix. To conclude, under multinormality all location vectors estimate the same quantity µ, and all scatter matrices equipped with the correction factor estimate the covariance matrix Σ.
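In practice, one common way to implement such a correction is to rescale the scatter estimate so that the median of the squared Mahalanobis distances matches the χ²_k median, as is often done for robust scatter estimators. A sketch under that assumption (the function name is ours, and this is only one of several possible corrections):

```python
import math
import numpy as np

def rescale_to_covariance(X, C, chi2_median_k):
    """Rescale a scatter estimate C (consistent for c_F * Sigma) so that it
    estimates the covariance matrix Sigma under multinormality.

    Under N(mu, Sigma), squared Mahalanobis distances w.r.t. C behave like
    chi2_k / c_F, so median(d2) / median(chi2_k) estimates 1 / c_F and
    C * median(d2) / median(chi2_k) estimates Sigma.
    """
    D = X - X.mean(axis=0)               # simple centering for the sketch
    d2 = np.einsum('ij,jk,ik->i', D, np.linalg.inv(C), D)
    return C * np.median(d2) / chi2_median_k
```

For k = 2 the exact χ² median is 2 log 2, so no quantile tables are needed in that case.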
For robust functionals T and C, the functions γ_T(r), α_C(r) and β_C(r) appearing in the influence functions are continuous and bounded; at the standard multinormal distribution the influence functions are of the form

IF(x; T, F) = γ_T(r)u    (1)

and

IF(x; C, F) = α_C(r)uu^T − β_C(r)I_k,    (2)

where r = ||x|| and u = ||x||^{−1}x. In this paper we assume that, under multinormality, the estimates T(X) and C(X) are consistent to µ and Σ, respectively, and that the limiting distributions of √n(T(X) − µ) and √n vec(C(X) − Σ) are multinormal. The asymptotic covariance matrices of T(X) and vec(C(X)) then follow from (1) and (2); see e.g. Tyler (1982).
Besides the sample mean and sample covariance matrix, several affine equivariant location vector and scatter matrix estimators have been introduced in the literature. Location and scatter M-functionals (Maronna, 1976; Huber, 1981) are defined as functionals T(F) and C(F) which satisfy the implicit equations T(F) = E[v_1(r)x]/E[v_1(r)] and C(F) = E[v_2(r)(x − T(F))(x − T(F))^T], where r = ||x − T(F)||_{C(F)} and v_1 and v_2 are real-valued functions on [0, ∞). As an example of M-functionals, consider Huber's M-functionals M(q) that use v_1(r) = 1 for r ≤ c and v_1(r) = c/r for r > c, and v_2(r) = [v_1(r)]^2/σ, where c is a tuning constant defined so that q = Pr(χ^2_k ≤ c^2). The scaling factor σ is chosen so that E[v_2(r)r^2] = k, where r^2 is a random variable from a χ^2_k distribution. Then C(F) = Σ at the multivariate normal population. The influence functions of location and scatter M-functionals are derived in Huber (1981, Section 8.7). Note also that choosing v_1(r) = 1/r and v_2(r) = k/r^2 in the M-estimation equations yields the transformation-retransformation spatial median (Hettmansperger and Randles, 2002). The influence function of this median is given in Ollila et al. (2003).
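M-functionals of this type are usually computed by iterative reweighting. A minimal sketch for the Huber-type weights (our helper name; for simplicity c and σ are passed in directly rather than derived from q, and no consistency correction is applied):

```python
import numpy as np

def huber_m_estimates(X, c, sigma, max_iter=200, tol=1e-9):
    """Fixed-point (IRLS) iteration for Huber-type location/scatter
    M-estimates: T = ave[v1(r) x] / ave[v1(r)] and
    C = ave[v2(r) (x - T)(x - T)^T], with r the Mahalanobis radius,
    v1(r) = min(1, c/r) and v2(r) = v1(r)^2 / sigma."""
    n, k = X.shape
    T = X.mean(axis=0)                    # start from mean and covariance
    C = np.cov(X, rowvar=False)
    for _ in range(max_iter):
        D = X - T
        r = np.sqrt(np.einsum('ij,jk,ik->i', D, np.linalg.inv(C), D))
        r = np.maximum(r, 1e-12)          # guard against zero radii
        w1 = np.where(r <= c, 1.0, c / r)
        w2 = w1 ** 2 / sigma
        T_new = (w1[:, None] * X).sum(axis=0) / w1.sum()
        D = X - T_new
        C_new = (w2[:, None] * D).T @ D / n
        converged = (np.abs(T_new - T).max() < tol
                     and np.abs(C_new - C).max() < tol)
        T, C = T_new, C_new
        if converged:
            break
    return T, C
```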
If now T_1(F) and C_1(F) are any location and scatter functionals, then one-step M-functionals of location and scatter, starting from T_1(F) and C_1(F), are given by T_2(F) = E[v_1(r)x]/E[v_1(r)] and C_2(F) = E[v_2(r)(x − T_1(F))(x − T_1(F))^T], where r^2 = ||x − T_1(F)||^2_{C_1(F)}. Again, to guarantee the consistency of C_2(F) to Σ at the multivariate normal distribution, the weight function v_2 should be scaled so that E[v_2(r)r^2] = k, where r^2 is a random variable from a χ^2_k distribution. The influence functions of one-step M-functionals may be written using the influence functions of the initial estimators; they are given by (1) and (2) with modified functions γ, α and β. Location and scatter S-functionals (Rousseeuw and Leroy, 1987; Davies, 1987) are defined as the solutions T(F) and C(F) to the problem of minimizing det(C(F)) subject to E[ρ(r)] = b, where r^2 = ||x − T(F)||^2_{C(F)} and ρ(r) is a bounded, nondecreasing and nonnegative function. The constant b is generally chosen to be b = E[ρ(r)], where r^2 is a random variable from a χ^2_k distribution; then C(F) = Σ at the multinormal population. An example of a ρ function is Tukey's biweight ρ-function. To attain a prescribed breakdown point, the tuning constant c is chosen so that the breakdown point equals b/ρ(c). For the influence functions of S-functionals, see e.g. Lopuhaä (1989).
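One-step M-estimates are straightforward in practice: evaluate the weights at the initial estimates and update once. A sketch (the helper name is ours; the consistency scaling of v_2 is left to the caller):

```python
import numpy as np

def one_step_m(X, T1, C1, v1, v2):
    """One-step M-estimates of location and scatter starting from (T1, C1):
    T2 = ave[v1(r) x] / ave[v1(r)],
    C2 = ave[v2(r) (x - T1)(x - T1)^T],
    where r is the Mahalanobis radius of x w.r.t. (T1, C1)."""
    n, _ = X.shape
    D = X - T1
    r = np.sqrt(np.einsum('ij,jk,ik->i', D, np.linalg.inv(C1), D))
    w1, w2 = v1(r), v2(r)
    T2 = (w1[:, None] * X).sum(axis=0) / w1.sum()
    C2 = (w2[:, None] * D).T @ D / n
    return T2, C2
```

With T_1 and C_1 the sample mean and covariance matrix and v_1(r) = r^2, T_2 reweights each observation by its squared radius, which is the choice used for the generalization of Mardia's skewness in Section 3.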

New test statistics
As before let X = (x 1 , . . . , x n ) be a random sample from a k-variate distribution with cdf F . We wish to test the null hypothesis of multinormality, that is, F is the cdf of N (µ, Σ) for some (unknown) µ and Σ.
As in the univariate case, the Mahalanobis difference between two location vectors may be used to indicate skewness. Also, two scatter matrices may be used to construct a multivariate measure of kurtosis. Therefore we define our test statistics as follows.
Definition 1 (i) Let T_1 and T_2 be two separate location vectors and let C be a scatter matrix. Then a test statistic for multinormality (to detect skewness) is given by U = ||C^{−1/2}(T_1 − T_2)||^2 = (T_1 − T_2)^T C^{−1}(T_1 − T_2). (ii) Let C_1 and C_2 be two separate scatter matrices equipped with the correction factor. Then a test statistic for multinormality (to detect kurtosis) is given by W = ||C_1^{−1}C_2 − I_k||^2, where ||·||^2 = Tr(·^T ·).
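A sketch of the two statistics, under the assumption that U is the squared Mahalanobis distance between the two location estimates and W the squared matrix distance ||C_1^{−1}C_2 − I_k||^2 (function names ours):

```python
import numpy as np

def skewness_stat_U(T1, T2, C):
    """Squared Mahalanobis distance between two location estimates."""
    d = np.asarray(T1) - np.asarray(T2)
    return float(d @ np.linalg.solve(C, d))

def kurtosis_stat_W(C1, C2):
    """Squared matrix distance ||C1^{-1} C2 - I||^2 with ||M||^2 = Tr(M^T M)."""
    M = np.linalg.solve(C1, C2) - np.eye(len(C1))
    return float(np.trace(M.T @ M))
```

Both use `np.linalg.solve` rather than an explicit inverse, which is numerically preferable when the scatter estimate is nearly singular.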
Note that, as U is based on the distance between two location estimates, it measures skewness in the direction of C^{−1/2}(T_1 − T_2). The first part of W, the variance of the eigenvalues of C_1^{−1}C_2, measures the difference in the shapes of the scatter matrix estimates; the second part measures the difference in their scales. Note also that, using the eigenvalues and eigenvectors of C_1^{−1}C_2, or equivalently of C_1^{−1/2}C_2C_1^{−1/2}, it is also possible to find the direction that maximizes the kurtosis. Since u^T C_2 u / u^T C_1 u is a descriptive statistic for the kurtosis of the linear combination u^T x, the vector u that maximizes this quantity gives the direction of maximal kurtosis. Writing u = C_1^{−1/2}v, the problem reduces to maximizing the Rayleigh quotient of C_1^{−1/2}C_2C_1^{−1/2}. It is well known that the direction v that maximizes this quotient is the eigenvector corresponding to the largest eigenvalue of C_1^{−1/2}C_2C_1^{−1/2}. Since our measures of skewness and kurtosis are based on affine equivariant location vectors and scatter matrices, the invariance of the test statistics easily follows.
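The eigen-computation described above can be sketched directly, forming C_1^{−1/2} through a symmetric eigendecomposition (the helper name is ours):

```python
import numpy as np

def max_kurtosis_direction(C1, C2):
    """Direction u maximizing u^T C2 u / u^T C1 u: take the leading
    eigenvector v of C1^{-1/2} C2 C1^{-1/2}, then back-transform
    u = C1^{-1/2} v and normalize."""
    w, V = np.linalg.eigh(C1)
    C1_isqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T     # symmetric C1^{-1/2}
    vals, vecs = np.linalg.eigh(C1_isqrt @ C2 @ C1_isqrt)
    u = C1_isqrt @ vecs[:, -1]    # eigh returns eigenvalues in ascending order
    return u / np.linalg.norm(u)
```

The direction is determined only up to sign, as is the underlying eigenvector.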
Lemma 1 The statistics U and W are affine invariant.
As the test statistics are affine invariant, their critical values for different dimensions and different sample sizes may in principle be tabulated. The limiting null distributions may also be used to find approximate p-values, see the following theorem.
Theorem 1 (i) Under the null hypothesis, the limiting distribution of nU is that of η_1 U_1, where U_1 ∼ χ^2_k. (ii) Under the null hypothesis, the limiting distribution of nW is that of η_2 W_1 + η_3 W_2, where W_1 ∼ χ^2_{k(k+1)/2−1} and W_2 ∼ χ^2_1 are independent. The constants η_1, η_2 and η_3 depend on the functions appearing in the influence functions of the chosen functionals; the expected values are again calculated for r^2 ∼ χ^2_k. Finally note that, using special choices of location and scatter estimators, it is possible to obtain generalizations of Mardia's classical measures of skewness and kurtosis. If the multivariate measure of skewness is constructed so that T_1 and C are the sample mean vector and sample covariance matrix and T_2 is the one-step M-estimator that uses T_1 and C as initial estimators and weight function v_1(r) = r^2, then the resulting skewness measure, b_1,new, is equivalent to that introduced in Mori et al. (1993). The limiting distribution of nb_1,new may be computed using Theorem 1; it simplifies to that of η_1 U_1, where U_1 ∼ χ^2_k and η_1 = 2(k + 2)/k^2. See also Henze (1997).
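Given the limiting law η_1 χ^2_k with η_1 = 2(k + 2)/k^2, an approximate p-value for n·b_1,new can be obtained by simulating from the limiting distribution (a sketch; the helper name is ours):

```python
import numpy as np

def pvalue_nb1new(b1_new, n, k, nsim=200000, seed=0):
    """Monte Carlo p-value: compare n * b1_new with draws from
    eta1 * chi2_k, where eta1 = 2(k + 2)/k^2 is the limiting null scale."""
    eta1 = 2.0 * (k + 2) / k ** 2
    draws = eta1 * np.random.default_rng(seed).chisquare(k, nsim)
    return float(np.mean(draws >= n * b1_new))
```

For a single scaled chi-square the p-value could of course also be computed exactly; the Monte Carlo form extends unchanged to the weighted sum η_2 W_1 + η_3 W_2 arising for nW.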

Limiting efficiencies
In this section, the contaminated normal model is used for efficiency comparisons. Due to the affine invariance property, it is not a restriction to assume that an observation comes from the standard multinormal distribution N(0, I_k) with probability 1 − ε and from N(µ, Σ) with probability ε. In the following we write G for the cdf of the contamination N(µ, Σ) and let ε = δ/√n depend on n. The null hypothesis case is then given by H_0 : δ = 0, and we consider a contiguous sequence of alternatives H_n : ε = δ/√n for a fixed δ > 0. For the contiguity of this sequence, see the proof of Theorem 2.
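The contaminated normal model is easy to simulate: each observation is drawn from N(0, I_k) with probability 1 − ε and from the contamination N(µ, Σ) with probability ε (a sketch; the function name is ours):

```python
import numpy as np

def rcontam_normal(n, k, eps, mu, Sigma, rng):
    """Sample n observations from (1 - eps) N(0, I_k) + eps N(mu, Sigma)."""
    X = rng.standard_normal((n, k))
    mask = rng.random(n) < eps           # which observations are contaminated
    m = int(mask.sum())
    if m > 0:
        L = np.linalg.cholesky(Sigma)    # Sigma = L L^T
        X[mask] = rng.standard_normal((m, k)) @ L.T + np.asarray(mu)
    return X
```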
Theorem 2 (i) Under the sequence of alternatives H_n, the limiting distribution of nU is that of η_1 U*_1, where U*_1 has a noncentral χ^2_k distribution with noncentrality parameter depending on δ and the contamination. (ii) Under the sequence of alternatives H_n, the limiting distribution of nW is that of η_2 W*_1 + η_3 W*_2, where W*_1 and W*_2 are independent noncentral chi-squared variables with k(k + 1)/2 − 1 and 1 degrees of freedom, respectively. The contaminated normal model is a natural alternative to the normal model: it is flexible, containing alternatives of different types (skewness and kurtosis) depending on the choices of µ and Σ. Contiguous alternative sequences for skewness and kurtosis are obtained by the choices H_1n : µ = (µ_1, µ_2, . . . , µ_k)^T and Σ = I_k, and H_2n : µ = 0 and Σ = diag(σ^2_1, . . . , σ^2_k). In the following we compare our new skewness measures to Mardia's skewness b_1 using the alternative sequences H_1n, and our new kurtosis measures to Mardia's b_2 using the alternative sequences H_2n; therefore we need the following results, see e.g. Mardia (1974).
(ii) Under the sequence of alternatives H_2n, the limiting distribution of n(b_2 − k(k + 2))^2/(8k(k + 2)) is a noncentral chi-squared distribution with 1 degree of freedom. The definition of Pitman efficiency was extended (1983) to cases where the limiting distributions of the test statistics are of different types. In our case, the asymptotic relative efficiency of U with respect to b_1 is then easily seen to be ARE(U, b_1) = δ^2_1/δ^2_2, (3) where δ_1 and δ_2 are chosen so that the two tests attain the same limiting size and power.
Note that the limiting efficiency may now depend on the chosen size and power.
Next, the efficiencies of four new skewness measures are compared to Mardia's skewness b_1 using the alternative sequence H_1n : µ = (µ_1, 0, . . . , 0)^T and Σ = I_k. In all cases, the location vector T_1 and scatter matrix C were chosen to be the sample mean vector and sample covariance matrix. As T_2 we use Huber's M-estimator M(0.9) in β_1,M, the 25% breakdown S-estimator with biweight ρ-function in β_1,S, and the transformation-retransformation spatial median in β_1,T. Finally, as pointed out in the previous section, the one-step M-estimator with the sample mean vector and sample covariance matrix as initial estimators and weight function v_1(r) = r^2 is used as T_2 in b_1,new. The asymptotic relative efficiencies were computed using (3) with α = 0.05 and β = 0.5. Simulations with 10000 replications were used to compute the noncentrality parameters given in Theorem 2. Since the resulting simulated noncentrality parameters appeared to be very unstable for very small µ_1, we only considered the values µ_1 = 0.5, 1, 2, 3, 4 and 5.
In Figure 1, the asymptotic relative efficiencies are illustrated for different values of µ_1 and for dimensions k = 2 and k = 3. In both cases, the b_1,new test, that is, the generalization of Mardia's skewness measure, is the most efficient one. For small µ_1 the efficiencies of β_1,M and β_1,S are tolerable, but as µ_1 increases the efficiencies decrease. The efficiency of β_1,T is very low for all µ_1. All efficiencies seem to increase with the dimension k.
Since the test based on the one-step M-estimator with v_1(r) = r^2 appeared to be the most efficient one, the effect on the efficiencies of using a one-step M-estimator with weight function v_1(r) = r^4 was also studied. The resulting test appeared to be the most efficient one, and the efficiencies increased drastically with µ_1. Note, however, that the one-step M-estimator with weight function v_1(r) = r^4 is highly sensitive to outliers; this choice therefore yields a very non-robust test of skewness.
Let us now compare our new kurtosis measures to Mardia's kurtosis b_2 using the alternative sequences H_2n : µ = 0 and Σ = diag(σ^2_1, . . . , σ^2_k). The asymptotic relative efficiency of W with respect to b_2 is given by ARE(W, b_2) = δ^2_1/δ^2_2, (4) where δ_1 and δ_2 are chosen so that P(η_2 χ^2_{k(k+1)/2−1}(C_W1(δ_2)) + η_3 χ^2_1(C_W2(δ_2)) > c_α) = P(χ^2_1(C_b2(δ_1)) > χ^2_{1,1−α}) = β, and the critical value c_α satisfies P(η_2 χ^2_{k(k+1)/2−1} + η_3 χ^2_1 > c_α) = α. (5) In our efficiency comparisons, we use three different W statistics. For every statistic, C_1 was chosen to be the sample covariance matrix, and as C_2 we use Huber's M-estimator M(0.9) in W_M, the 25% breakdown S-estimator with biweight ρ-function in W_S, and the one-step M-estimator based on the sample mean vector and sample covariance matrix with weight function v_2(r) = (k + 2)^{−1}r^2 in b_2,new. The asymptotic relative efficiencies were computed using (4) with α = 0.05 and β = 0.5. The tail probabilities of linear combinations of chi-squared random variables were computed using the algorithm developed in Field (1993), and the noncentrality parameters given in Theorem 2 were obtained through simulations. Figure 2 illustrates the efficiencies in the cases Σ = σ^2 I_2 and Σ = σ^2 I_3 for different values of σ^2. For k = 2 the b_2,new test is equally powerful with Mardia's kurtosis, but for k = 3 Mardia's kurtosis performs better than the other tests. These results were, however, expected, since only the second part of the test statistic W detects this kind of non-normality. For small σ^2, W_M is much more efficient than W_S, but as σ^2 increases these tests become equally efficient.
In Figure 3 the efficiencies are given for the cases Σ = diag(1, σ^2) and Σ = diag(1, 1, σ^2) for different values of σ^2. In both cases the b_2,new test performs better than the other tests. When σ^2 is small, the W_M and W_S tests are also more efficient than b_2, but again as σ^2 increases their efficiencies decrease.

Simulation study
In this section, we compare the empirical powers of the skewness and kurtosis measures defined in the previous section with the powers of b_1 and b_2 through a simple simulation study. The skewness measures were compared using 3-variate samples of sizes n = 50 and 200 from contaminated normal distributions with ε = 0.1 and G = N((µ_1, 0, 0)^T, I_3) for selected values of µ_1. For the kurtosis measures we used G = N(0, diag(1, 1, σ^2)) with selected values of σ^2. The test statistics were then computed. The corresponding p-values of the skewness measures were obtained using the chi-square approximations to the null distributions. Since the convergence of all kurtosis measures to their null distributions appeared to be very slow even in the case n = 200, the empirical critical values were simulated using 10000 multinormal samples.
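Simulating empirical critical values amounts to evaluating the statistic on repeated multinormal samples and taking the appropriate quantile; by affine invariance, N(0, I_k) samples suffice. A sketch (`stat_fn` stands for any statistic computed from an (n, k) sample; names ours):

```python
import numpy as np

def empirical_critical_value(stat_fn, n, k, level=0.10, nrep=2000, seed=0):
    """Simulate stat_fn on nrep standard multinormal samples of size n and
    return the (1 - level) quantile, i.e. the level-alpha critical value."""
    rng = np.random.default_rng(seed)
    vals = [stat_fn(rng.standard_normal((n, k))) for _ in range(nrep)]
    return float(np.quantile(vals, 1.0 - level))
```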
Resulting critical values (as well as asymptotic ones obtained using (5)) are given in Table 2 in Appendix B. The process was replicated 1500 times and empirical powers of the level 0.10 tests are illustrated in Figures 4 and 5.
Consider first the simulation results in Figure 4. When n = 50 and µ_1 is small, no big differences can be seen between the tests. As µ_1 increases, the b_1 and β_1,S tests become more powerful than the others, but as n increases, the empirical power of b_1,new is the highest, as suggested by the limiting efficiencies. The β_1,S test is equally powerful with Mardia's test, and β_1,M and β_1,T also perform very well. The sizes of β_1,M, β_1,S and β_1,T are close to the designated size 0.10 in both cases. In the case n = 50, the size of b_1 was below 0.10; therefore a correction factor given in Mardia (1974) was used to speed up the convergence to the null distribution.
The results in Figure 5 show that for n = 50, the W_M and W_S tests are the most powerful ones and perform much better than b_2 and b_2,new. This may be due to the fact that b_2 and b_2,new converge very slowly to the normal distribution. For n = 200, the powers of the Mardia-type tests increase, but the tests based on robust scatter estimators are still slightly more powerful. In both cases, the sizes of the tests are close to the designated size 0.10.

A real data example
In our example we consider the same data as in Kankainen et al. (2004), that is, two bivariate data sets of n = 50 observations each and the combined data set of n = 100 observations. The data sets are explained and illustrated in Figure 6. We use Mardia's skewness b_1 and kurtosis b_2 to test the null hypothesis of bivariate normality. The results are compared to those given by the new skewness and kurtosis measures β_1,S and W_S. These measures were included in the comparisons since, in spite of their low limiting efficiencies, their finite-sample powers appeared to be very competitive with those of Mardia's measures.
The values of the test statistics were calculated for the three data sets ('Boys', 'Girls' and 'Combined'). The p-values for b_1 and β_1,S were obtained using the chi-square approximations to the null distributions. Fig. 6 The height of the mother (cm) and the birth weight of the child (kg) measured on 50 boys who have siblings (o) and 50 girls (x) who do not have siblings. See Table 1 for the results.
Since the boys in the study had siblings and the girls did not, it was expected that the observations on boys and girls separately came from different bivariate normal distributions. This is seen using all skewness and kurtosis measures. Moreover, due to the difference between boys and girls, the combined data set was assumed to be a mixture of two different bivariate normal distributions. This is confirmed using our new skewness and kurtosis measures. Mardia's skewness rejects the normality hypothesis at significance level 0.10, and the p-value for the β_1,S test is even less than 0.05. When we use Mardia's kurtosis the normality hypothesis is not rejected, but the W_S test gives a p-value which is less than 0.05.

Final remarks
The paper develops affine invariant tests for multivariate skewness and kurtosis. The tests for skewness use the Mahalanobis distance between two location vectors and the tests for kurtosis are based on the comparison of two scatter matrices. As in the univariate case, these skewness and kurtosis measures are independent and may therefore be combined to obtain a test for multinormality. The resulting test is just a sum of skewness and kurtosis measures introduced here and follows a weighted chi-squared distribution.
In this paper, new tests with several choices of affine equivariant location and scatter estimates are considered in more detail. It is shown that using the so-called one-step M-estimators, with special choices of weight functions and with the sample mean vector and covariance matrix as initial estimators, yields generalizations of Mardia's classical measures of skewness and kurtosis. The limiting efficiencies of these measures appear to be very good compared to those of the classical Mardia tests. As illustrated by a simulation study, the finite-sample behavior of the tests based on robust scatter estimators is also satisfactory.
Appendix A: Proofs

Proof (Theorem 1) (i) The test statistics are affine invariant; we may therefore assume that µ = 0 and Σ = I_k. Write then x = ru, where r = ||x|| and u = ||x||^{−1}x, so that r and u are independent.
Let us denote V(F) = C^{−1/2}(F)(T_1(F) − T_2(F)); its influence function follows from the influence functions of T_1, T_2 and C. Write then V = V(F_n) for the estimate corresponding to the functional V(F). Now, according to Huber (1981), as V(F) is sufficiently regular,

V = V(F) + ave_i{IF(x_i; V, F)} + o_P(n^{−1/2}).    (6)

Thus under H_0, V(F) = 0, and by the central limit theorem, the limiting distribution of √n V is multinormal with mean 0 and covariance matrix E[IF(x; V, F)IF(x; V, F)^T]. As before, since H(F) is sufficiently regular, we may write H = H(F_n) as in (6). The elements of √n H are thus normally distributed, with asymptotic variances and covariances determined by the influence function of H. Note that the diagonal and off-diagonal elements of H are independent. Consider now the test statistic nW, which may be rewritten in terms of the elements h_ij of H. Write then b^2 = 2ASV(h_ij; F) and d^2 = ASC(h_ii, h_jj; F). Since √n h_ij ∼ N(0, b^2/2) and the elements h_ij, i ≠ j, are independently distributed, the limiting distribution of 2n Σ_{i<j} h^2_ij under H_0 is that of η_2 W_a, where W_a ∼ χ^2_{k(k−1)/2}. The diagonal elements h_ii are not independent, but we may write h = Az, where h = (h_11, . . . , h_kk)^T, z = (z_1, . . . , z_{k+1})^T and √n z ∼ N(0, I_{k+1}), for a suitable k × (k + 1) matrix A.

Proof (Theorem 2) (i) In the proof we apply LeCam's third lemma. Write f for the density function of the contaminated normal distribution. Let us first show that the sequence ε = δ/√n is contiguous to the null hypothesis. The optimal likelihood ratio test statistic for testing H_0 against H_n is a sum of independent terms k_i with E(k_i) = 0 and bounded variance Var(k_i) = σ^2. Thus, by LeCam's first lemma, the sequence of alternatives is contiguous to the null hypothesis.
As in the proof of Theorem 1, expand √n V under the alternative sequence. By LeCam's third lemma, under the alternative sequence √n V is asymptotically multinormal with covariance matrix η_1 I_k, and the limiting distribution of nU is that of η_1 U*_1, where U*_1 has a noncentral χ^2_k distribution. (ii) As in the proof of Theorem 1, decompose nW into its off-diagonal and diagonal parts. If W*_a is now computed under H_n, its noncentrality parameter follows from LeCam's third lemma. Notice next that, using the transformation h = Az defined in the previous proof, we have Ez = A^− Eh, where A^− is the ((k + 1) × k) generalized inverse of A, and the noncentrality parameter of W*_1 = W*_a + W*_b simplifies accordingly. The above expectations may again be computed using LeCam's third lemma, giving the resulting noncentrality parameter for W*_1; the noncentrality parameter for W*_c is obtained similarly. The result now follows, since IF(x; H, F) = IF(x; C_1^{−1}C_2, F).
Appendix B: The empirical critical values for kurtosis measures