Gear Classification and Fault Detection Using a Diffusion Map Framework

Abstract

This article proposes a system health monitoring approach that detects abnormal behavior of machines. A diffusion map is used to reduce the dimensionality of the training data, which facilitates the classification of newly arriving measurements. The new measurements are handled with the Nyström extension. The method is trained and tested with real gear monitoring data from several windmill parks. A machine health index is proposed, showing that data recordings can be classified as working or failing using dimensionality reduction and warning levels in the low-dimensional space. The proposed approach can be used with any system that produces high-dimensional measurement data.


Introduction
Modern industrial monitoring systems produce high-dimensional data that are difficult to analyze as a whole without dimensionality reduction. The goal of this study is to estimate whether the proposed dimensionality reduction scheme effectively distinguishes working gears from broken ones. A system health management setup has multiple sensors that measure vibration, temperature and oil properties. Early detection of anomalous gear behavior from these sensor data reduces the risk of severe damage. The sensor data are used to monitor the health of the system, to detect anomalies and to predict problems (Chandola et al., 2009, pp. 15-16).
Anomaly detection methods try to find deviant or atypical measurements in a large data mass (Chandola et al., 2009). In this study known anomalies are included in the training data so that they can be contrasted with the normal behavior. An ideal indicator would tell with certainty that a machine works or is going to fail. In reality, however, the non-working state is ambiguous and can be difficult to classify.
Spectral dimensionality reduction methods include principal component analysis (PCA), kernel PCA, multi-dimensional scaling (MDS), Laplacian eigenmaps, isomap and locally linear embedding (LLE). These methods facilitate the analysis of high-dimensional data by mapping the high-dimensional coordinates to a lower dimension. The spectral approach also leads to the concept of spectral clustering (Bengio et al., 2006; von Luxburg, 2007). Spectral methods have been used to analyze system operational states (Pylvänen et al., 2009), to detect motor faults (Parra et al., 1996) and to detect anomalies in spacecraft (Fujimaki et al., 2005).
Traditionally, frequency analysis has been used to determine the health status of gears or bearings. This study uses frequency information directly from sensors embedded in the gear. Analyzing this frequency information with a diffusion map and thresholding the result produces a classification of working and failing gears.
Recently, wavelet decomposition has been used in a similar fashion to produce a frequency spectrum growth index with a specialized threshold function (Wang et al., 2009). Furthermore, lifting wavelet packet decomposition has been used to extract node energies at the third level of the decomposition as features. This way fuzzy c-means can be trained using normal and failure data; the state of incoming new data is then determined relative to the found clusters (Pan et al., 2010). Another health monitoring index is the average probability index (API), which utilizes empirical mode decomposition for time-frequency features and a hidden Markov model for stochastic estimation (Miao et al., 2010). Support vector data description has been used for early fault detection and fault diagnosis (Wang et al., 2011). Wavelet-based filtering has also produced a health condition indicator for cooling fan bearings (Miao et al., 2012). Finally, diffusion maps have been applied to machine condition monitoring (Huang et al., 2013).
This study uses the diffusion map, another spectral dimensionality reduction method. Its mathematical foundation is a random walk on the Markov transition matrix of the data graph (Coifman and Lafon, 2006a). The diffusion map can be classified as a nonlinear distance-preserving dimensionality reduction method that preserves global properties (van der Maaten et al., 2009). Furthermore, the Nyström method is used to extend new points, although newer methods such as geometric harmonics exist (Fowlkes et al., 2004; Coifman and Lafon, 2006b).
This study presents a way to detect faults in windmill gears by devising an index to describe how close to the faulty state a gear is. Besides gear fault detection, this method can also be used with other collections of high-dimensional time series data.

Methodology
This method trains a diffusion map that describes the good and bad states of the gears in order to create a health management index that predicts gear faults. It then extends newly arriving test measurements to the model and classifies the gear as good or bad. Most of the preprocessing is domain specific, but the dimensionality reduction and classification, which are more universally applicable, are presented here. Figure 1 introduces the overall data processing architecture. The equations are in matrix form; the details behind them are discussed elsewhere (Nadler et al., 2008; Fowlkes et al., 2004; Belongie et al., 2002).

Training dimensionality reduction
The underlying assumption in manifold learning methods is that the data lie on a lower-dimensional manifold within the high-dimensional measurement space (Chandola et al., 2009, p. 37). We try to create a function that maps the behavior of high-dimensional points to lower dimensions. New measurement points are then mapped from the high dimensions to this low-dimensional representation (Coifman and Lafon, 2006a). Let x_i ∈ R^n, i = 1 . . . N, be a measurement in n-dimensional space. The kernel matrix W contains the pairwise similarities of these points, computed with a Gaussian kernel on the Euclidean distance:

W_ij = exp(-||x_i - x_j||^2 / ε).

This is the most computationally intensive step because each point is compared to every other point.
Determining ε is a problem in itself. The chosen estimate is the median of the pairwise distances between the points (Schclar et al., 2010). Depending on the problem, changing this parameter might give more meaningful results.
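As a concrete illustration, the kernel construction can be sketched as follows. This is a minimal NumPy sketch, assuming the convention of a squared Euclidean distance in the exponent and ε taken as the median of the squared off-diagonal distances; the function name is illustrative, not from the original study.

```python
import numpy as np

def gaussian_kernel(X, eps=None):
    """Pairwise Gaussian kernel matrix W for the rows of X."""
    # Squared Euclidean distances between all pairs of rows of X.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against small negative round-off
    if eps is None:
        # Median of the off-diagonal squared distances (assumed choice).
        off = ~np.eye(len(X), dtype=bool)
        eps = float(np.median(d2[off]))
    return np.exp(-d2 / eps), eps
```

The returned matrix is symmetric with ones on the diagonal, as expected of a Gaussian kernel.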
The diagonal matrix D, with D_ii = Σ_{j=1}^{N} W_ij, has the degree of each point on its diagonal; the other elements are 0. The degree of a point is the sum of the weights connecting it to the other points, which equals the corresponding row sum of the kernel matrix.
The rows are normalized by these sums, and the result can be understood as transition probabilities between the points. These probabilities are collected in the matrix

P = D^{-1} W.
However, subsequent calculations on P become easier if a similarity transformation symmetrizes the matrix:

P̃ = D^{1/2} P D^{-1/2}.

These last two steps can be combined; substituting P = D^{-1} W yields

P̃ = D^{-1/2} W D^{-1/2}.

Such a normal matrix can be decomposed as

P̃ = U Λ U^T.

This decomposition is done using singular value decomposition (SVD). The columns of the matrix U contain the eigenvectors u_k of P̃, and the diagonal of Λ contains the corresponding eigenvalues. However, the real interest is in the eigenvectors of the transition matrix P. The eigenvalues of P are the same, but its eigenvectors are obtained from

V = D^{-1/2} U.

Recall that the eigenvalues λ_k are on the diagonal of Λ. The eigenvectors v_k are the columns of V. An original data point x_i has a corresponding value on the ith row of each eigenvector. For example, v_2(x_236) signifies the second eigenvector and its 236th row, corresponding to the 236th sample x_236 of the original dataset.
The diffusion map itself is a function of the form Ψ : R^n → R^d, where d ≪ n. Multiplying the eigenvectors by the eigenvalues gives the diffusion coordinates of the training points in the corresponding matrix:

Ψ = V Λ.

The first eigenvector is constant, so only the subsequent eigenvectors and eigenvalues are used. This yields the following function that maps an original data point to the lower-dimensional space:

Ψ(x_i) = (λ_2 v_2(x_i), λ_3 v_3(x_i), . . . , λ_{d+1} v_{d+1}(x_i)).
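Under the same notation, the training-phase computation might be sketched like this. It assumes a precomputed symmetric kernel matrix W, uses NumPy's SVD for the decomposition of P̃ as the text suggests, and the function and variable names are illustrative.

```python
import numpy as np

def train_diffusion_map(W, d=2):
    """Diffusion coordinates from a symmetric kernel matrix W."""
    deg = W.sum(axis=1)                        # degrees D_ii (row sums of W)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetrized matrix P~ = D^{-1/2} W D^{-1/2}.
    P_sym = W * np.outer(d_inv_sqrt, d_inv_sqrt)
    U, lam, _ = np.linalg.svd(P_sym)           # P~ = U Λ U^T via SVD
    V = d_inv_sqrt[:, None] * U                # eigenvectors of P: V = D^{-1/2} U
    # Skip the constant first eigenvector; scale the next d by the eigenvalues.
    Psi = V[:, 1:d + 1] * lam[1:d + 1]
    return Psi, V, lam
```

The largest eigenvalue of the row-stochastic P is 1, and the columns of V satisfy P v_k = λ_k v_k, which can serve as a quick sanity check of the implementation.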
It has been shown that the diffusion distance in the original space equals the Euclidean distance in the diffusion space (Coifman and Lafon, 2006a). Thus, distance measurements in the diffusion space are meaningful and can be used in further analysis in this lower-dimensional space.
Later analysis uses only the first few diffusion coordinates. The fast decay of the eigenvalues makes most diffusion coordinates small compared to the first few, so a reconstruction of P that uses only the first coordinates does not differ much from the full reconstruction. These coordinates capture most of the differences between the data points (Coifman and Lafon, 2006a; Nadler et al., 2006).

Extension of new measurements
New measurements that are not part of the training set are extended to the model with the Nyström method (Fowlkes et al., 2004; Belongie et al., 2002). Only the features selected during training are needed, and the new measurements are normalized in the same way as the training data.
Let a new data point be y_j ∈ R^n. The distances between the new points and each training point are collected in a matrix W̃, using the same ε as in the training phase:

W̃_ji = exp(-||y_j - x_i||^2 / ε).

The diagonal matrix D̃, with D̃_jj = Σ_{i=1}^{N} W̃_ji, contains the row sums of W̃. Now the transition probability matrix B can be formed:

B = D̃^{-1} W̃.
The following matrix multiplication produces the extended eigenvectors for the new points; the eigenvectors V and eigenvalues Λ are the same as in training:

Ṽ = B V Λ^{-1}.

These new eigenvectors extend the new points to the diffusion coordinates:

Ψ̃ = Ṽ Λ.

The last two steps can be combined into

Ψ̃ = B V.

The matrix Ψ̃ now contains the extended eigenvectors for the new points y_j in its columns.
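A sketch of the combined extension step Ψ̃ = BV, assuming the same ε and the training outputs V and Ψ from the previous section. The function name is illustrative.

```python
import numpy as np

def nystrom_extend(Y, X, V, eps, d=2):
    """Extend new points Y (rows) to the diffusion coordinates
    of a model trained on points X, using eigenvectors V."""
    # Kernel between new points (rows) and training points (columns).
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W_new = np.exp(-d2 / eps)
    B = W_new / W_new.sum(axis=1, keepdims=True)   # transition probabilities
    # Combined form Psi~ = B V, skipping the constant first eigenvector.
    return B @ V[:, 1:d + 1]
```

A useful property of this extension is that re-extending a training point reproduces its training diffusion coordinates, since its row of B equals the corresponding row of P.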

Classification of new measurements
The low-dimensional representation of the data facilitates clustering. The clustering approach used here is spectral clustering, which reveals the normal and anomalous areas (von Luxburg, 2007; Kannan et al., 2004). Any other clustering method, for example k-means, can be used if it provides better results (Ng et al., 2001; Meila and Shi, 2000; Shi and Malik, 2000). The used algorithm simply tests whether the sample lies to the left or to the right of 0 on the dimension corresponding to the 2nd eigenvector. This provides a classifier that discriminates the two states: working and broken.

Health management index and warning levels
The 2nd eigenvector, corresponding to the coordinate Ψ_1, is the health management index. Working machines are above 0 and failing machines below it. For more warning levels, different thresholds θ can be applied to the coordinate Ψ_1. There are three warning levels: note, warning and damage. These describe the severity of the problem in the gear.
Note means that there is an unusual measurement in the data, but the gear is still in an operational state. The sample is not inside the good cluster but is still closer to it than to the bad one.
The warning level is at θ_warning = 0, which marks the border between the good and bad clusters. The sample is closer to the bad cluster, which can be seen as a predictive sign that the gear has problems. If the bad cluster extends beyond 0, the midpoint between the two clusters can be used instead.
The damage level is at θ_damage = max{Ψ_{1,bad}}. This level means that the sample lies within the bad cluster.
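The three levels could be implemented as a simple threshold cascade on the index value. The sketch below assumes the note threshold is the lower edge of the good training cluster, which is one plausible reading; the exact note threshold is not fixed in the text.

```python
def alert_level(psi1, theta_note, theta_damage, theta_warning=0.0):
    """Severity of one health-index value psi1 (2nd-eigenvector coordinate).

    theta_note    : lower edge of the good cluster (assumed choice)
    theta_warning : border between the clusters, 0 by default
    theta_damage  : max of the bad cluster, i.e. its upper edge
    """
    if psi1 >= theta_note:
        return "ok"        # inside the good cluster
    if psi1 >= theta_warning:
        return "note"      # outside good, but still on the good side of 0
    if psi1 > theta_damage:
        return "warning"   # on the bad side, not yet inside the bad cluster
    return "damage"        # within the bad cluster
```

In practice the thresholds would be read off the training diffusion coordinates of the labeled good and bad gears.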

Results
This study uses a dataset consisting of gear monitoring recordings of multiple features. These data were collected during normal operations from windmill gears in nine different windmill parks, so the operating conditions naturally vary among the parks. The dataset consists of recordings of 18 good and 20 bad machines labeled by domain specialists. Although the gears come from locations with varying operational environments, each gear is of the same type and includes the same features. Two of the gears are discarded because they contain empty data due to instrument failures. The dataset is divided into training and testing sets. The training set includes five good and five bad gears; the testing set includes the rest.
The sampling rate of the measurements is one averaged measurement every half an hour. Each windmill gear was monitored for months, which brings the sample count for each gear to about 1500 per month.

Preprocessing
The data are sampled at an approximate rate of one sample per 30 minutes, each sample representing the average of the measurements during that time. The recordings last for months. Linear interpolation is used for periods when no data were available. These data form the samples × features matrix. Instrument failures produce unrealistic or missing measurements. Because it is difficult to compare such measurements to the rest, measurements containing missing values are discarded, although this may lose some usable information.

RPM filtering
Samples whose rotations per minute (RPM) value is too small are filtered out, because only higher values represent the actual working state of a gear. Lower values are associated with the idle state, and those measurements are not interesting when monitoring actual working gears. The RPM values are clustered into two clusters using k-means, resulting in clusters of samples named RPM_cluster1 and RPM_cluster2. The threshold

threshold_RPM = max{min{RPM_cluster1}, min{RPM_cluster2}}

is calculated, and all samples whose RPM value is below this threshold are removed.
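The RPM filter can be sketched with a small two-cluster Lloyd iteration, used here as a NumPy-only stand-in for k-means. It assumes the RPM values actually form two groups (idle and working); the function name is illustrative.

```python
import numpy as np

def rpm_threshold(rpm, n_iter=100):
    """Two-cluster k-means on 1-D RPM values; returns
    max(min(cluster1), min(cluster2)), i.e. the lower edge
    of the higher (working-state) cluster."""
    rpm = np.asarray(rpm, dtype=float)
    centers = np.array([rpm.min(), rpm.max()])  # initialize at the extremes
    for _ in range(n_iter):
        # Assign each sample to the nearest of the two centers.
        labels = np.abs(rpm[:, None] - centers[None, :]).argmin(axis=1)
        new = np.array([rpm[labels == k].mean() for k in (0, 1)])
        if np.allclose(new, centers):
            break
        centers = new
    # Max of the two cluster minima = lower edge of the working cluster.
    return max(rpm[labels == 0].min(), rpm[labels == 1].min())
```

Samples with RPM below the returned threshold would then be dropped before the diffusion map is trained.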

Data scaling
All the data are normalized with a logarithm. Other normalizations, such as dividing by the maximum or by the norm, do not give as good a separation for this dataset.

Feature selection
A preliminary feature selection is done in the original feature space. One feature is left out at a time, and the average Mahalanobis distance between the good and bad machines is computed without it; this distance shows how much the omitted feature describes the difference. The features whose removal yields the smallest averaged Mahalanobis distances are the most useful: a small remaining distance reveals that leaving the feature out harms the separation of good and bad, so including the feature separates the groups well. The reduced feature set is formed by including the most separating features.
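A leave-one-out ranking along these lines might look as follows. This sketch measures the squared Mahalanobis distance between the group means under the covariance of the combined data, which is one reading of the "average" distance in the text; the function name is illustrative.

```python
import numpy as np

def feature_scores(good, bad):
    """Squared Mahalanobis distance between the good and bad group
    means, recomputed with each feature left out in turn.
    Lower score = dropping that feature hurt the separation more,
    i.e. the feature is more useful."""
    n_feat = good.shape[1]
    scores = []
    for f in range(n_feat):
        keep = [c for c in range(n_feat) if c != f]
        g, b = good[:, keep], bad[:, keep]
        cov = np.cov(np.vstack([g, b]).T)   # covariance of the combined data
        diff = g.mean(axis=0) - b.mean(axis=0)
        scores.append(float(diff @ np.linalg.solve(cov, diff)))
    return np.array(scores)
```

The features would then be ranked in ascending order of score and the top of the ranking retained.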
There are 136 features describing oil pressure, oil particles, RPM and the frequencies of multiple vibration sensors in three dimensions. The initial feature selection reduces their number to 20: one feature is the oil pressure, five are from the planetary stage of the gear, seven from the intermediate stage, six from the high-speed stage, and one is the RPM. These features separate the two groups more clearly from each other.

Classification results
Five good and five bad gears were used in training. The original data has 136 features, 20 of which remain after the preliminary feature selection. The diffusion map and the Nyström extension were performed using these 20 most separating features. All the gears, including the training gears, were then tested as new incoming data. Table 1 shows that each of the failing test gears had alerts. Table 2 shows that no working gear had warnings, although some of them had notes. The alert percentage tells what percentage of the samples during the monitoring time gave alerts. The letter combinations AH, CA, ET, FE, LS, MB, OO, PH and QU refer to the nine windmill parks where the data were collected, and the numbers following the letters identify the individual windmills. These tables show that all failing gear units had alerts during the monitoring time and none of the working gear units had alerts. This means that there were no false alarms for the working gears.
The following figures illustrate the behavior of failing gears. The normal state does not produce figures of interest because there are no alerts. Figure 2 shows how the newly incoming data are situated in the low-dimensional space; this figure is mainly illustrative. Figure 3 shows the alert index, while Figure 4 indicates the accumulating number of alerts. The images cover the length of the measurement, from February to January. These two indicators should be of use to technicians monitoring the gears. The alerts themselves are in Figure 5. Figures 6, 7, 8 and 9 show the same measurements for another gear. It breaks down more slowly, but the high number of notes can be seen. The misclassification of the good machine FE01 as bad is probably caused by the data interpolation. Further domain analysis revealed that there actually had been a small problem with the gear, which raises the question of whether it is labeled correctly. The misclassification of the bad machine ET104 as good can be explained. Firstly, there are no training gears from this location, and ET104 lies too close to the good gears in the diffusion space. Secondly, domain analysis reveals that this gear has only a small problem. Better training data and more detailed labeling could prevent this kind of misclassification. The vastly different operating environment and gear behavior in the ET1 park might also contribute.

Conclusion
The goal of this study is to estimate the usefulness of dimensionality reduction methods in windmill gear fault detection. This goal is met, since almost all the gears are classified correctly according to their ground-truth labels. The state of failing machines can be monitored using the note, warning and damage indicators. This shows that the training is successful and separates the good gears from the bad. More importantly, measurements from new gears can be extended into the model.
This study used the diffusion map as the dimensionality reduction method. Linear methods, such as principal component analysis, could also be used. However, the diffusion map has the additional kernel parameter ε to control the compactness of the clusters. The choice of the dimensionality reduction method is up to the users and their specific needs. The training can be done within an hour. Memory usage is high during training because it depends on the sample size. With such a low sampling rate (one sample per 30 minutes) there is no problem with computational time when adding new data points.
The challenges posed by spectral methods in general need some addressing. The proposed method works because, after slight filtering, the good and bad gears are separable in the lower dimensions. However, the high computational cost could be a problem in a more real-time system. Moreover, the classification of a gear time series is itself an ambiguous concept. Nevertheless, this study shows that gears in normal condition and gears that are going to break down behave differently and can be separated from each other.