Tree Species Identification Using 3D Spectral Data and 3D Convolutional Neural Network

In this study we apply a 3D convolutional neural network (CNN) to tree species identification. The study covers the three most common Finnish tree species. It uses a relatively large high-resolution spectral data set, which also contains a digital surface model for the trees. The data was gathered using an unmanned aerial vehicle, a framing hyperspectral imager and a regular RGB camera. The classification results are promising, with an overall accuracy of 96.2 % on the validation data set.


INTRODUCTION
This study is a continuation of [1], where the individual tree detection and classification pipeline for hyperspectral and point cloud data is described in detail. We are interested to see whether deep learning methods could improve or simplify the data processing chain for identifying the species of individual trees.
There is plenty of research on tree species identification, but it mainly concentrates on large-scale remote sensing, which uses forest stand and plot level data. For example, in Scandinavia a combination of airborne laser scanning and aerial images is used in forest inventory [2]. There are fewer studies and applications of tree species identification from unmanned aerial vehicles (UAV) using hyperspectral sensors. Where hyperspectral data has been used for tree species identification, the data gathering platform has been a manned aircraft or a satellite.
As in [1], these remote sensing studies use fairly traditional feature extraction and selection methods before classification. Deep learning methods have dramatically improved the performance of pattern recognition [3]. In particular, deep convolutional neural networks (CNNs) have provided breakthroughs in image, video and audio processing. It therefore seems plausible that they could handle hyperspectral data combined with 3D data as well. A growing number of studies apply CNNs and 3D CNNs to hyperspectral imagery [4,5].
This research has been co-financed by the Finnish Funding Agency for Innovation Tekes (grants 2208/31/2013 and 1711/31/2016).
In this paper we first test the performance of a 3D CNN for tree species classification. Neural networks tend to be black boxes that do not reveal how they arrive at their results. However, during classification we can calculate saliency maps, which give hints about which parts of the input data are relevant for the CNN [6].
This paper has the following structure. First, in Section 2 we describe the data set, its acquisition and preprocessing. Then the structure and functionality of the 3D CNN are described. In Section 3 the results are presented, and Section 4 contains the conclusions.

Data gathering and preprocessing
The research data is the same as reported in [1]. The remote sensing data was captured in the Vesijako research forest area in the municipality of Padasjoki in southern Finland (approximately 61°24'N and 25°02'E). The area has been used for forestry research by the Natural Resources Institute of Finland and contains experimental plots with different research setups. All trees with a diameter at breast height of at least 50 mm were measured, and metrics such as tree species, diameter, height and volume were recorded or estimated. The locations of these trees were collected with GPS.
In total, 4142 trees were selected for further study. The data set contained the three most common species of Finnish forests: Scots pine (Pinus sylvestris, 2821 samples), Norway spruce (Picea abies, 742 samples) and silver birch (Betula pendula, 579 samples). The selected trees were compared to aerial orthoimage mosaics to ensure that the GPS coordinates were in the centres of the treetops.
The remote sensing data was a combination of two data modalities captured by a UAV remote sensing system that belongs to the Finnish Geospatial Research Institute. The system consists of a Tarot 960 hexacopter and a Pixhawk autopilot. It can carry a payload of at most 3 kg, and its average flying time is 30 minutes. As a payload, we had a tunable Fabry-Pérot interferometer based spectral imager (FPI) and an ordinary RGB camera, the Samsung NX1000 (RGB). The flying height above ground level varied between 83 and 94 meters.
The FPI imager captures raw data, which is processed to radiance based on the radiometric laboratory calibration [7]. The geometric imaging model was then determined. The model includes both the interior and exterior orientations of the images. The digital surface model is calculated by dense image matching. Because of slight variations between bands in the FPI camera, we had to register the spectral bands of the FPI images. To make the data cubes and subsequent mosaics radiometrically homogeneous, the radiometric imaging model has to be determined [8,9]. The hyperspectral image mosaic is calculated after the radiometric model is applied to each cube. The detailed radiometric and geometric processing of the data set is explained in [1]. Finally, the spectral mosaics with 33 bands and the digital surface model (DSM), both with a 10 cm GSD, are created.
For the tree species identification, 4 × 4 meter windows surrounding each treetop were extracted. The windows contained both the DSMs as rasters and the spectral cubes. For each treetop, the extracted DSM was scaled by the minimum value of the whole DSM. The DSM and spectral cube for each treetop were concatenated along the spectral axis into unified data cubes (41 × 41 × 34). In Finland a nationwide laser-scanned ground surface elevation model is freely available. Thus, a canopy surface model could have been calculated, but it is not actually needed, because we only use the height of the treetops. Figure 1 illustrates the average treetops for each species. We can see that there are slight differences between the shapes. Pine treetops are quite symmetric. Spruce treetops are more elliptical and aligned along a north-west to south-east axis. Birch is more irregular, but its leaves and branches lean towards the south, where the Sun shines.
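The window extraction and concatenation described above can be sketched roughly as follows. This is a minimal illustration on synthetic arrays, not the authors' code: the function name `extract_tree_cube`, the array sizes, and the reading of "scaled by the minimum" as a subtraction are all assumptions.

```python
import numpy as np

GSD_M = 0.10          # ground sampling distance from the text (10 cm)
HALF_WINDOW_PX = 20   # 2 m on each side of the treetop -> 41 x 41 pixels


def extract_tree_cube(mosaic, dsm, row, col, half=HALF_WINDOW_PX):
    """Cut a window around (row, col) and stack the DSM as an extra band."""
    rows = slice(row - half, row + half + 1)
    cols = slice(col - half, col + half + 1)
    spectral = mosaic[rows, cols, :]                 # (41, 41, 33)
    # Assumption: "scaled by the minimum" is read here as subtracting the
    # minimum of the whole DSM, so that heights start from zero.
    height = dsm[rows, cols] - dsm.min()             # (41, 41)
    return np.concatenate([spectral, height[..., np.newaxis]], axis=-1)


# Synthetic stand-ins for the real mosaic and DSM (hypothetical sizes).
rng = np.random.default_rng(0)
mosaic = rng.random((200, 200, 33), dtype=np.float32)
dsm = 100.0 + 20.0 * rng.random((200, 200), dtype=np.float32)

cube = extract_tree_cube(mosaic, dsm, row=100, col=100)
print(cube.shape)  # (41, 41, 34)
```

With a 10 cm GSD, a 4 m window centred on a treetop pixel spans 41 pixels per side, and 33 spectral bands plus one height band give the 34 channels stated in the text.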

Convolutional Neural Network
CNNs were originally presented by LeCun and Bengio [10]. The idea was to tackle the feature extraction and selection problem in fully connected feed-forward networks. The network uses convolution matrices. Traditional neural network layers are usually based on consecutive dense (fully connected) neurons, whereas a convolutional neural network contains at least one convolution operation. We applied a quite simple structure to our CNN, using four types of layers: 3D convolutional, pooling, dropout and fully connected layers. The structure of our network is presented in Table 1.
In general, convolutional layers have trainable filters, which use convolution operations to extract features. In our implementation, the convolution layers use an activation function called the rectified linear unit (ReLU). ReLU has the advantages of handling non-linear relations efficiently and suffering less from vanishing gradients during network optimisation than other popular activation functions [11]. Pooling layers, which usually follow convolutional layers, are non-linear downsampling functions that reduce the dimensions of the input data. A dropout layer is a regularization method that reduces overfitting by introducing noise to the network. A flatten layer reshapes the data into a one-dimensional vector. A dense layer is a fully connected layer consisting of parallel neurons, each connected to all outputs of the previous layer. The weights of the connections and the activation functions determine which features correlate with the different tree species. The last dense layer is activated with the softmax function, whose output is the final classification.
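To make the roles of these building blocks concrete, the following NumPy sketch implements ReLU, a non-overlapping 3D pooling step and softmax on toy data. It is only an illustration: max pooling with a window of 2 is an assumption (the text does not state the pooling type here), and none of the values correspond to the network in Table 1.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: negative activations are set to zero."""
    return np.maximum(x, 0.0)


def max_pool_3d(x, size=2):
    """Non-overlapping 3D max pooling over a (D, H, W) volume."""
    d, h, w = (s // size for s in x.shape)
    x = x[: d * size, : h * size, : w * size]
    return x.reshape(d, size, h, size, w, size).max(axis=(1, 3, 5))


def softmax(z):
    """Turn class scores into probabilities that sum to one."""
    e = np.exp(z - z.max())
    return e / e.sum()


# Toy feature volume and toy class scores (hypothetical values).
volume = np.arange(-32, 32, dtype=float).reshape(4, 4, 4)
pooled = max_pool_3d(relu(volume))
probs = softmax(np.array([2.0, 0.5, -1.0]))
print(pooled.shape, probs.argmax())
```

Pooling halves each dimension of the toy volume, and the class with the largest score receives the largest softmax probability.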
If the amount of data is limited, meaning that the number of training samples is low, data augmentation can be applied. This means that new training data is generated from the existing samples. In this study we increased our training data fivefold by using simple flipping and rotation operations. The selected training data was flipped both horizontally and vertically, and it was also rotated 90 degrees to the left and to the right.
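A minimal NumPy sketch of such a fivefold augmentation is given below. The function name `augment_fivefold` and the synthetic cube are illustrative assumptions; the transformations follow the description above (two flips and two 90 degree rotations, plus the original).

```python
import numpy as np


def augment_fivefold(cube):
    """Return the original cube plus four flipped or rotated copies.

    `cube` has shape (rows, cols, bands); only the two spatial axes
    are transformed, so the band order is left untouched.
    """
    return np.stack([
        cube,
        np.flip(cube, axis=1),               # horizontal flip
        np.flip(cube, axis=0),               # vertical flip
        np.rot90(cube, k=1, axes=(0, 1)),    # 90 degrees counter-clockwise
        np.rot90(cube, k=-1, axes=(0, 1)),   # 90 degrees clockwise
    ])


cube = np.random.default_rng(0).random((41, 41, 34))  # synthetic treetop
augmented = augment_fivefold(cube)
print(augmented.shape)  # (5, 41, 41, 34)
```

Because the windows are square, the rotated copies keep the 41 × 41 × 34 shape and can be stacked with the originals.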
In machine learning, structures like neural networks are so-called "black box" solutions: we do not have a clear view of how the data is classified. It is reasonable to ask whether the classification is based on real features of the target object or on something secondary, such as the ground type in tree species recognition. Fortunately, there are methods for seeing which parts of the classified data the network puts weight on. It is possible to calculate the gradient through the layers from the output back to the input. The result is an image in which areas with higher values contribute most to the classification result. These images are called saliency maps.
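The idea of a gradient-based saliency map can be illustrated without a deep learning framework. The sketch below swaps the CNN for a toy linear softmax classifier, purely so that the gradient of the class probability with respect to the input can be written by hand; the weights and the input cube are random and hypothetical.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def saliency_map(weights, bias, cube, target):
    """Absolute gradient of the target-class probability w.r.t. the input.

    The model is a toy linear softmax classifier (an assumption made so
    that the gradient has a closed form); for it,
    d p_c / d x = p_c * (W_c - sum_k p_k W_k).
    """
    x = cube.ravel()
    p = softmax(weights @ x + bias)
    grad = p[target] * (weights[target] - p @ weights)
    return np.abs(grad).reshape(cube.shape)


rng = np.random.default_rng(0)
cube = rng.random((41, 41, 34))                    # one synthetic treetop
weights = rng.normal(scale=0.01, size=(3, cube.size))
bias = np.zeros(3)

saliency = saliency_map(weights, bias, cube, target=0)
spatial = saliency.mean(axis=2)        # average over bands, per pixel
spectral = saliency.mean(axis=(0, 1))  # average over pixels, per band
print(spatial.shape, spectral.shape)   # (41, 41) (34,)
```

Averaging the absolute gradient over the band axis gives a spatial map, and averaging over the pixel axes gives one value per band; for a real CNN the gradient would be obtained by backpropagation instead of a closed-form expression.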
Stochastic gradient descent was used to tune the weights between layers. We used categorical cross entropy as the loss function, which calculates the cross entropy between the probability distributions of the categories. The primary metric for model evaluation was accuracy,

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP is a true positive, TN a true negative, FP a false positive and FN a false negative classification result. The CNNs were trained on the IBM PowerAI platform, which includes two Tesla V100-SXM2 16 GB GPU units. TensorFlow was used as the computational backend [12]. All coding for the machine learning phase was done using Python 3.6 and the Keras library [13]. Saliency maps were calculated using the Keras-vis library [14].
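As a small worked example of the loss and the metric, the following NumPy sketch computes categorical cross entropy and accuracy for four hypothetical predictions. The helper names and the numbers are illustrative only and do not come from the study.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


def accuracy(y_true, y_pred):
    """Share of samples whose most probable class matches the label."""
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))


# Hypothetical predictions for four trees (pine, spruce, birch columns).
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.4, 0.5, 0.1]])

print(categorical_cross_entropy(y_true, y_pred), accuracy(y_true, y_pred))
```

The last tree is misclassified, so three of four predictions are correct, while the loss also penalises how little probability was assigned to the true class.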

RESULTS
Altogether 3311 trees were randomly selected for training the 3D CNN. After data augmentation there were 16555 samples. Training was performed with a batch size of 128 for 100 epochs. Training took about two and a half hours (approx. 88 seconds/epoch). The results were validated with 831 samples that were not included in the training set.
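The sample counts and the training time quoted above can be checked with a few lines of NumPy. The random split below is only a sketch of one way to draw such a split; the seed and the use of `permutation` are assumptions, not the authors' procedure.

```python
import numpy as np

N_TREES, N_TRAIN = 4142, 3311   # counts reported in the text

rng = np.random.default_rng(42)  # the seed is an arbitrary choice
order = rng.permutation(N_TREES)
train_idx, val_idx = order[:N_TRAIN], order[N_TRAIN:]

n_augmented = 5 * len(train_idx)         # original plus four copies
hours = 100 * 88 / 3600                  # 100 epochs at about 88 s each
print(len(val_idx), n_augmented, round(hours, 2))  # 831 16555 2.44
```

The validation indices are disjoint from the training indices by construction, which matches the statement that validation samples were not part of the training set.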
Figure 3 shows that the accuracy of the trained model is relatively high. It seems that we can distinguish the tree species from each other with fairly high confidence. The overall accuracy for the classification of the validation set was 96.2 %, which is higher than the earlier results achieved in [1]. The producer's accuracies for the tree species were 96.2 % (pine), 86.6 % (spruce) and 98.2 % (birch). The corresponding user's accuracies were 96.3 %, 83.8 % and 95.7 %. Figure 4 presents the average saliency maps in the spatial domain over all input bands of the validation data. It seems that most of the important features relate to the data surrounding the treetop. This is shown more clearly in Figure 5, where the maps of Figure 4 are rendered over the average 3D treetops of the validation set. Thus, we can be quite confident that, at least in the spatial domain, the shape of the treetop is a relevant feature in classification.
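For readers less familiar with the remote sensing terminology, the sketch below shows how overall, producer's and user's accuracies are derived from a confusion matrix. The counts are hypothetical and are not the confusion matrix of this study; the row and column convention is stated in the docstring.

```python
import numpy as np


def accuracy_report(confusion):
    """Overall, producer's and user's accuracy from a confusion matrix.

    Rows are reference (field-measured) classes, columns are predictions.
    """
    correct = np.diag(confusion)
    overall = correct.sum() / confusion.sum()
    producers = correct / confusion.sum(axis=1)   # per reference class
    users = correct / confusion.sum(axis=0)       # per predicted class
    return overall, producers, users


# Hypothetical counts, not the confusion matrix of this study.
confusion = np.array([[90, 8, 2],
                      [5, 40, 5],
                      [1, 2, 47]])

overall, producers, users = accuracy_report(confusion)
print(round(overall, 3), producers.round(3), users.round(3))
```

Producer's accuracy answers how many reference trees of a species were found, while user's accuracy answers how many trees labelled as a species really belong to it.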
In the spectral domain the most characteristic features seem to be located at wavelengths between 600 and 720 nm. Figure 6 presents the average saliency in each spectral band. It can be seen that there are differences between the tree species. For example, birch has lower saliency at 560 nm and higher saliency at 700 nm than the coniferous trees.
If we consider individual trees, the classification seems to work quite well. Figure 7 shows one tree of each species from the validation set. It can be seen that, for example, the pine in this case does not have a very clear treetop, but the classifier is able to find one, and the saliency map seems to confirm the result.

CONCLUSIONS
In this paper we demonstrate how 3D hyperspectral data can be analysed using 3D convolutional neural networks. We conclude that even with a quite simple 3D CNN it is possible to create a network that classifies single trees well based on their shape and spectral features.
In classical machine learning, one of the most time-consuming tasks for the data analyst has been feature extraction and selection. With a convolutional neural network this phase is automated. After preprocessing, little remains to be done in order to use a trained network. Network training itself is time-consuming, but a network trained beforehand can deliver results almost in real time. In our case training took two and a half hours.
Compared to the earlier work [1], we used all of the captured test areas. In the original paper one area was left out because of the poor quality of the image block. Based on that, our results seem to show that the trained 3D CNN is more robust as a classifier than the methods used in the previous study.
It is obvious that more studies are needed. The network structure used here is one of the simplest. With more sophisticated structures it might be possible to improve the learning results. One thing to test in the future is how general the trained model actually is: given another data set, can we obtain similar classification results? We used quite a limited amount of data augmentation. Even though overfitting wasn't [...] One potential research question is how many bands and what GSD are needed to reach similar results. Our next steps include adding more augmented data to the training, such as scaling, adding noise, changing lightness and adding more rotations, to see whether we could also detect trees at lower resolution.
The data set has more parameters for single trees (height, estimated volume, etc.), and there are also 300 fixed-radius (9 m) sample plots, which have been used for area-based forest inventory. In the near future we will also test how well the 3D CNN approach is able to estimate these parameters.
Our consortium has an ongoing research project whose aim is to produce real-time processing for the DSM and hyperspectral mosaics. Combined with a pre-trained CNN classifier, this could be a significant tool for forest tree identification and parameter estimation without spending time on massive preprocessing.

Figure 2 shows how the spectral distribution varies across wavelengths for each tree species. The line in the figure represents the average spectrum of the treetops. Quite obvious differences can be found between birches and the Nordic coniferous trees. Birches have stronger reflectance in the green and infrared regions, and their spectrum is steeper in the red edge area.

Fig. 2. Histogram spectra of each tree species. The black line is the average spectrum.

Fig. 4. Average saliency maps in the spatial domain over all input bands of the validation data for each tree species. A brighter pixel indicates that the location is probably more meaningful in classification.

Fig. 5. The saliency maps of Fig. 4 rendered over the average 3D treetops of the validation set. It can be seen that the maps surround the treetops quite well.

Fig. 6. Average saliencies in the spectral domain for each tree species. A higher value indicates that the band is probably more meaningful in classification.

Table 1. Structure of our experimental CNN.