Instance-Based Multi-Label Classification via Multi-Target Distance Regression

Interest in multi-target regression and multi-label classification techniques and their applications has been increasing lately. Here, we use a distance-based supervised method, the minimal learning machine (MLM), as a base model for multi-label classification. We also propose and test a hybridization of unsupervised and supervised techniques, where prototype-based clustering is used to reduce both the training time and the overall model complexity. In computational experiments, the obtained models showed competitive or improved quality compared to state-of-the-art techniques.


Introduction
Applications of supervised learning in which models are constructed to predict multiple target variables at once are rapidly gaining popularity. This research field within machine learning is referred to as Multi-Output Learning [1], which can be divided into two main categories: i) Multi-Label Classification (MLC); and ii) Multi-Target Regression (MTR). In MLC, an instance is associated with multiple labels, in contrast to conventional Single-Label Classification (SLC), where a single label is determined. There exists a plethora of methods for MLC, which can be divided into two main groups [2]: i) algorithm adaptation; and ii) problem transformation. In general, the distinction is based on whether the classifier or the MLC problem itself is modified. In algorithm adaptation, a specific classification method is tailored so that it can be applied to MLC directly. Problem transformation methods modify the multi-label problem so that it is suitable for any single-label classifier.
A supervised distance-based method, the Minimal Learning Machine (MLM) [3], has shown promising performance in many experiments [4,5,6,7]. Recently, the MLM and the Extreme MLM (EMLM) [7] were identified as having appealing characteristics for MTR with problem transformation [8]. It has been demonstrated that the MLM avoids over-fitting for high-dimensional input spaces in classification [7] and regression [9,4]. Therefore, tuning the MLM's only hyperparameter, the number of reference points, is mostly a matter of balancing model complexity against generalization capability in a straightforward manner: increasing the model complexity (the number of reference points) increases the generalization accuracy. However, increasing the accuracy of the model in this way comes at a cost, since the computational complexity of the training phase grows quadratically with the number of reference points and linearly with the number of observations [3].
In [10], it was shown that applying clustering to MLC can be useful, especially for a large number of labels. In terms of high-dimensional output spaces, large-scale MLC problems arise from application domains such as image annotation [10] and text categorization [11]. In this paper, our aim is to adapt the MLM to MLC and to reduce the complexity of training and of the resulting models by clustering the input data. First, a straightforward Multi-Label MLM formulation is introduced based on the Nearest Neighbour MLM (NN-MLM) [6]. Then, the Clustering-Based ML-MLM (CBML-MLM) is proposed by utilizing prototype-based clustering [12,13]. Note that this technique readily supports federated learning scenarios [14].

Multi-Label Minimal Learning Machines
Suppose we have $N$ input data points $X = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{M}$, and the corresponding 1-of-$L$ encoded output vectors $Y = \{y_i\}_{i=1}^{N}$, $y_i \in \{0, 1\}^{L}$. Suppose we have selected a subset of so-called reference points $R = \{r_k\}_{k=1}^{K}$ from $X$ and the corresponding subset of points $T = \{t_k\}_{k=1}^{K}$ from $Y$. The main idea of the MLM is to learn a linear regression between the distance matrices $D_x \in \mathbb{R}^{N \times K}$ and $D_y \in \mathbb{R}^{N \times K}$, where $D_{x(i,j)} = d(x_i, r_j)$ and $D_{y(i,j)} = d(y_i, t_j)$ for the Euclidean distance $d(\cdot, \cdot)$. The Multi-Target Distance Regression (MTDR) model is then formed by Ordinary Least Squares (OLS) [3]:

$$B = (D_x^{\top} D_x)^{-1} D_x^{\top} D_y, \tag{1}$$

where the $K \times K$ matrix $B$ contains the coefficients of the MTDR model. As can be seen from (1), the distance-based approach is especially beneficial for problems with high-dimensional feature spaces and a large number of classes, because $M$ and $L$ only affect the construction of the distance matrices but not the complexity of learning. Therefore, in an off-the-shelf scenario, distance-based regression can be more efficient and generalize better than deep learning models [5].
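As a minimal sketch of the distance regression described above, the following NumPy code builds the two distance matrices for a chosen set of reference points and solves the least-squares problem with `np.linalg.lstsq`. The helper names (`pairwise_distances`, `fit_mtdr`), the random toy data, and the choice of the first eight points as reference points are illustrative assumptions and not the authors' MATLAB implementation.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distances between rows of A (n x d) and rows of B (k x d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def fit_mtdr(X, Y, ref_idx):
    """Fit the distance regression D_y ~ D_x B by least squares.

    ref_idx selects the K reference points (rows of X and Y).
    """
    R, T = X[ref_idx], Y[ref_idx]
    D_x = pairwise_distances(X, R)   # N x K input-space distances
    D_y = pairwise_distances(Y, T)   # N x K output-space distances
    # lstsq returns a least-squares solution of D_x B = D_y
    B, *_ = np.linalg.lstsq(D_x, D_y, rcond=None)
    return B, R, T


# Toy data (assumption): N = 20 points, M = 5 features, L = 4 binary labels.
rng = np.random.default_rng(0)
X = rng.random((20, 5))
Y = (rng.random((20, 4)) < 0.4).astype(float)

B, R, T = fit_mtdr(X, Y, ref_idx=np.arange(8))   # K = 8 reference points
print(B.shape)
```

Note that the feature dimension and the number of labels enter only through `pairwise_distances`; the least-squares problem itself has size $N \times K$ regardless of $M$ and $L$.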
In the MLM prediction, for a given input $x$, the distances to the reference points $R$ are computed and the MTDR model is used to predict the output space distances $\delta = (\delta_1, \ldots, \delta_K)^{\top}$ to the reference points $T$. In [6], it was proven for SLC that assigning the class label of the nearest output space reference point as the prediction $y$, so that $y = t_q$ where $q = \arg\min_k \delta_k$, is an optimal solution to the multilateration problem [3]. Note that solving the multilateration problem for MLC by testing all label combinations would be very complex and time-consuming, because the number of different label combinations can be huge for hundreds or even thousands of labels.
Here, we extend the NN-MLM approach straightforwardly to MLC so that the predicted set of labels is assigned directly from the set of labels associated with the nearest predicted output space reference point. We assume that the set of labels associated with the predicted nearest neighbour is a reasonable approximate solution to the multilateration problem. Furthermore, because this kind of approach relies on assigning a set of labels to an instance directly from a reference point, using the full MLM ensures that all label combinations that occurred in the training data are contained within the possible predictions.
We will refer to this direct MLC algorithm adaptation as Multi-Label MLM (ML-MLM). In the categorization of MLC algorithms, the ML-MLM approach can be referred to as an instance-based multi-label classifier similar to the ML-kNN method [15], where the predicted set of labels is computed with the Maximum A Posteriori (MAP) method from the sets of labels related to the $k$ nearest neighbours in the input space. ML-MLM identifies the nearest neighbour via the distance regression model, while ML-kNN uses the input space directly.
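The ML-MLM prediction rule can be sketched as follows: fit the distance regression with all training points as reference points, predict the output space distances for a new input, and copy the label set of the reference point with the smallest predicted distance. The class name `MLMLM`, the helper `pairwise_distances`, and the random toy data are assumptions made for illustration only.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distances between rows of A (n x d) and rows of B (k x d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


class MLMLM:
    """Sketch of ML-MLM: full MLM with nearest-reference-point label assignment."""

    def fit(self, X, Y):
        self.R, self.T = X, Y                      # all points are reference points
        D_x = pairwise_distances(X, X)             # N x N input-space distances
        D_y = pairwise_distances(Y, Y)             # N x N output-space distances
        self.B, *_ = np.linalg.lstsq(D_x, D_y, rcond=None)
        return self

    def predict(self, X_new):
        # Predicted distances to the output-space reference points.
        delta = pairwise_distances(X_new, self.R) @ self.B
        q = np.argmin(delta, axis=1)               # nearest predicted reference point
        return self.T[q]                           # copy its whole label set


# Toy data (assumption): 30 points, 6 features, 5 binary labels.
rng = np.random.default_rng(1)
X = rng.random((30, 6))
Y = (rng.random((30, 5)) < 0.3).astype(float)

model = MLMLM().fit(X, Y)
train_pred = model.predict(X)
new_pred = model.predict(rng.random((3, 6)))
print(new_pred)
```

Every prediction is a label combination that occurs in the training data. With distinct training inputs, the full model interpolates the training distances up to numerical precision, so in this sketch the training inputs are mapped back to their own label sets.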
Since ML-MLM selects all the data points as reference points, the computational complexity of ML-MLM training is $O(N^3)$. To reduce this high training cost, we propose a novel Clustering-Based ML-MLM (CBML-MLM) approach with reduced time complexity. However, we still utilize all the data points as reference points to ensure that the full diversity of the label combinations is preserved. The training algorithm for the proposed method is given in Algorithm 1 and the prediction phase is given in Algorithm 2. The training requires a prototype-based clustering algorithm $f_c$ for partitioning the input space into local subsets. Prototype-based clustering methods such as K-means++ and K-spatialmedians++ [12] could be used. Both of these methods have linear time complexity and can be implemented in parallel for large-scale data sets [16,13].
In the training phase, $K_c$ local MTDR models are trained, where each cluster's points are selected as reference points. For each local MTDR model, the training data is formed from the union of the data points belonging to the nearest $K$ clusters. For $K = 1$, the MTDR training data is the same as the local set of reference points, and for $K = K_c$, the whole data set is utilized as training data. Note that the size of the final model is independent of the parameter $K$. Similar to ML-MLM, CBML-MLM spends most of the training time solving the OLS problem of Eq. (1). For $K = 1$, the time complexity for training each cluster-wise model is $O(N_k^3)$, where $N_k$ is the number of observations in cluster $k$. Therefore, the time complexity is $O(N_*^3)$, where $N_*$ denotes the number of observations in the largest cluster. For the other extreme, $K = K_c$, the cluster-wise training time complexity is $O(N_k^2 N)$, which implies that the overall training time complexity is $O(N_*^2 N)$. Note that if $N_* \ll N$, the CBML-MLM with $K = 1$ is clearly faster to train than the CBML-MLM with $K = K_c$. Moreover, if $N_* \ll N$, CBML-MLM is significantly faster to train than ML-MLM. In the prediction phase, the cluster prototypes are used for selecting the local MTDR model for classification.
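The training and prediction phases described above can be sketched in NumPy as follows. This is a hedged illustration under several assumptions: a plain Lloyd's k-means stands in for the prototype-based clustering algorithm $f_c$ (the experiments use K-spatialmedians++), the "nearest $K$ clusters" are taken to be those whose prototypes are closest to the cluster's own prototype, and all function names and the toy data are hypothetical.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distances between rows of A (n x d) and rows of B (k x d)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means; a stand-in for the clustering algorithm f_c."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = np.argmin(pairwise_distances(X, C), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):        # keep the old prototype if a cluster empties
                C[k] = X[assign == k].mean(axis=0)
    assign = np.argmin(pairwise_distances(X, C), axis=1)
    C = C[np.unique(assign)]               # drop prototypes without members
    return C, np.argmin(pairwise_distances(X, C), axis=1)


def fit_cbml_mlm(X, Y, n_clusters, n_near=1):
    """Train one local distance regression model per cluster.

    n_near plays the role of K: each local model is fitted on the points of
    the n_near clusters with the closest prototypes (an assumption).
    """
    C, assign = kmeans(X, n_clusters)
    proto_dist = pairwise_distances(C, C)
    local_models = []
    for k in range(len(C)):
        R_k, T_k = X[assign == k], Y[assign == k]   # cluster points as reference points
        near = np.argsort(proto_dist[k])[:n_near]   # includes cluster k itself
        train = np.isin(assign, near)               # union of the nearest clusters
        D_x = pairwise_distances(X[train], R_k)
        D_y = pairwise_distances(Y[train], T_k)
        B_k, *_ = np.linalg.lstsq(D_x, D_y, rcond=None)
        local_models.append((B_k, R_k, T_k))
    return C, local_models


def predict_cbml_mlm(model, x):
    """Select the local model by nearest prototype, then apply the ML-MLM rule."""
    C, local_models = model
    k_star = np.argmin(pairwise_distances(x[None, :], C)[0])
    B, R, T = local_models[k_star]
    delta = pairwise_distances(x[None, :], R)[0] @ B   # predicted output distances
    return T[np.argmin(delta)]


# Toy data (assumption): 60 points, 4 features, 6 binary labels.
rng = np.random.default_rng(2)
X = rng.random((60, 4))
Y = (rng.random((60, 6)) < 0.3).astype(float)

model = fit_cbml_mlm(X, Y, n_clusters=4, n_near=1)
y_hat = predict_cbml_mlm(model, X[0])
print(y_hat)
```

Each local least-squares problem has as many columns as the cluster has points, which is where the reduced training cost comes from; the stored model size (all `B_k`, `R_k`, `T_k`) does not depend on `n_near`.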

Results
We selected six MLC data sets from http://mulan.sourceforge.net and utilized the given division into training and testing data in order to be able to compare our results to the results given in [17]. For the selected data sets, the number of observations varied from 593 to 43907, the number of features from 72 to 1001, the number of labels from 6 to 374, and the label cardinality from 1.1 to 4.4. We scaled all the input features to the range $[0, 1]$. We selected ML-kNN as the main baseline and fixed $k = 10$, similar to many other works [10]. For CBML-MLM, we used K-spatialmedians++ [12] with 100 repetitions as the clustering method, and selected $K_c = 10$ and $K \in \{1, 10\}$. We did not perform any hyper-parameter optimization for CBML-MLM. In the experiments, the largest cluster size normalized by the number of training observations varied from 0.13 to 0.26. We used the existing MATLAB implementation of ML-kNN [15] given in http://www.lamda.nju.edu.cn/. The proposed methods were implemented in MATLAB as well. For the evaluation of the classifiers' performance, we used two uncorrelated and recommended measures from [18]: hamming loss and accuracy (or example-based accuracy).
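For reference, the two evaluation measures can be computed as in the sketch below. It uses the commonly used definitions, which are assumed here to match those in [18]: hamming loss is the fraction of label positions predicted incorrectly, and example-based accuracy is the mean ratio of the intersection to the union of the true and predicted label sets. Scoring an instance with empty true and predicted sets as 1 is a convention chosen for this sketch, and the toy label matrices are made up.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of (instance, label) positions where prediction and truth differ."""
    n, n_labels = len(Y_true), len(Y_true[0])
    wrong = sum(t != p
                for y_t, y_p in zip(Y_true, Y_pred)
                for t, p in zip(y_t, y_p))
    return wrong / (n * n_labels)


def example_based_accuracy(Y_true, Y_pred):
    """Mean of |true AND pred| / |true OR pred| over all instances."""
    total = 0.0
    for y_t, y_p in zip(Y_true, Y_pred):
        inter = sum(1 for t, p in zip(y_t, y_p) if t and p)
        union = sum(1 for t, p in zip(y_t, y_p) if t or p)
        total += inter / union if union else 1.0   # both sets empty: count as correct
    return total / len(Y_true)


# Made-up example with 3 instances and 4 labels.
Y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
Y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

hl = hamming_loss(Y_true, Y_pred)
acc = example_based_accuracy(Y_true, Y_pred)
print(f"hl = {hl:.3f}, acc = {acc:.3f}")
```

In this example, 3 of the 12 label positions are wrong, and the per-instance accuracy terms are 1/2, 1 and 1/3.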
In Table 1, results for the experimented methods are shown in columns two to five. Moreover, in [17], Random Forest of Predictive Clustering Trees (RF-PCT) was the best performing method in the comparison. The results regarding RF-PCT for hamming loss and accuracy are shown in the last column. The best results are emphasized in bold for each data set. The training time complexities of RF-PCT and ML-kNN are given in [19]. In Table 1, these are represented with respect to the data size. In terms of the evaluated metrics, the CBML-MLM and ML-MLM methods clearly outperform the ML-kNN baseline. Moreover, compared to the best performing method in [17], CBML-MLM and ML-MLM have better accuracy than RF-PCT for four data sets, and for the Scene and Corel5k data sets, the accuracy difference is significant. In terms of hamming loss, CBML-MLM and ML-MLM perform similarly to RF-PCT. This means that CBML-MLM with $K = 1$ in particular provides learning efficiency, locality of models (and therefore natural data parallelism), and high accuracy. Increasing the size of the training data for the local MTDR models with the choice $K = K_c$ seems to improve the CBML-MLM performance only slightly, and only in some cases.
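To give a rough sense of the training cost savings, the following back-of-the-envelope sketch plugs the reported relative size of the largest cluster (0.13 to 0.26 of the training data) and $K_c = 10$ into the leading-order cost terms. It assumes that all constant factors are equal, that the $K_c$ local models are trained one after another, and that every cluster is as large as the largest one, so the numbers are loose upper bounds relative to the $N^3$ cost of ML-MLM rather than measured timings. The function name is hypothetical.

```python
def relative_training_cost(r, n_clusters, use_all_data=False):
    """Upper bound on (total local training cost) / N^3.

    r is the largest cluster size divided by N. With K = 1 each local model
    costs at most (r N)^3; with K = K_c it costs at most (r N)^2 * N.
    """
    per_model = r ** 2 if use_all_data else r ** 3
    return n_clusters * per_model


K_c = 10
for r in (0.13, 0.26):
    k1 = relative_training_cost(r, K_c)
    kc = relative_training_cost(r, K_c, use_all_data=True)
    print(f"r = {r}: K = 1 -> {k1:.4f}, K = K_c -> {kc:.4f}")
```

Under these assumptions, the bound for $K = 1$ stays well below the ML-MLM cost across the reported range of cluster sizes, while the bound for $K = K_c$ is roughly $1/r$ times larger.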

Conclusions
In this paper, we adapted and tested the minimal learning machine (MLM) in multi-label classification (MLC) problems for the first time. The experimental results showed that state-of-the-art performance in MLC was reached with the proposed techniques. We adapted the nearest neighbour MLM (NN-MLM) approach to MLC because, in this way, the label correlations can be taken into account. Moreover, we showed that clustering can be applied to reduce the MLM's training time and model complexity with only a small sacrifice in accuracy and hamming loss. For the largest data set, this sacrifice was the smallest, which suggests that the proposed clustering-based MLM approach would be especially suited for large-scale MLC problems. As future work, we aim to cover, both methodologically and experimentally, the full scope of problem transformations in multi-target regression and classification problems using distance-based machine learning techniques.

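Algorithm 1: CBML-MLM training
Input: Input data $X$, output labels $Y$, number of clusters $K_c$, number of clusters for the distance regression fit $K$, prototype-based clustering algorithm $f_c$.
Output: Set of regression models $\{B_k\}_{k=1}^{K_c}$, cluster-wise input space reference points $\{R_k\}_{k=1}^{K_c}$, cluster-wise output space reference points $\{T_k\}_{k=1}^{K_c}$.
  [The steps that define the cluster index sets $I_k$ and the neighbouring cluster sets $\tilde{I}_k$ are not recoverable from the source.]
  $R_k, T_k \leftarrow$ select subsets from $X$ and $Y$ according to $I_k$
  $D_{x_k}, D_{y_k} \leftarrow$ compute cluster-wise distance matrices for $\{x_i \mid i \in \bigcup_{j \in \tilde{I}_k} I_j\}$ and $R_k$, and for $\{y_i \mid i \in \bigcup_{j \in \tilde{I}_k} I_j\}$ and $T_k$
  $B_k \leftarrow$ solve Eq. (1) for $D_{x_k}$ and $D_{y_k}$

Algorithm 2: CBML-MLM prediction
Input: Input $x$, a set of regression models $\{B_k\}_{k=1}^{K_c}$, cluster-wise input space reference points $\{R_k\}_{k=1}^{K_c}$, cluster-wise output space reference points $\{T_k\}_{k=1}^{K_c}$, cluster prototypes $\{c_k\}_{k=1}^{K_c}$.
Output: Set of labels $y$.
  [The steps that select the local model $k^*$ and compute the predicted distances $\delta$ are not recoverable from the source, apart from the comment below.]
  // predict distances with a local regression model
  $q \leftarrow \arg\min_k \delta_k$ // identify nearest neighbour with predicted distances
  $y \leftarrow T_{k^*}(q)$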

Table 1: Results for the hamming loss (hl) and accuracy (acc) metrics. The elements in the table are formatted as hl/acc. For hl, a smaller value is better; for acc, a larger value is better. In the last row, the training time complexities are shown with respect to the data size $N$. The number of observations in the largest cluster is denoted as $N_*$.
Columns: Data set, ML-kNN, ML-MLM, CBML-MLM $K = 1$, CBML-MLM $K = 10$, Best from [17].