
Concept-based image search is an emerging search paradigm that utilizes a set of concepts as intermediate semantic descriptors of images to bridge the semantic gap. Typically, a user query is rather complex and cannot be well described using a single concept. However, it is less effective to tackle such complex queries by simply aggregating the individual search results for the constituent concepts. In this paper, we propose to introduce learning to rank techniques to concept-based image search for complex queries. With freely available social tagged images, we first build concept detectors by jointly leveraging heterogeneous visual features. Then, to formulate the image relevance, we explicitly model the individual weight of each constituent concept in a complex query. The dependence among constituent concepts, as well as the relatedness between query and non-query concepts, is also considered by modeling the pairwise concept correlations in a factorization manner. Finally, we train our model to directly optimize the image ranking performance for complex queries under a pairwise learning to rank framework. Extensive experiments on two benchmark datasets verify the promise of our approach.


Introduction
With rapid advances in Internet and multimedia technologies, the past few years have witnessed an explosive growth of digital images on the Web. The proliferation of images raises an urgent demand for effective image search technologies. Due to the well-known semantic gap between low-level features and high-level semantics [1,2], current commercial search engines retrieve images mainly based on their associated contextual information, such as titles and surrounding text on Web pages. However, since the associated text is usually unreliable in describing the semantic content of images, the performance of text-based image search methods is still far from satisfactory.
As an alternative to text-based image search, concept-based image search has recently attracted increasing attention and proven to be a promising solution for large-scale search tasks [3,4,5]. In concept-based image search, a set of concept detectors is pre-built to predict the presence of specific concepts, which provides direct access to the semantic content of images. Given a textual query, it is mapped to a group of primitive concepts, and the search results are made up of the images in which these concepts are likely to appear. Thanks to the continuous progress in visual concept detection [6,7], current concept-based search techniques can effectively deal with queries involving only one concept. In reality, however, a user query is rather complex and cannot be well represented by a single concept. For example, consider a query like "a person with a camera on the street", which apparently involves multiple semantic concepts, i.e., "person", "camera", and "street".

Email addresses: bruincui@gmail.com (Chaoran Cui), jlshen@smu.edu.sg (Jialie Shen), chenzhumin@sdu.edu.cn (Zhumin Chen), shuaiqiang.wang@jyu.fi (Shuaiqiang Wang), majun@sdu.edu.cn (Jun Ma)
Confronted with a complex query comprising several semantic concepts, a natural idea is to combine the individual search results for the constituent concepts in the query. However, such a straightforward strategy may be ineffective for the following reasons. First of all, many existing methods assume all constituent concepts are of equal importance [8] or determine their combination weights based on some heuristic rules [9]. From the perspective of information theory, the importance of a constituent concept can be interpreted as the information it bears when the complex query is observed [10]. Different constituent concepts typically exhibit different degrees of informativeness, which are data-dependent and difficult to determine in advance. Secondly, the constituent concepts in a complex query do not appear in isolation; instead, they interact with each other at the semantic level and mutually reinforce their roles during the search process. It is inappropriate to consider the constituent concepts independently and ignore their inter-dependence [3]. Lastly, the concepts not in a complex query may also serve as contextual information to enhance the search accuracy [11]. Recall the aforementioned query example, i.e., "a person with a camera on the street". If an image has a high response for the detector of a non-query concept "sofa", we may have high confidence that the image is irrelevant to the query, since "sofa" rarely appears together with the query concept "street". Nevertheless, the information cues conveyed by the non-query concepts are largely overlooked by existing methods.

Recently, learning to rank techniques [12] have been extensively studied owing to their potential for improving information retrieval systems. In general, learning to rank refers to applying supervised machine learning algorithms to construct the optimal ranking model in a search task. Intuitively, the supervision step offers the possibility of utilizing information from the data collection to steer the search process and reduce the need for making heuristic assumptions [13]. Although great success has been achieved [14,15], few research efforts have been devoted to exploring the potential of learning to rank in concept-based image search.
Motivated by the above discussions, in this paper, we propose to introduce learning to rank techniques to concept-based image search for complex queries. A collection of concept detectors is first built from social tagged images by jointly leveraging heterogeneous visual features. To mitigate the limitations of existing methods mentioned above, in the formulation of the image relevance function, we explicitly model the individual weight of each constituent concept in a complex query. The dependence among constituent concepts, as well as the relatedness between query and non-query concepts, is also considered by modeling the pairwise concept correlations. To address the potential overfitting problem arising from too many model parameters, we adopt the Factorization Machine [16] to factorize concept correlations with a low-rank approximation. The learning of the different model parameters is effectively integrated into a pairwise learning to rank framework, and we build upon the Ranking SVM algorithm [17] to train our model by directly optimizing the image ranking performance for complex queries. It is worth noting that the scalability of our approach is not degraded, even though a supervision step is introduced. This is because the ground-truth information used in training is only required for a limited number of complex queries, from which a query-independent model can be learned and employed to rank images for all queries.
The main contributions can be summarized as follows:

• Our approach resolves the problem of concept-based image search from the perspective of learning to rank, and directly optimizes the image ranking performance for complex queries.
• Our approach explicitly models the individual weight of each constituent concept. To capture the dependence among constituent concepts, as well as the relatedness between query and non-query concepts, the pairwise concept correlations are also modeled in a factorization way.
• Our approach has been evaluated on two publicly accessible benchmark datasets. The experimental results demonstrate the promise of our approach in comparison with the state-of-the-art methods.
The remainder of this paper is structured as follows. Section 2 reviews the related work. Section 3 details our proposed approach to concept-based image search for complex queries. Experimental results and analysis are reported in Section 4, followed by the conclusion and future work in Section 5.

Related Work

Visual Concept Detection
Serving as the foundation for concept-based image search, visual concept detection has attracted considerable research interest in the multimedia computing community. Typically, it is cast as a classification problem, in which each concept is treated as a class label and its presence likelihood is estimated by the classifier prediction score. For example, Lu et al. [18] proposed a multi-modality classifier combination framework to improve the accuracy of semantic concept detection, where multiple classifiers trained on different visual features were combined with a probability-based fusion method. Some studies provided insights on how to construct feature representations in building classifiers for concept detection. In [19], an efficient bag-of-visual-words construction method was developed based on sparse non-negative matrix factorization and GPU-enabled SIFT feature extraction. Li et al. [20] employed the latent Dirichlet allocation approach to cluster image data into semantic topics, and the distributions of image low-level features over such topics were used as middle-level features of images. Yan et al. [21] proposed to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concept high-level feature descriptions. A novel event-oriented dictionary representation was then introduced based on the selected semantic concepts. Besides, zero-shot learning has also been applied to handle event detection in videos [22,23]. The key idea is to pre-train a number of concept classifiers using data from other sources, such that an event of interest can be detected based on its semantic correlation with each concept, even when no labeled example of this event is supplied.

Concept-based Image Search
Given a collection of concept detectors, concept-based image search for complex queries can be performed by fusing the individual search results for the constituent concepts in a query. A critical issue in the fusion strategy is how to determine the combination weights. Natsev et al. [24] proposed to assign equal weight to the search result for each constituent concept. Chang et al. [25] weighted the individual concept detectors according to their training performance. Li et al. [26] set the weight to be proportional to the informativeness of a constituent concept. Despite the encouraging results reported, these heuristic fusion methods are data-independent and may not be effective to the same degree in different application scenarios. On the contrary, in our approach, the individual weight of each constituent concept is explicitly modeled and automatically determined with the information harvested from the data collection.
Another potential limitation of the above fusion-based methods lies in that they consider the constituent concepts independently and ignore their mutual relationships. To address this issue, Yuan et al. [4] leveraged the plentiful but partially related samples, as well as users' feedback, to handle complex queries in interactive concept-based video search. By extending this idea, they further proposed a higher-level semantic descriptor named "concept bundle", which integrates multiple primitive concepts, to describe the visual representation of complex semantics and enhance video search for complex queries [27]. Li et al. [10] learned bi-concept detectors from social tagged images, and applied them in a search engine for retrieving images relevant to bi-concept queries. In [3], the authors developed an image reranking scheme for complex queries by jointly considering multiple relationships between concepts and complex queries from high level to low level. Similarly, Guo et al. [5] proposed a multi-layer probabilistic model to incorporate inter-concept relatedness into image reranking for complex queries. Compared to the previous work, our approach models the pairwise concept correlations in a factorization manner. In this way, we consider not only the dependence among constituent concepts, but also the relatedness between query and non-query concepts.

Learning to Rank
There is an emerging research interest in learning to rank due to its importance in a wide variety of applications, such as information retrieval [15] and personalized recommendation [28]. Roughly speaking, existing learning to rank techniques can be divided into three categories: pointwise methods, pairwise methods, and listwise methods. In pointwise methods [29], ranking is treated as a regression or classification problem on individual items to predict their relevance scores. In pairwise methods [30], ranking is transformed into a classification problem on item pairs to predict the preference relation between two items. In listwise methods [31], ranking is performed by minimizing a direct loss between the true ranking list and the estimated ranking list. A comprehensive survey of learning to rank can be found in [12]. In this paper, our approach follows the direction of pairwise methods, because of their superior performance and relatively low complexity.

Framework
To formulate our problem, we declare some notations in advance. In particular, we use capital letters (e.g., X) and bold lowercase letters (e.g., x) to denote sets and vectors, respectively. We employ non-bold lowercase letters (e.g., x) to represent scalars, and Greek letters (e.g., λ) as hyper-parameters. If not clarified, all vectors are in column form. Table 1 summarizes the key notations and definitions used throughout the paper.
Our framework consists of three main components: 1) visual concept detection, 2) image relevance formulation, and 3) ranking-oriented learning. By harnessing freely available social tagged images, visual concept detectors are first built without the need to manually select training samples for each concept. With the pre-built concept detectors, an image relevance function for complex queries is then formulated, which explicitly takes into account concept weights and concept correlations. Based on the relevance formulation, the ranking-oriented learning is ultimately developed to determine the model parameters by optimizing the image ranking performance for complex queries. The architecture of our framework is illustrated in Figure 1. In the following, we elaborate on each of the components and give a full description of the associated algorithms.

Table 1: Key notations and definitions.

C: The set of all concepts
c, q, p: A certain concept
T: The set of all possible complex queries
L: The set of all labeled images
x_i, Y_i: A certain labeled image and the set of concepts associated with it
m: The number of concepts
n: The number of labeled images
d: The dimensionality of the latent space for concepts
w_c: The weight of c
v_c: The vector representation of c in the latent space
s_qc: The correlation between q and c
S: The set of social tagged images
S_c: The subset of images tagged with c
Z: The set of visual features
z: A certain visual feature
k: The number of visual neighbors
S_{x,z}: The neighbor set of x based on z
D: The set of preference pairs
l: The sample size in each iteration during training
α, β, λ, γ: The hyper-parameters

Visual Concept Detection
As a prerequisite to realizing concept-based image search, various concept detectors need to be built in advance to predict the presence likelihood of the corresponding semantic concepts given a specific image. An appealing source of labeled images for concept detection is social tagged images on the Web [10], in which user-contributed tags encode valuable information about the semantic content of images. As mentioned previously, a typical solution is to train a separate classifier for each concept over social tagged images, and estimate the presence of that concept by the classifier prediction score. However, this concept-specific modeling paradigm is hardly scalable, since a separate classifier has to be trained for every concept in the potentially huge vocabulary.

To avoid the above problems, we adopt a data-driven approach, namely the neighbor voting algorithm [32], for concept detection in this paper. The philosophy behind the neighbor voting algorithm is that if visually similar images are tagged with the same concepts, these concepts are likely to reflect the actual visual content. Despite its simplicity, recent studies [33] have reported that the neighbor voting algorithm remains the state-of-the-art for visual concept detection. In addition, a semantic concept generally exhibits significant diversity in its visual appearance, and it is hence insufficient to rely on a single visual feature to characterize such large visual variations. In light of this, we seek to jointly leverage heterogeneous visual features to build more robust concept detectors.
Let S be a collection of social tagged images, and C a vocabulary consisting of m concepts. For each concept c ∈ C, S_c denotes the subset of images tagged with c, i.e., S_c ⊂ S. We use x to denote an image, and Z a set of visual features. Given a visual feature z ∈ Z, we represent x using z and find the k nearest neighbors of x from S according to the visual similarity measured over z. S_{x,z} denotes the resulting neighbor set of x, based on which the neighbor voting algorithm constructs a base concept detector as follows:

g_z(c, x) = |S_{x,z} ∩ S_c| − k · |S_c| / |S|,    (1)

where | · | is the cardinality of a set. Intuitively, the more frequently a concept occurs in the neighbor set, the more relevant it might be to the given image; however, common concepts with high frequency in the entire collection are usually less descriptive, and thus their estimated relevance should be suppressed. Towards this end, the base concept detector g_z(c, x) counts the difference between the distribution of c in x's neighbor set and that in the entire image collection.
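As an illustration, the base detector can be sketched in a few lines of Python. The formula assumed here is the common form of the neighbor voting algorithm, in which the vote count for c among the k visual neighbors is offset by the count expected from c's overall frequency in the collection; `base_detector` and its argument names are hypothetical.

```python
def base_detector(neighbors_xz, images_with_c, num_images, k):
    """Base concept detector g_z(c, x): the number of x's visual neighbors
    tagged with c, minus the number expected by chance from c's overall
    frequency in the collection (assumed form of neighbor voting)."""
    votes = len(neighbors_xz & images_with_c)        # |S_{x,z} ∩ S_c|
    prior = k * len(images_with_c) / num_images      # k * |S_c| / |S|
    return votes - prior
```

For instance, if 2 of an image's 3 neighbors are tagged "street" while only 3 of 10 collection images carry that tag, the detector yields a positive score, indicating the tag is over-represented in the neighborhood.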
To overcome the limitation of a single feature in describing the visual content, we further combine the base concept detectors obtained with different visual features. The work in [34] compared unsupervised and supervised combination strategies in the context of the neighbor voting model, and the empirical results showed no significant difference in performance between them. However, a major disadvantage of the supervised combination strategy is the expense of its training efforts, which inevitably leads to much more computational cost. In light of this, we adopt the unsupervised uniform combination rule in our approach. Specifically, the concept detector is defined as follows:

r(c, x) = (1 / |Z|) Σ_{z∈Z} g_z(c, x),    (2)

where r(c, x) indicates the confidence that the concept c is present in the image x.
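A minimal sketch of the uniform combination rule, assuming the final detector simply averages the base detector scores obtained with the different visual features (the function name is illustrative):

```python
def concept_detector(base_scores):
    """Uniform combination rule: r(c, x) as the average of the base
    detector outputs g_z(c, x) over all visual features z."""
    return sum(base_scores) / len(base_scores)
```

With five visual features, `base_scores` would hold five per-feature scores for the same (concept, image) pair.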

Image Relevance Formulation
In this paper, we target the problem of concept-based image search for complex queries. Let Q be a complex query comprising two or more concepts, i.e., Q ⊂ C and |Q| ≥ 2. The key challenge is to establish a function f(Q, x) that measures the relevance score of an image x with respect to the complex query Q. Intuitively, each constituent concept q ∈ Q partially describes the user's search intention carried by Q, and f(Q, x) can thus be estimated by aggregating the presence likelihood of each q in x. Inspired by this, we first formulate f(Q, x) as follows:

f(Q, x) = Σ_{q∈Q} w_q · r(q, x),    (3)

where w_q is a weight parameter indicating the importance of q. Distinguished from previous methods that combine different constituent concepts heuristically, we explicitly model the individual weight of each constituent concept. Unlike a single-concept query, a complex query also depicts the intrinsic semantic dependence among its constituent concepts [27]. Different constituent concepts do not appear in isolation; instead, they interact with each other and mutually reinforce their roles in the search process for the complex query. In addition, as aforementioned, the concepts not in the query often provide additional information cues. Hence, it is highly beneficial to retrieve images by simultaneously using both query concepts and non-query ones. In view of this, we explore the possibility of introducing concept correlations into the image relevance function.
WordNet similarity is widely adopted to capture the semantic correlations among concepts. Nonetheless, as it does not directly reflect how people describe the visual content, some highly correlated concepts are usually weakly related in the WordNet ontology [35]. Concept co-occurrence is another commonly used correlation measurement. However, in most annotated corpora, images are frequently associated with only a few concept labels, which may lead to unreliable co-occurrence statistics. More importantly, apart from the positive correlations among concepts, there also exist many important negative correlations. Unfortunately, limited by their non-negative property, both WordNet similarity and co-occurrence statistics cannot reflect the potential negative correlations.
Given the drawbacks of existing correlation measurements, we propose to model the pairwise correlations between concepts, and extend our initial relevance function in Eq. (3) as follows:

f(Q, x) = Σ_{q∈Q} w_q · r(q, x) + α Σ_{q,p∈Q, q<p} s_qp · r(q, x) · r(p, x) + β Σ_{q∈Q} Σ_{c∈C\Q} s_qc · r(q, x) · r(c, x),    (4)

where s_qp is a model parameter capturing the correlation between two concepts q and p. We assume that the concept correlations are symmetric, i.e., s_qp = s_pq, and both positive and negative values are allowed. In Eq. (4), the first term represents the relevance estimated by separately considering each constituent concept in the complex query, the second term encodes the interactions among constituent concepts, and the last term ensures that the information from non-query concepts can also be utilized. The three parts cooperate with each other, leading to a more accurate estimation of the image relevance. Here, α and β are two hyper-parameters used to control the relative contribution of each term.

A potential problem in the above formulation is that it requires a huge number of parameters to model the correlation between each pair of concepts in the vocabulary. From the viewpoint of statistical learning theory, too many model parameters may degrade the model stability and result in overfitting. Existing work [36] on text information processing has demonstrated that the semantic space spanned by textual keywords can be approximated by a smaller set of latent factors. As one kind of text information, semantic concepts are also subject to such a low-rank property [37]. Inspired by this, we apply the Factorization Machine [16] to model the pairwise concept correlations in a factorization way. Specifically, each concept c ∈ C is mapped to a vector v_c ∈ R^d in a d-dimensional latent space, and the correlation s_qp is subsequently approximated by s_qp = v_q^T v_p. Intuitively, s_qp corresponds to the dot product of v_q and v_p in the latent space, which is a commonly used measure for matching textual vectors. As a result, the image relevance function can be reformulated as follows:

f(Q, x) = Σ_{q∈Q} w_q · r(q, x) + α Σ_{q,p∈Q, q<p} v_q^T v_p · r(q, x) · r(p, x) + β Σ_{q∈Q} Σ_{c∈C\Q} v_q^T v_c · r(q, x) · r(c, x).    (5)

Because the intrinsic dimensionality of the latent space is typically much smaller than the total number of concepts (i.e., d ≪ m), the number of model parameters in Eq. (5) is significantly reduced. Besides, it has been shown that the problems of concept synonymy and polysemy can be more easily handled in a low-dimensional semantic space.
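To make the formulation concrete, the following sketch evaluates the factorized relevance function under the reading assumed here: each intra-query concept pair is counted once, and every non-query concept is paired with every query concept. All names are illustrative; `V` holds the latent concept vectors v_c as rows.

```python
import numpy as np

def relevance(Q, r, w, V, alpha, beta):
    """Factorized relevance f(Q, x) for a complex query Q (list of concept
    indices), given per-concept detector scores r, concept weights w, and
    latent factors V of shape (num_concepts, d)."""
    non_query = [c for c in range(len(r)) if c not in Q]
    # first term: weighted presence of each constituent concept
    score = sum(w[q] * r[q] for q in Q)
    # second term: factorized correlations among constituent concepts
    for i, q in enumerate(Q):
        for p in Q[i + 1:]:
            score += alpha * float(V[q] @ V[p]) * r[q] * r[p]
    # third term: correlations between query and non-query concepts
    for q in Q:
        for c in non_query:
            score += beta * float(V[q] @ V[c]) * r[q] * r[c]
    return score
```

Because the correlations enter only through dot products of the latent vectors, the parameter count grows as O(m·d) rather than O(m²).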

Ranking-oriented Learning
We aim to enhance the accuracy of concept-based image search for complex queries by learning the relevance function f in a supervised way. In the supervised scenario, we are given a set of labeled images L = {x_1, x_2, . . . , x_n}, where each image x_i is associated with Y_i, the set of concepts that have been assigned to x_i. Let T be a set of complex queries. Given a complex query Q ∈ T, the ground-truth relevance of x_i with respect to Q is defined as:

rel(Q, x_i) = |Q ∩ Y_i|.    (6)

Eq. (6) ensures that the images associated with more query concepts will be assigned higher relevance. Based on the ground-truth relevance, a set of pairwise preference relations D ⊆ T × L × L can be further derived:

D = {(Q, x_i, x_j) | Q ∈ T, rel(Q, x_i) > rel(Q, x_j)},    (7)

where each triple (Q, x_i, x_j) reflects the partial order information of the ground-truth image ranking for Q. To optimize the image ranking performance for complex queries, we require the relevance function f to satisfy the preference pairs in D as much as possible. In other words, the goal of learning is to minimize the following empirical risk:

R(f) = (1 / |D|) Σ_{(Q,x_i,x_j)∈D} 1(f(Q, x_i) ≤ f(Q, x_j)),    (8)

where 1(·) is an indicator function that outputs 1 if the input Boolean expression is true and zero otherwise. In effect, R(f) measures the proportion of the preference pairs misordered by the relevance function f. Since the indicator function 1(·) is non-smooth, directly optimizing the empirical risk in Eq. (8) is computationally infeasible [14]. To address the problem, we adopt the Ranking SVM framework [17] as the backbone of our learning method. The basic idea of Ranking SVM is to replace the non-smooth indicator function with its convex upper bound, i.e., the hinge loss. As a result, the relevance function f can be learned through the following optimization problem:
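The ground-truth relevance and the derivation of preference pairs can be sketched directly from the definitions above (helper names are ours):

```python
def ground_truth_relevance(Q, Y):
    """rel(Q, x): the number of query concepts among the image's labels,
    so images covering more of the query rank higher."""
    return len(set(Q) & set(Y))

def preference_pairs(queries, labels):
    """Derive D: all triples (Q, i, j) such that image i is strictly
    more relevant to query Q than image j."""
    D = []
    for Q in queries:
        rel = [ground_truth_relevance(Q, Y) for Y in labels]
        for i in range(len(labels)):
            for j in range(len(labels)):
                if rel[i] > rel[j]:
                    D.append((tuple(Q), i, j))
    return D
```

Note that |D| grows quadratically in the number of labeled images per query, which motivates the sampled optimization described next.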

Optimization Problem 1.

min_{w, v, ξ}  (1 / |D|) Σ_{(Q,x_i,x_j)∈D} ξ_{Q,x_i,x_j} + λ_1 ||w||^2 + λ_2 Σ_{c∈C} ||v_c||^2
s.t.  f(Q, x_i) − f(Q, x_j) ≥ 1 − ξ_{Q,x_i,x_j},  ξ_{Q,x_i,x_j} ≥ 0,  ∀(Q, x_i, x_j) ∈ D.

Here, ξ_{Q,x_i,x_j} is a slack variable associated with the triple (Q, x_i, x_j). It can be shown that the average over all slack variables is an upper bound on the empirical risk in Eq. (8). λ_1 and λ_2 are the hyper-parameters representing the weights of the regularization terms.
The main difficulty of Optimization Problem 1 lies in that there are too many (i.e., |D|) constraints to be considered. To solve it efficiently, we resort to the Pegasos algorithm [38] to optimize the primal form of the problem. At each iteration of the Pegasos algorithm, a subset D_s of l training triples is first sampled from D uniformly at random. Then, the subgradients with respect to the model parameters involved with the triples in D_s are computed. Specifically, we use θ to denote an arbitrary model parameter, and the subgradient of the objective function Ω regarding θ can be computed by:

∇_θ Ω = ∇_θ R_θ + (1 / l) Σ_{(Q,x_i,x_j)∈D_s : f(Q,x_i) − f(Q,x_j) < 1} (∇_θ f(Q, x_j) − ∇_θ f(Q, x_i)),

where R_θ denotes the regularization term involving θ, and ∇_θ f(Q, x) is the subgradient of the relevance function f with respect to θ, which is calculated by:

∇_{w_q} f(Q, x) = r(q, x), for q ∈ Q,
∇_{v_q} f(Q, x) = α Σ_{p∈Q, p≠q} v_p · r(q, x) · r(p, x) + β Σ_{c∈C\Q} v_c · r(q, x) · r(c, x), for q ∈ Q,
∇_{v_c} f(Q, x) = β Σ_{q∈Q} v_q · r(q, x) · r(c, x), for c ∈ C \ Q.

Lastly, θ is updated in the opposite direction of ∇_θ Ω with a learning rate γ. Once the model parameters are learned, given a new complex query, the relevance score of a specific image with respect to the query can be estimated by Eq. (5). Based on this score, we obtain the image ranking results for the complex query.
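One iteration of the sampled subgradient update can be sketched generically as follows; `relevance_fn` scores a (query, image) pair under the current parameters, `grads_fn` returns the per-parameter gradients of f, and `lam` carries the regularization weights (all hypothetical names, a sketch of the Pegasos-style step rather than the exact algorithm):

```python
import random

def pegasos_step(D, relevance_fn, grads_fn, params, lam, l, gamma, rng=random):
    """One Pegasos-style iteration: sample l triples, accumulate the hinge
    subgradient over margin-violating pairs, add the regularization
    gradient, and step each parameter against its subgradient."""
    batch = rng.sample(D, min(l, len(D)))
    # regularization part of the subgradient
    grad = {name: lam[name] * value for name, value in params.items()}
    for (Q, x_i, x_j) in batch:
        if relevance_fn(Q, x_i) - relevance_fn(Q, x_j) < 1:  # margin violated
            g_i, g_j = grads_fn(Q, x_i), grads_fn(Q, x_j)
            for name in grad:
                grad[name] = grad[name] - (g_i[name] - g_j[name]) / len(batch)
    return {name: params[name] - gamma * grad[name] for name in params}
```

Only the pairs that violate the unit margin contribute a gradient, so each iteration costs O(l) relevance evaluations regardless of |D|.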

Experiments
In this section, we report a series of experiments conducted to evaluate our approach in the scenario of concept-based image search for complex queries.

Datasets
To ensure the accuracy and fairness of our empirical results, we adopted two benchmark image datasets collected from Flickr in our evaluation. Dataset I is MIRFlickr [39], which consists of 25,000 images. In this dataset, the ground-truth labeling for 18 concepts has been provided, and the average number of concepts per image is 2.03. Note that these concepts all correspond to frequent tags in Flickr and cover different genres including scenes, objects, and events. Dataset II is NUS-WIDE-LITE [40], which contains 55,615 images with their associated tags. Likewise, the ground-truth annotations of 81 concepts for all images are available in this dataset. Each image is annotated with an average of 4.21 concepts.
Since there are no pre-defined complex queries available, we first needed to construct the query set. Following the procedures in [41], we created a complex query by randomly combining the given concepts in the dataset. As reported in [42], Web queries are generally short, with an average of 2.4 terms per query. According to recent statistics from the US, less than 6.2% of queries have more than 5 terms. Therefore, the length of a complex query was set to be between 2 and 5 concepts in our experiments. Besides, we only kept the complex queries for which more than 1% of all the images are annotated with their constituent concepts. The preceding steps finally led to 121 complex queries for Dataset I and 488 for Dataset II, respectively. On both datasets, we took half of the complex queries for training and used the rest for testing. The main statistics of the datasets are summarized in Table 2.
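The query-construction procedure can be sketched as follows; the parameter names and the rejection-sampling loop are our own reading of the steps above:

```python
import random

def build_query_set(concepts, labels, min_len=2, max_len=5,
                    min_frac=0.01, n_draws=10000, seed=0):
    """Randomly combine concepts into queries of min_len..max_len terms,
    keeping a query only if more than min_frac of the images are
    annotated with all of its constituent concepts."""
    rng = random.Random(seed)
    queries = set()
    for _ in range(n_draws):
        Q = frozenset(rng.sample(concepts, rng.randint(min_len, max_len)))
        hits = sum(1 for Y in labels if Q <= Y)
        if hits / len(labels) > min_frac:
            queries.add(Q)
    return queries
```

The frequency filter discards queries whose constituent concepts almost never co-occur, for which no meaningful ground-truth ranking could be derived.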

Experimental Settings
To implement the concept detectors described in Section 3.1, we used five types of low-level visual features to represent each image, namely, 1) a 64-dimensional color histogram, 2) a 144-dimensional color correlogram, 3) a 73-dimensional edge direction histogram, 4) a 128-dimensional wavelet texture, and 5) a 225-dimensional block-wise color moment. These features characterize images from the different perspectives of color, shape, and texture. On the basis of each feature, we used the L1 metric to measure the visual distance between images. Given an image, all the other images were ranked by their distance from it, and the k nearest neighbors were subsequently discovered.
Given a complex query, we generated the ranking list by sorting images in descending order of their relevance with respect to the query. We adopted the Normalized Discounted Cumulative Gain (NDCG) [43] to evaluate the quality of a ranking list. NDCG at the n-th position is computed as:

NDCG@n = N Σ_{i=1}^{n} (2^{rel(i)} − 1) / log_2(i + 1),

where rel(i) is the relevance of the i-th image in the ranking list, as defined in Eq. (6), and N is a normalization constant used to ensure that the NDCG score of the ground-truth image ranking is 1. The average value of NDCG@n (n = 10, 50, 100) over all test complex queries was reported to evaluate the overall performance.
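As a sketch, NDCG@n under the standard gain/discount convention assumed here (gain 2^rel − 1, discount log2(i + 1) with positions starting at 1):

```python
import math

def ndcg_at_n(ranked_rel, ideal_rel, n):
    """NDCG@n: DCG of the produced ranking divided by the DCG of the
    ideal (ground-truth) ranking, truncated at position n."""
    def dcg(rels):
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(rels[:n]))
    ideal = dcg(sorted(ideal_rel, reverse=True))
    return dcg(ranked_rel) / ideal if ideal > 0 else 0.0
```

A ranking that matches the ground-truth order scores exactly 1.0, and misplacing highly relevant images near the top is penalized most heavily by the logarithmic discount.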
There are several hyper-parameters in our model. For the trade-off parameters α and β in Eq. (4), we carried out a grid search over the range of [0, 1] with a granularity of 0.1. The best performance was achieved when α = 0.6 and β = 0.1. For the dimension of the latent space d, we considered values in the range of [5, 50] with a step size of 5. The results demonstrated no significant performance improvement when d is beyond 10. To reduce the computational complexity, we chose d = 10 on both datasets. For the number of visual neighbors k in Eq. (1), we set k = 300, and the effect of the value of k on the performance will be discussed later. For the regularization parameters λ_1 and λ_2 in Optimization Problem 1, we performed a logarithmic grid search from 10^−5 to 10^5 with a scaling factor of 10, and observed the best performance when λ_1 = λ_2 = 0.1. In Algorithm 1, for the sample size l and the learning rate γ, we empirically used l = 3000 and γ = 0.01, respectively.

Competitors
We compared our approach against several state-of-the-art methods for concept-based image search. For these baseline methods, the parameters were tuned via 5-fold cross-validation. Specifically, the competitors in our experiments are:

• TagMatch: This method simply estimates the relevance of an image based on the overlap between the tags associated with the image and the concepts in the given query.
• TagProp [8]: This method exploits a weighted nearest-neighbor model together with distance metric learning to predict the presence probability of a concept. Given a complex query, it takes the product of the presence probabilities of the constituent concepts as the relevance score of an image.
• BiGraph [44]: This method proposes a bi-relational graph model that comprises both the image graph and the concept graph, and connects them by an additional bipartite graph induced from concept assignments. The random walk with restart algorithm is performed over the graph by setting the constituent concepts of a complex query as the starting nodes. The relevance scores can then be calculated according to the stationary distribution over all image nodes.
• LTRCS: This is our proposed approach, which introduces learning to rank techniques to concept-based image search for complex queries.
In our approach, we model the individual weight of each constituent concept in a complex query. The pairwise concept correlations are also modeled to capture the dependence among constituent concepts, as well as the relatedness between query and non-query concepts. To investigate the efficacy of each component, two variants of our original model were also introduced to the comparison:

• LTRCS-EW: Instead of explicitly modeling the weight of each constituent concept, this method assigns equal weights to all constituent concepts in a complex query.
• LTRCS-CO: Rather than learning the concept correlations in a supervised manner, this method uses the co-occurrence statistics as the correlation measurement.
All the methods listed above were fully implemented in Python or Matlab, and tested on a server equipped with a 24-core 2.00 GHz Intel Xeon processor and 32 GB RAM.
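The random walk with restart underlying the BiGraph baseline can be sketched generically as a power iteration. This is not the paper's actual graph construction: the transition matrix P and the restart distribution (concentrated on the constituent-concept nodes of the query) are assumptions for illustration.

```python
def random_walk_with_restart(P, restart, alpha=0.85, tol=1e-9, max_iter=1000):
    """Iterate p <- (1 - alpha) * restart + alpha * P @ p until convergence.

    P is a column-stochastic transition matrix given as a list of rows
    (P[i][j] = probability of stepping from node j to node i); restart is
    the restart distribution, e.g., uniform over the query's concept nodes.
    Returns the (approximate) stationary distribution.
    """
    n = len(restart)
    p = restart[:]
    for _ in range(max_iter):
        new_p = [(1 - alpha) * restart[i]
                 + alpha * sum(P[i][j] * p[j] for j in range(n))
                 for i in range(n)]
        if max(abs(new_p[i] - p[i]) for i in range(n)) < tol:
            return new_p
        p = new_p
    return p
```

Ranking the image nodes by their stationary probabilities then yields the BiGraph-style relevance scores.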

Overall Performance
Figure 2 displays the empirical results of different methods on Dataset I. It is clearly shown that LTRCS consistently outperforms the other competitors in all evaluation metrics. For example, compared with TagMatch, TagProp, and BiGraph, LTRCS gains 4.4%, 2.5%, and 7.5% relative improvement in terms of NDCG@10, respectively. To further analyze the results, we performed a paired t-test [45] to compare the differences between LTRCS and the other methods, and found that the improvement of LTRCS is statistically significant at the significance level of 0.05. These results verify the potential of LTRCS in concept-based image search for complex queries.
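The significance test above can be reproduced with a small helper over per-query scores (in practice, `scipy.stats.ttest_rel` serves the same purpose); the scores in the usage example are illustrative, not the paper's data.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples, e.g., the
    per-query NDCG@10 of two methods on the same query set."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1
```

The resulting t statistic is then compared against the critical value for the chosen significance level (0.05 here) at n − 1 degrees of freedom.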
As can be seen, in comparison to LTRCS, the two variants, LTRCS-EW and LTRCS-CO, both suffer certain performance degradation in different metrics. Since each of them determines one kind of model parameter based on heuristic rules, such results point clearly to the importance of learning concept weights and concept correlations in a supervised manner. Moreover, we notice that LTRCS-EW experiences a more significant decrease in performance than LTRCS-CO, which implies that explicitly modeling concept weights makes a greater contribution to the effectiveness of our approach.
Figure 3 summarizes the comparison results on Dataset II. Again, the proposed approach LTRCS outperforms its counterparts with statistically significant improvement in all metrics. To our surprise, the existing methods, TagProp and BiGraph, substantially fall behind TagMatch, which only calculates the relevance of images by matching their associated tags against the given query. In contrast, LTRCS consistently achieves superior performance to TagMatch, reaching a 4.3% relative improvement on average. These findings further support the conclusion that our approach emerges as the most effective search scheme for complex queries among all the competitors.

Performance Across Queries with Different Lengths
Intuitively, a complex query composed of more concepts carries more sophisticated search intentions, which also increase the difficulty of the search task. Motivated by this, we further studied how different methods behave for complex queries of various lengths. In our experiments, the length of a complex query ranged from 2 to 5 concepts. We adopted Dataset II as the evaluation testbed, since it contains sufficient queries of different lengths. Out of the 245 test queries on Dataset II, the numbers of queries of lengths 2, 3, 4, and 5 are 82, 95, 53, and 15, respectively.
Table 3 presents the performance of different methods for queries of different lengths in terms of NDCG@10. We can see that as the length of queries increases, the search performance of all methods drops gradually. This phenomenon is consistent with the intuition that it is more challenging to generate accurate search results for queries consisting of more concepts. As expected, LTRCS achieves the best performance in all cases. In particular, LTRCS gains a higher relative improvement for complex queries with 4 or 5 concepts, achieving at least 4.7% and 7.1% relative improvement for the two lengths in terms of NDCG@10, respectively. These results indicate that our approach is particularly applicable to long queries in concept-based image search.
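For reference, NDCG@k can be computed as below, using the standard 2^rel − 1 gain and log2 discount; the paper's exact gain/discount convention may differ.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevance labels (highest first
    is ideal). Returns 0.0 if the list contains no relevant items."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0, and any misordering of relevant items is penalized more heavily near the top of the ranking.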

Impact of Number of Visual Neighbors
In this study, we develop the neighbor voting algorithm to build visual concept detectors. A key parameter in the algorithm is the number of visual neighbors considered, i.e., the parameter k. To investigate the impact of k, we conducted experiments to observe the performance variation of our approach when changing k from 10 to 2000. Figure 4 shows how the performance varies with different values of k on Dataset I, where the three curves reflect the impact of k in terms of different metrics. It can be observed that all performance curves have a similar variation trend. Specifically, as k increases, the performance curves go up at first, but when k exceeds a certain threshold, they turn to decline with further increase of k. We believe this phenomenon is reasonable because a small number of neighbors is unable to completely characterize the semantics of a given image, whereas too many neighbors may introduce information irrelevant to that image. In our case, the best performance is achieved when k = 300.
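A minimal sketch of neighbor-voting concept detection is given below. The actual detectors also jointly leverage heterogeneous visual features when retrieving neighbors; this uniform-vote sketch omits that and simply assumes the k nearest neighbors (by some visual distance) are already available.

```python
def neighbor_vote_probability(neighbor_tags, concept, k):
    """Estimate the presence probability of `concept` for an image as the
    fraction of its k nearest visual neighbors that carry the concept tag.

    neighbor_tags: tag lists of the neighbors, ordered by visual distance.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    votes = sum(1 for tags in neighbor_tags[:k] if concept in tags)
    return votes / k
```

The trade-off discussed above is visible directly in this formulation: a small k makes the vote fraction noisy, while a very large k pulls in neighbors that are no longer visually relevant to the image.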

Efficiency Analysis
To further examine the practical utility of our approach, in this subsection, we analyze the efficiency of our learning algorithm. The complexity of estimating the image relevance in Eq. (5) is O(qmd), where q is the average length of complex queries. Given the fact that most complex queries are composed of only a few concepts, we have q ≪ m, and the complexity can be regarded as O(md). In Eq. (12), the subgradient of the relevance function f can be computed in O(md). Consequently, the overall complexity of one iteration in Algorithm 1 is O(lmd).
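Eq. (5) is not reproduced in this excerpt, but a factorized relevance of the assumed form below illustrates where the O(qmd) cost comes from: each of the q constituent concepts contributes its weighted detector score plus correlation-weighted scores of all m concepts, with each correlation an inner product of d-dimensional factor vectors.

```python
def relevance(query, w, V, s):
    """Hypothetical factorized relevance score, illustrating O(q*m*d) cost.

    query: indices of the constituent concepts (q of them)
    w:     per-concept weights, V: d-dim factor vectors (m of them)
    s:     per-concept detector scores for the image
    NOTE: this is an assumed form for illustration, not Eq. (5) itself.
    """
    m = len(V)
    score = 0.0
    for c in query:                                     # q concepts
        score += w[c] * s[c]
        for c2 in range(m):                             # m concepts
            corr = sum(V[c][t] * V[c2][t]               # d multiply-adds
                       for t in range(len(V[c])))
            score += corr * s[c2]
    return score
```

Since q is small in practice, the inner double loop over m concepts and d factor dimensions dominates, matching the O(md) figure quoted above.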
In actual experiments, our Python implementation of the algorithm took approximately 4.16 seconds per iteration on Dataset I and 7.57 seconds per iteration on Dataset II. Figure 5 displays the convergence process of the iterative optimization, measured by the objective function value over a set of randomly selected training triples. It shows that the algorithm generally converges within 30 iterations during training. In Table 4, we report the training time of our approach in comparison with that of the other supervised competitors, i.e., TagProp and BiGraph. Clearly, LTRCS gives a substantial reduction in training time compared to BiGraph. Although LTRCS takes over 1.5 times longer than TagProp, it has a significant superiority in accuracy, as shown in Figure 2 and Figure 3; we believe the gain outweighs the loss. Once training is completed, our approach took an average of 0.17 seconds at test time to yield the image ranking result for a complex query. This means that our trained model can be used interactively by users without any perceived delay. To sum up, the above analysis verifies that our approach is computationally efficient and applicable to large-scale use cases.

Correlation Illustration
In our framework, we model the concept correlations to capture the dependence among constituent concepts as well as the relatedness between query and non-query concepts. To gain a more intuitive understanding, we randomly sample a subset of concepts, and illustrate the learned correlations between each pair of the sampled concepts in Figure 6, where a color map is used to indicate the magnitude of the correlations.
From the figure, we can see that many frequently co-occurring concepts, such as (clouds, sky), (beach, sea), and (lake, sunset), are assigned higher correlations. Analogously, pairs of concepts with the same or similar meanings, like (road, street) and (sea, water), also have higher correlations. In contrast, lower negative correlation values are allocated to rarely co-occurring concepts, such as (beach, buildings), (person, sky), and (leaf, sunset). Note that the range of the learned correlation values is asymmetric about zero. Moreover, the elements on the main diagonal represent the self-correlation of each concept. It can be clearly observed that a diagonal element generally has a higher correlation value than the other elements in the same row or column, which is in accordance with the intuition that a concept is more correlated with itself than with others. In view of these findings, we believe that various kinds of relationships among concepts can be effectively captured by the learned correlations.

Conclusion and Future Work
In this paper, we have investigated the challenge of concept-based image search for complex queries, and addressed the problem from the perspective of learning to rank. With freely available social tagged images, we build concept detectors by jointly leveraging the heterogeneous visual features. To avoid the risk of making heuristic assumptions, the individual weight of each constituent concept in a complex query is explicitly modeled when estimating the image relevance. To capture the dependence among constituent concepts, as well as the relatedness between query and non-query concepts, the pairwise concept correlations are also modeled with a low-rank approximation. The learning of the different parameters is performed through directly optimizing the image ranking performance for complex queries. Extensive experiments have been conducted on two benchmark datasets in comparison with the state-of-the-art methods from different aspects. The results have demonstrated the effectiveness of our approach.
Our future work will focus on three directions. Firstly, we intend to apply distance metric learning techniques to improve the quality of visual neighbors for concept detection. Secondly, we plan to experiment with other learning to rank algorithms to enhance the learning process of our current scheme. Finally, we will further investigate the scalability of our approach by experimenting on larger image datasets.

Figure 1: Schematic illustration of the proposed concept-based image search approach for complex queries.

Algorithm 1: The Pegasos Algorithm for Optimization Problem 1
Input: Set of preference pairs D, sample size l, and learning rate γ
Output: Model parameters θ = (w, v)
1: for c ∈ C do
2:   Initialize w_c and v_c randomly
3: end for
4: repeat
5:   Sample a subset D_s of l training triples from D
6:   Compute ∇_θΩ based on Eq. (10)
7:   Update θ = θ − γ∇_θΩ
8: until convergence
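Algorithm 1 can be sketched in Python as follows. The subgradient of Eq. (10), which is not reproduced in this excerpt, is abstracted as a callback, and the quadratic toy objective in the usage example is purely illustrative.

```python
import random

def pegasos_train(D, l, gamma, theta, subgrad, max_iter=100):
    """Pegasos-style stochastic subgradient descent sketch of Algorithm 1.

    D:       pool of training triples
    l:       sample size per iteration
    gamma:   learning rate
    theta:   initial parameter vector (list of floats)
    subgrad: callback subgrad(theta, batch) standing in for Eq. (10)
    """
    for _ in range(max_iter):
        batch = random.sample(D, min(l, len(D)))   # step 5: sample l triples
        g = subgrad(theta, batch)                  # step 6: subgradient
        theta = [t - gamma * gi                    # step 7: gradient step
                 for t, gi in zip(theta, g)]
    return theta
```

In the full algorithm, θ packs both the concept weights w and the factor vectors v, and the iteration stops on convergence of the objective rather than after a fixed number of steps.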

Figure 2: Performance comparison on Dataset I.

Figure 3: Performance comparison on Dataset II.

Figure 4: Impact of the number of visual neighbors k.
Figure 5: Convergence of the iterative optimization.

Figure 6: Illustration of the learned pairwise correlations between concepts.

Table 1: Summary of key notations and definitions.

Table 2: Statistics of the experimental datasets.

Table 3: Performance comparison among methods for complex queries of various lengths, in terms of NDCG@10.

Table 4: Training time comparison (in seconds).