User Session Level Diverse Reranking of Search Results

Most Web search diversity approaches can be categorized as Document Level Diversification (DocLD), Topic Level Diversification (TopicLD) or Term Level Diversification (TermLD). DocLD selects the relevant documents with minimal content overlap to each other. It does not take the coverage of query subtopics into account. TopicLD solves this by modeling query subtopics explicitly. However, the automatic mining of query subtopics is difficult. TermLD tries to cover as many query topic terms as possible, which reduces the task of finding a query's subtopics to finding a set of representative topic terms. In this paper, we propose a novel User Session Level Diversification (UserLD) approach based on the observation that a query's subtopics are implicitly reflected by the search intents in different user sessions. Our approach consists of two phases: (I) Session Graph Construction and (II) Diversity Reranking. For a given query, phase (I) builds a Session Graph which considers relevant user sessions and preliminary retrieval results as nodes and the nodes' pairwise similarities as edge weights. Phase (II) reranks the preliminary retrieval results by minimizing a Session Graph based diversity loss function. Extensive experiments on two standard datasets of NACSIS Test Collections for IR (NTCIR) demonstrate the effectiveness of our approach. The advantage of our approach lies in its ability to avoid mining query subtopics or topic terms in advance while achieving comparable or better performance.


Introduction
Search result diversification has attracted significant attention recently as a method to improve the performance of Web retrieval systems [1,2,3,4,5].
There are at least three reasons for this. First, most queries are ambiguous and multifaceted [6,7]. A canonical example is the query "jaguar". For this query, search engines should diversify the results because they do not know whether the query refers to the animal, the car or the software. Sometimes, even if searchers think they have offered enough information, their queries are still ambiguous. For the extended query "jaguar car", search engines still do not understand whether it represents "buying a jaguar car", "new releases of jaguar car", "price of jaguar car", and so on. Second, users' information needs are uncertain, exploratory and personalized. That is, the information needs of the same query may vary from user to user. For example, for the query "swine flu", doctors and patients may be interested in different aspects of the same topic. However, it is hard for search engines to get enough personalized information to understand users' exact intents under the current keyword-based search scenario. Especially when a user issues her/his first query, we know nothing (no past user behaviors) but the query. Third, content overlap exists among documents. Search engines should not return duplicate and redundant search results.

Tremendous efforts have been made on Web search result diversification [8,9,10], most of which can be summarized into three groups: Document Level Diversification (DocLD), Topic Level Diversification (TopicLD) and Term Level Diversification (TermLD), as shown in Figure 1. DocLD includes Maximal Marginal Relevance (MMR) [11] and its probabilistic variants [12]. They promote diversity at the document level by selecting relevant documents with the maximum content difference. DocLD does not need any a priori knowledge of query subtopics. However, there are no guarantees that the aspects covered by the selected documents correspond to query subtopics [13].
TopicLD models query subtopics explicitly and selects documents to cover as many subtopics as possible, such as IA-Select [13], xQuAD [14], ACSL [15] and the Proportionality Model [16]. These approaches are generally more effective. However, as Dang and Croft [8] pointed out, they depend heavily on a set of predefined query subtopics, the automatic mining of which is difficult and still an ongoing research problem. Dang and Croft proposed TermLD, which uses a set of terms instead of subtopics of a query to promote diversity. This reduces the task of finding a query's subtopics to finding a set of representative terms.
Is there an approach that can diversify search results effectively while avoiding mining query subtopics or topic terms? In this paper, we address this problem by mining the rich human intelligence contained in query logs [17,18].

There are two widely accepted facts. To sum up, the primary contributions of this paper are as follows.
• We propose a novel two-phase framework UserLD for search result diversification.
• Our method does not rely on the subtopic mining results while achieving better or comparable performance compared with previous methods.
• We prove that the objective function of our diversity model is non-negative, monotone and supermodular, based on which we present an algorithm to accelerate the practical running time of the reranking phase.
The remaining sections are organized as follows. In Section 2, we discuss related work. In Section 3, we present our approach to implement UserLD. In Section 4, we report our experiment results. Section 5 concludes our study and discusses future work.

Related Work
There is a large amount of previous work on search result diversification.

Existing studies usually classify search result diversification as either explicit or implicit based on whether the approach models query subtopics explicitly or not. In this paper, we group existing diversification approaches into three classes from a different perspective: DocLD, TopicLD and TermLD.
DocLD approaches pursue a balance between content novelty and relevance of documents. MMR [11] is one of the early influential works belonging to DocLD. Later work promoted diversity by adding diversity constraints to a structured SVM learning framework. Zhu et al. [27] considered the process of diversity as a sequential selection process. They first defined several diversity related features. Then, they learned a diversity ranking function by minimizing a likelihood loss of the generation probability.
TopicLD approaches model the set of query subtopics and return relevant documents to cover as many subtopics as possible. One of the state-of-the-art models, IA-Select [13], supposes users only consider the top k returned results of a search engine. IA-Select tries to maximize the probability that there is at least one relevant result for each subtopic within the top k results. Another model, xQuAD [14], explicitly accounts for the various subtopics associated with a query. The Proportionality Model [16] supposed that the number of results belonging to each subtopic should be proportional to the subtopic's popularity. It treated the diversification problem as finding a proportional representation over different subtopics for the document ranking. Raman et al. [2] addressed the problem of intrinsic diversity, which has little ambiguity in intent but pursues content coverage of aspects on a certain subtopic. Their target was not a single query, but a type of complex task which spans multiple queries across one or more user search sessions. Santos et al. [28] assumed that there are various subtopics underlying a query, and that users' information needs include navigational intents and informational intents.
They first learned the appropriateness of different retrieval models for each of the aspects underlying this query. Then they proposed an intent-aware search result diversification method to cover all subtopics and intents. Hong and Si [3] introduced two approaches to diversify the results of resource selection in distributed information retrieval. The first approach reranks the documents based on their relevance to the query subtopics. The second approach estimates the relevance of each information source with respect to different subtopics of the query by any existing resource selection algorithm.
TermLD approaches work by diversifying search results based on a set of query topic terms. Those terms can be identified from the summarization of the preliminary document ranking. Dang and Croft [8] proved the effectiveness of TermLD and concluded that grouping those terms into subtopics provides little benefit to diversification compared to the presence of the terms themselves.
This reduces the task of finding a set of query subtopics into finding a simple set of topic terms.
In this paper, we propose a new approach, namely UserLD. Different from the above approaches, UserLD promotes diversity based on the observation that a query's subtopics are implicitly reflected by the search intents in different user sessions.

Notions and Notations
Before introducing our approach, we introduce some notations and key concepts. User queries can be classified into four types of patterns [21]. If the query is a single phrase, usually a noun phrase, then the type is "Q". The other three types are "Q + W", "W + Q", and "Others", where "W" denotes some keywords [29,30]. For example, "Q": Harry Potter; "Q + W": Harry Potter movie; "W + Q": download Harry Potter. According to [21], the percentages of the four types are 45.5%, 25.5%, 16.5% and 12.5% respectively. This is reasonable because users tend to add additional keywords to specify the search intents in their minds when the current results are not satisfactory [31].

The main notations are as follows:

u_c: The current user session that q_c belongs to.
q_c: The current issued query whose results need to be diversified.
Q(u): The collection of issued queries in the session u.
C(u): The collection of clicked URLs in the session u.
Q_qc: The collection of "Q"-type, "Q + W"-type and "W + Q"-type queries corresponding to query q_c in the query logs.
Q_qc(u): The collection of "Q"-type, "Q + W"-type and "W + Q"-type queries corresponding to query q_c in the session u.
C_qc: The collection of all clicked URLs in the query logs where the issued queries belong to Q_qc.
C_qc(u): The collection of clicked URLs in the session u where the issued queries belong to Q_qc.
U_qc ⊆ U: The collection of user sessions that contain at least one query belonging to Q_qc.
D_qc: The preliminary results of query q_c returned by BM25.
R: The reranking results of query q_c.

Let u_c represent the current user session and q_c represent the current issued query in the session which needs to be diversified. Then we consider q_c as a "Q"-type query, and find out all "Q + W"-type and "W + Q"-type queries in the query logs. All those queries are considered as the related queries of q_c, denoted as Q_qc (including q_c itself). All user sessions containing one or more queries of Q_qc constitute the session set U_qc ⊆ U.
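As an illustration, the four-pattern classification above can be sketched with simple string matching. This is a hypothetical implementation: the names `classify` and `related_queries` are ours, and real query logs would need normalization beyond lowercasing.

```python
def classify(query: str, base: str) -> str:
    """Classify `query` relative to the base query `base` into the four
    patterns of [21]: "Q", "Q + W", "W + Q", or "Others"."""
    q, b = query.strip().lower(), base.strip().lower()
    if q == b:
        return "Q"                 # the bare noun phrase
    if q.startswith(b + " "):
        return "Q + W"             # base query followed by extra keywords
    if q.endswith(" " + b):
        return "W + Q"             # extra keywords followed by base query
    return "Others"

def related_queries(base, log_queries):
    """Q_qc: all "Q"/"Q + W"/"W + Q"-type queries for `base` in the log."""
    return [q for q in log_queries if classify(q, base) != "Others"]
```

For example, with the base query "harry potter", "harry potter movie" is typed "Q + W" and "download harry potter" is typed "W + Q", matching the examples in the text.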

Our approach consists of two main phases: Session Graph Construction and Diversity Reranking, as shown in Figure 2. For a given query q_c, we first retrieve all user sessions which contain the query string (i.e., U_qc) and use the BM25 model with the Lucene implementation to generate the top 100 preliminary document results (i.e., D_qc). Then, we consider U_qc and D_qc as nodes and the nodes' pairwise similarities as edge weights to build the Session Graph. Finally, we rerank D_qc by minimizing a Session Graph based diversity loss function.
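The graph-building step of phase (I) can be sketched as follows. The `similarity` callback and the sparsification `threshold` are assumptions for illustration, standing in for the node and edge weight definitions given later.

```python
from itertools import combinations

def build_session_graph(sessions, documents, similarity, threshold=0.0):
    """Sketch of phase (I): nodes are session ids and document ids;
    node pairs are connected by edges weighted with `similarity`.
    Edges at or below `threshold` are dropped to keep the graph
    sparse (an assumption, not a detail from the paper)."""
    nodes = {**sessions, **documents}          # id -> textual content
    edges = {}
    for a, b in combinations(nodes, 2):
        w = similarity(nodes[a], nodes[b])
        if w > threshold:
            edges[(a, b)] = w
    return nodes, edges
```

Any pairwise similarity can be plugged in; a toy token-overlap function is enough to exercise the structure.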

The Session Graph G(u_c, q_c) for the current user session u_c and current issued query q_c can be formalized as a four-tuple (V, E, W(U_qc), P(E)) [32]. V = {U_qc ∪ D_qc} is the collection of nodes. There are two kinds of nodes: user session nodes U_qc and document nodes D_qc. Each session node u ∈ U_qc is associated with a weight value w(u) ∈ W(U_qc) reflecting the importance of u. We consider several aspects to model w(u) in the next subsection. E is the collection of edges between node pairs. Each edge is associated with a probability weight value P(e) ∈ P(E) reflecting the similarity of the two nodes [33]. There are two key problems in building the Session Graph: 1) How to define the node weight w(u)? 2) How to define the edge weight P(e) and compute the pairwise edge weights efficiently? We investigate the two problems respectively.

Node Weight w(u)
w(u) is defined by Formula 1, which contains two parts. The first part (i.e., imp(u)) is the a priori importance of session u, which is estimated based on two aspects: issuing a popular query and clicking a popular URL. The two aspects are balanced with a parameter α. Here Q_qc(u) and C_qc(u) are queries and clicked URLs in session u corresponding to query q_c; C_qc is the collection of clicked URLs corresponding to Q_qc; vol(q) is the search volume of q; cli(url) is the click volume of url. The ln(·) form is adopted to reduce the exponential growth of search/click volumes. imp(u) is computed offline.
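Since the display formula for imp(u) is not reproduced in this copy, the following sketch shows one plausible instantiation consistent with the prose: an α-weighted sum of log search volumes and log click volumes. The exact combination in the paper's formula may differ, so treat this as an assumption.

```python
import math

def imp(session_queries, session_clicks, vol, cli, alpha=0.5):
    """A priori importance of a session: a plausible instantiation
    (assumption) of the description above. `vol` maps a query to its
    search volume, `cli` maps a URL to its click volume; ln(.) damps
    the exponential growth of search/click volumes."""
    query_part = sum(math.log(vol[q]) for q in session_queries)
    click_part = sum(math.log(cli[u]) for u in session_clicks)
    return alpha * query_part + (1 - alpha) * click_part
```

Because the volumes come from the full log, imp(u) can be computed offline exactly as the text states.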
The second part (i.e., sim(u_c, u)) is the similarity of session u to the current session u_c, where s_i(u_c, u) is a sub-similarity function. There are four sub-similarity functions in total. Although sim(u_c, u) needs to be computed online, it can be computed efficiently in parallel for each u ∈ U_qc. Given any two user sessions u and u', we propose two Query Similarities s_1(·), s_2(·) and two Click Similarities s_3(·), s_4(·) to evaluate their similarity.
Query Similarity. Query similarities measure the similarity of user search intents [34,35]. We define two query similarity functions in this paper.

The first similarity function describes the term match between the queries, where q_i^m represents the term at position m of query q_i, and Pos(q_j, q_i^m) is the set of all positions of query q_j where the term is q_i^m. tms(q_i, q_j) equals 1 if q_i is exactly the same as q_j, and 0 if no term overlap exists. Otherwise, tms(q_i, q_j) is between 0 and 1 according to term sequence consistency. For each q_j ∈ Q(u), we find its most similar q_i ∈ Q(u_c) according to tms(q_i, q_j) and tms(q_j, q_i). Then we average the similarities as s_1(u_c, u).
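A hypothetical instantiation of tms(·) with the stated boundary behavior (1 for identical queries, 0 for disjoint ones, intermediate position-sensitive values otherwise) can be sketched as follows. The exact positional formula in the paper may differ.

```python
def tms(qi: str, qj: str) -> float:
    """Term match score of qi against qj: an illustrative sketch
    (assumption) of the positional definition above. Each term of qi
    contributes 1 if qj has it at the same position, 0.5 if qj has it
    elsewhere, and 0 if absent; contributions are averaged."""
    ti, tj = qi.split(), qj.split()
    score = 0.0
    for m, term in enumerate(ti):
        if m < len(tj) and tj[m] == term:
            score += 1.0           # same term, same position
        elif term in tj:
            score += 0.5           # same term, different position
    return score / len(ti)
```

This preserves the stated boundary cases: identical queries score 1, disjoint queries score 0, and reordered or partially overlapping queries fall strictly in between.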
While the queries in two user sessions may not have direct term overlap, they may be similar semantically. To address this, the second function, s_2(u_c, u), measures the semantic similarity of two queries based on word translation probabilities.

In this paper, the word translation probability P(q_i^m | q_j^n) is estimated offline based on the queries derived from the search logs [36,34]:

P(q_i^m | q_j^n) = TF(q_i^m, q_j^n) / TF(q_j^n),

where TF(q_i^m, q_j^n) is the co-occurrence frequency of the two words in user issued queries and TF(q_j^n) is the term frequency of q_j^n in user issued queries. P(q_i^m | q_j^n) measures the co-occurrence probability of two words, so s_2(u_c, u) reflects the semantic similarity of u_c and u.
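The estimate P(w1 | w2) = TF(w1, w2) / TF(w2) can be computed offline with two counters over the issued queries. A minimal sketch, with one simplifying assumption: duplicate words within a single query are counted once for co-occurrence.

```python
from collections import Counter
from itertools import permutations

def translation_probabilities(log_queries):
    """Estimate P(w1 | w2) = TF(w1, w2) / TF(w2) from issued queries:
    TF(w1, w2) counts queries in which the two words co-occur and
    TF(w2) counts occurrences of w2 across all queries."""
    tf = Counter()     # term frequency over issued queries
    cooc = Counter()   # within-query co-occurrence counts
    for query in log_queries:
        words = query.split()
        tf.update(words)
        for w1, w2 in permutations(set(words), 2):
            cooc[(w1, w2)] += 1
    return lambda w1, w2: cooc[(w1, w2)] / tf[w2] if tf[w2] else 0.0
```

For instance, in a log where "jaguar" co-occurs with "car" in two of its three queries, P(car | jaguar) = 2/3.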
The first is based on the Jaccard similarity:

s_3(u_c, u) = |C(u_c) ∩ C(u)| / |C(u_c) ∪ C(u)|,

where C(u) is the set of clicked URLs in the session u. s_3(u_c, u) measures the direct click overlap between u_c and u. While the clicked results in two user sessions may not have direct overlap in terms of clicked URLs, they may belong to the same topic. To address this, we define the second similarity based on the Cosine similarity.
Here url and url' represent clicked URLs, and W_url denotes the word vector of all issued queries when users click the URL. s_4(u_c, u) is based on the intuition that the issued queries are good topic indicators of clicked URLs.
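The two click similarities can be sketched as follows: s_3 is the Jaccard overlap of clicked URLs, while s_4 compares sessions through the query words attached to their clicked URLs. The exact per-URL-pair Cosine aggregation in the paper may differ; here the sessions' word bags are compared directly, which is an assumption.

```python
from collections import defaultdict
from math import sqrt

def s3(clicks_a, clicks_b):
    """Jaccard similarity of the clicked-URL sets of two sessions."""
    A, B = set(clicks_a), set(clicks_b)
    return len(A & B) / len(A | B) if A | B else 0.0

def s4(clicks_a, clicks_b, url_words):
    """Topical click similarity (sketch of the Cosine-based definition).
    Each session is mapped to the bag of query words attached to its
    clicked URLs via `url_words`; the bags are compared with Cosine."""
    def vec(clicks):
        v = defaultdict(int)
        for url in clicks:
            for w in url_words.get(url, []):
                v[w] += 1
        return v
    va, vb = vec(clicks_a), vec(clicks_b)
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(x * x for x in va.values()))
    nb = sqrt(sum(x * x for x in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

s_4 can be nonzero even when s_3 is zero: two sessions that clicked different URLs reached by the same query words still come out topically similar.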

Edge Weight P (e)
Each document node is represented as a TF-IDF vector, TF-IDF(d). Each user session node is represented as a TF-IDF vector of its issued queries and clicked documents. Then the edge weight probability P(e) is estimated from the Cosine similarity of the two node vectors, where β is a hyper parameter which will be detailed in the experiments.
We evaluate P(e) with a Cosine similarity for two reasons. First, the range of the Cosine similarity between TF-IDF vectors lies in [0, 1], which fits the probability interpretation of P(e).
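A minimal sketch of the TF-IDF representation and the Cosine similarity underlying P(e); the β step from the paper is omitted here, and the helper names are ours.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(texts):
    """TF-IDF vectors for a small corpus of node texts (documents, or
    the concatenated queries and clicked documents of a session)."""
    docs = [Counter(t.split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)   # document frequency
    return [{w: tf * log(n / df[w]) for w, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors; in [0, 1] because
    TF-IDF weights are non-negative."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note the non-negativity of TF-IDF weights is what confines the Cosine value to [0, 1], matching the probability reading of P(e).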

Given the Session Graph G(u_c, q_c), we define the diversity loss function L(R, G(u_c, q_c)) based on G(u_c, q_c) (Formula 11) and rerank by greedily minimizing it. We take three measures to speed up Greedy.
(1) Lazy Greedy. We can prove that Formula 11 is non-negative, monotone and supermodular (Theorem 1); the proof details are shown in the appendix. Based on Theorem 1, the practical running time of Greedy can be reduced by Lazy Greedy (or Accelerated Greedy) [40]. The key idea of applying Lazy Greedy to our problem is that, according to Theorem 1, as the result set R grows, the increments L_d(R) never increase. Assume R_i is the result set after adding the i-th document. When selecting the next document d, instead of recomputing the increment for every remaining node, Lazy Greedy reuses previously computed increments and only recomputes the current top candidate.
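Lazy Greedy can be sketched with a max-heap of possibly stale marginal gains. Here `marginal_gain` stands in for the increment L_d(R); the names are ours.

```python
import heapq

def lazy_greedy(candidates, marginal_gain, k):
    """Lazy (accelerated) greedy selection [40]: stale marginal gains
    are kept in a max-heap and only the popped top candidate is
    recomputed, which is sound when gains can only shrink as the
    selected set R grows (diminishing returns, as in Theorem 1)."""
    R = []
    heap = [(-marginal_gain(c, R), 0, c) for c in candidates]  # (neg gain, round, item)
    heapq.heapify(heap)
    while heap and len(R) < k:
        neg_gain, round_no, c = heapq.heappop(heap)
        if round_no == len(R):          # gain is fresh for the current R
            R.append(c)
        else:                           # stale: recompute and push back
            heapq.heappush(heap, (-marginal_gain(c, R), len(R), c))
    return R
```

With a coverage-style gain, most candidates are never re-evaluated after the first round, which is the source of the speedup.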
Note that path(R ⇒ u) = ∪_{d∈R} path(d ⇒ u). Here σ(*) represents the user session nodes activated by *, and P_{*⇒u} is the probability that u is activated by *, i.e., P_{*⇒u} ≈ N(*⇒u)/N with the Monte Carlo (MC) implementation.
As a result, L_d(R) is updated with MC as follows: the first part is the expected increment from newly activated nodes in σ(d), and the second part is the expected increment from the overlap σ(d) ∩ σ(R).

Algorithm 2: Diversity Reranking.
Input: the graph G(u_c, q_c); the number of results, k.
Output: k document nodes, R.

The most time consuming task of Algorithm 2 is the Monte Carlo Simulation for each d_j ∈ D_qc (lines 2 to 7). In our experiments, the Session Graphs are built from commercial query logs collected in one month. In these Session Graphs, the Monte Carlo Simulation for each d_j ∈ D_qc can be finished in milliseconds.
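The Monte Carlo estimate P_{*⇒u} ≈ N(*⇒u)/N can be sketched as repeated random propagation over the probabilistic edges. This is a simplified, generic implementation: the paper's Session Graph is abstracted into a map from directed edges to firing probabilities, and the function name is ours.

```python
import random

def mc_activation_prob(source, edges, target, n_runs=10000, seed=7):
    """Estimate P_{source=>target}: the probability that `target` is
    activated from `source` when each directed edge (a, b) fires
    independently with probability edges[(a, b)].
    P is approximated as N(source=>target) / N over n_runs trials."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        active, frontier = {source}, [source]
        while frontier:
            node = frontier.pop()
            for (a, b), p in edges.items():
                if a == node and b not in active and rng.random() < p:
                    active.add(b)
                    frontier.append(b)
        hits += target in active
    return hits / n_runs
```

On small graphs each trial is a cheap traversal, which is consistent with the millisecond timings reported above.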

Relevance Assessments
For evaluating the performance of diversity, we have to judge the relevance level of each document-subtopic pair as ground truth. The ground truth is built using the conventional pooling approach. The same interface was used by four assessors (two undergraduates and two graduates), which lets assessors view each pooled document and select a relevance grade for each subtopic: "Excellent", "Great", "Good", "Fair" or "Bad". Two assessors were assigned to each query.

IA-Select(TopicLD, GT): IA-Select [13] with given subtopics and subtopic importance distribution in ground truth.

IA-Select(TermLD): IA-Select implemented with the term level diversification approach proposed in [8].

UserLD: User session level search result diversification approach proposed in this paper.
Disagreements between the two assessors are negligible. The relevance grades were aggregated to form a five-point relevance scale, from L0 ("Bad") to L4 ("Excellent").

Baseline Diversity Models
We use seven classic models in the diversity literature as baselines, as shown in Table 2. The first is a DocLD approach, i.e., MMR [11]. The TopicLD approaches include IA-Select [13] and xQuAD [14]. The TermLD approaches include the IA-Select and xQuAD models implemented with the term level diversification approach proposed in [8].

Experimental Tools and Parameter Settings
The IA metrics are computed as a weighted sum over subtopics, nDCG-IA = Σ_z P(z|q) · nDCG_z, where z is a subtopic of query q; P(z|q) is the subtopic importance distribution; nDCG_z is nDCG for a particular subtopic z. nERR-IA can be computed similarly.
In order to solve the undernormalisation problem of IA metrics, Sakai et al. [42] proposed D#-metrics, which are computed as D#-metric = γ · I-rec + (1 − γ) · D-metric, where γ is a hyper parameter which will be detailed later. The D-metric is computed by replacing the raw gain g(r) of cumulative-gain-based metrics such as nDCG and Q-measure with the global gain GG(r) = Σ_z P(z|q) g_z(r), where g_z(r) is the gain value of the document at rank r with respect to subtopic z.
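The intent-aware weighted sum and the global gain can be sketched directly from their definitions:

```python
def ndcg_ia(per_subtopic_ndcg, subtopic_probs):
    """nDCG-IA = sum_z P(z|q) * nDCG_z: the intent-aware weighted
    average of per-subtopic nDCG scores."""
    return sum(subtopic_probs[z] * per_subtopic_ndcg[z] for z in subtopic_probs)

def global_gain(per_subtopic_gain, subtopic_probs):
    """GG(r) = sum_z P(z|q) * g_z(r): the global gain that D-metrics
    substitute for the raw gain g(r) at a given rank r."""
    return sum(subtopic_probs[z] * per_subtopic_gain[z] for z in subtopic_probs)
```

Both are plain expectations over the subtopic distribution P(z|q), which is why a document relevant to many popular subtopics scores higher than one relevant to a single rare subtopic.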
Most diversification mechanisms are evaluated using only diversity measures. However, diversity may be achieved at a cost of relevance. Therefore, in addition to the above diversity measures, we also evaluate our results using four standard relevance-based metrics for Web retrieval: nDCG, nERR, P@k, and Q-measure. All of these metrics are computed using the NTCIREVAL toolkit.

IA metrics Results
The performance quantified by IA metrics is depicted in Figure 3. The IA metrics evaluate the diversity of the ranking results by computing the weighted sum of per-subtopic relevance, which forces a tradeoff between selecting documents with higher relevance scores and those that cover additional subtopics. So it is reasonable that IA-Select(TopicLD) achieves the best performance in terms of I-rec. Nevertheless, the performance of UserLD proves the feasibility of promoting diversity at the user session level.

We further analyze the possible reasons that the I-rec of UserLD is worse than that of IA-Select(TopicLD). First, some subtopics do not exist in the query logs. For example, the query "Transformers" (film series) contains 4 subtopics, i.e., "Transformers 1", "Transformers 2", "Transformers 3" and "Transformers 4". However, the query logs we used only contain "Transformers 1". Second, some queries and subtopics are unpopular and seldom searched by users. For example, most users will never seek information about the query "Polysilicon". UserLD relies on user search behaviors, so the loss or sparsity of query logs (which is usually not a problem for commercial search engines) may influence its performance. Third, the most probable reason is that the query logs we used were collected in only one month.

D#-metrics Results
We also evaluate UserLD with D#-metrics, which are more intuitive than other diversity metrics and promising for diversified IR evaluation according to [42]. Different from IA metrics, D-metrics evaluate diversity by computing a global gain, i.e., GG(r) = Σ_z P(z|q) g_z(r), for each document at rank r in the search results. D#-metrics balance the effect of I-rec and the D-metric with a parameter γ. As Sakai et al. [42] showed, the effect of the choice of γ on IR experiments is relatively small due to the fact that I-rec and D-metrics are already highly correlated with each other. Following their study, we also set γ = 0.5. The results are shown in Figure 5. As we can see, the results of D#-metrics are basically consistent with those of IA metrics, which further confirms the improvement of UserLD.
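The D# combination itself is a one-line linear mix of intent recall and the D-metric:

```python
def d_sharp(i_rec, d_metric, gamma=0.5):
    """D#-metric = gamma * I-rec + (1 - gamma) * D-metric: a linear
    combination of intent recall and the D-metric; gamma = 0.5
    following Sakai et al. [42]."""
    return gamma * i_rec + (1 - gamma) * d_metric
```

Because I-rec and D-metrics are highly correlated, moving γ away from 0.5 changes the absolute scores but rarely the ranking of systems, which is the justification given above for fixing γ = 0.5.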

Evaluation of Result Relevance
Diversity metrics measure per-subtopic document relevance and favour documents covering many subtopics but not necessarily very relevant to the given query, so we further conduct experiments to analyze whether diversity is achieved at a cost of relevance. Four standard relevance-based metrics for Web retrieval, nDCG, nERR, Q-measure, and P@k, are used to evaluate the relevance of the search results. nDCG, nERR and Q-measure take into account the positions of relevant documents in the ranking.

Performance Summary
Overall, UserLD achieves the highest performance in terms of IA metrics and D#-metrics. As IA metrics and D#-metrics evaluate diversity by considering ranking relevance and subtopic coverage as a whole, the improvement of UserLD lies in two aspects. First, the result relevance indicated by the standard relevance-based metrics is improved greatly compared with the baselines. Second, the subtopic coverage indicated by I-rec is only slightly worse than that of the TopicLD model IA-Select(TopicLD), and only because IA-Select(TopicLD) takes the query subtopics in the ground truth as input. The results mean that user search intents are indeed diversified and different user intents indeed reflect different subtopics of a query. In summary, our approach can avoid mining query subtopics in advance while achieving almost the same or better performance compared with previous approaches.

Running Time
The practical running time of our approach on the INTENT dataset is shown in Figure 7. The raw Greedy for our model is too slow to finish, so we did not compare with it. As expected, the most time consuming work is done when choosing the first document. After that, the later documents can be chosen quickly. By comparison, as the number of search results k increases, raw Greedy slows down dramatically due to the extra computation stated in Section 3.3.

Conclusions and Future Work
In this paper, we propose UserLD and implement it with a Session Graph Construction phase and a Diversity Reranking phase. Extensive experiments demonstrate the effectiveness of our approach, which confirms that UserLD can promote effective diversity while avoiding mining query subtopics or topic terms.
However, there are still at least two issues. First, only one month of query logs is available, so we do not know whether more query logs would further improve the performance. Second, although we adopt the Greedy algorithm, the properties of the current diversity loss function cannot guarantee a (1 − 1/e) approximation. Either the diversity loss function or the optimization algorithm needs to be improved.
