## DisC Diversity

Result diversification has attracted considerable attention as a means of enhancing the quality of query results presented to users. Consider, for example, a user who wants to buy a camera and submits a related query. A diverse result, i.e., a result containing various brands and models with different pixel counts and other technical characteristics is intuitively more informative than a homogeneous result containing only cameras with similar features.

There have been various definitions of diversity, based on (i) content (or similarity), i.e., objects that are dissimilar to each other, (ii) novelty, i.e., objects that contain new information when compared to what was previously presented and (iii) semantic coverage, i.e., objects that belong to different categories or topics. Most previous approaches rely on assigning a diversity score to each object and then selecting either the k objects with the highest score for a given k or the objects with score larger than some threshold.

In this work, we address diversity through a different perspective. Let P be the set of objects in a query result. We consider two objects p_{1} and p_{2} in P to be **similar**, if dist(p_{1}, p_{2}) ≤ r for some distance function dist and real number r, where r is a tuning parameter that we call **radius**. Given P, we select a representative subset S ⊆ P to be presented to the user such that: (i) **all objects in P are similar with at least one object in S** and (ii) **no two objects in S are similar with each other**. The first condition ensures that all objects in P are represented, or **covered**, by at least one object in the selected subset. The second condition ensures that the selected objects of P are dissimilar, or **independent**. We call the set S **r-Dissimilar and Covering** subset or **r-DisC** diverse subset.

In contrary to previous approaches to diversification, we aim at computing subsets of objects that contain objects that are both dissimilar with each other and cover the whole result set. Furthermore, instead of specifying a required size k of the diverse set or a threshold, our tuning parameter r explicitly expresses the degree of diversification and determines the size of the diverse set. Increasing r results in smaller, more diverse subsets, while decreasing r results in larger, less diverse subsets. We call these operations, **zooming-out** and **zooming-in** respectively. One can also zoom-in or zoom-out locally to a specific object in the presented result.

Zooming. Selected objects are shown in bold. Circles denote

the radius r around the selected objects.

As an example, consider searching for cities in Greece. Figure 1 shows the results of this query diversified based on geographical location for an initial radius (a), after zooming-in (b), zooming-out (c) and local zooming-in a specific city (d).

We make the following contributions:

- we propose a new, intuitive definition of diversity, called DisC diversity and compare it with other models,
- we show that locating minimum DisC diverse subsets is an NP-hard problem and provide efficient heuristics along with approximation bounds,
- we introduce adaptive diversification through zooming-in and zooming-out and present algorithms for their incremental computation as well as corresponding theoretical bounds,
- we provide M-tree tailored algorithms and experimentally evaluate their performance.

### Related Publications:

- Marina Drosou and Evaggelia Pitoura, DisC Diversity: Result Diversification based on Dissimilarity and Coverage, in PVLDB, vol. 6, no.1 (to appear)

To be presented at the 39th International Conference on Very Large Data Bases (VLDB 2013), August 26-30, 2013, Riva del Garda, Trento, Italy. Granted the Reproducible Label by the VLDB 2013 reproducibility committee.

Source Code and scripts Description