## Diversifcation of Continuous Data

In many applications, such as in news alerts, RSS feeds and notifcation services in social networks, new data is **continuously** generated. In such cases, we need to define over which subsets of data items we apply diversification. In our work, we have considered three fundamental models for item delivery: (i) **periodic**, (ii) **sliding-window** and (iii) **history-based filtering** delivery. Given a budget of k, with periodic delivery, the k most diverse items are computed over disjoint periods of length T and are forwarded to the subscribers at the end of the period. With sliding-window delivery, the k most diverse items are computed over sliding windows of length w, so that, an item is forwarded, if and only if, it is part of the k most diverse items in any of the windows it belongs to. Finally, history-based filtering forwards new items as they are matched, if and only if, they are dissimilar enough to the k most diverse items recently delivered. The lengths T and w can be defined either in **time units** (e.g., as "the 10 most diverse items matched per hour" and, respectively, "the 10 most diverse items matched in the last hour") or in **number of items** (e.g., as "the 10 most diverse items per 100 matched ones" and, respectively, "the 10 most diverse items among the 100 most recently matched ones").

A jumping window for w=5 and j=3.

We also consider allowing windows not only to slide but also to "jump", i.e., move forward more than one position in the stream of matching items each time. We call these windows **jumping windows**. Assuming windows of length w and a jump step of length j, with j < w, consequent windows overlap and share w-j common items. Note that, for w = 1, we get sliding window delivery, while for w = T, we get periodic delivery.

One difficulty of computing diverse items over continuous data is that it cannot be done incrementally. To this end, we propose an index-based approach to the **dynamic** diversifcation problem, where insertions and deletions of items are allowed and the diverse subset needs to be refreshed to refect such updates. We also consider the **continuous** version of the problem, where diversifed sets are computed over streams of items. For streams, it is important that the items retrieved be the users during consequent logins exhibit some **continuity properties**. For example, the order in which the diverse items are delivered to the users should follow the order of their generation. Also, an item should not appear, disappear and then re-appear in the diverse set.

Diverse subserts computed by based on cover trees. Bold points

represent the items at each tree level, moving from lower

to higher levels as we move from top-left to bottom-right.

For the effcient computation of diverse results in a dynamic setting, we propose a solution based on **cover trees**. Cover trees are data structures proposed for approximate nearest-neighbor search. We provide **theoretical results** for the accuracy of the solution achieved using cover trees for solving the MaxMin diversification problem, namely we prove that the solution achieved by the cover tree is a (b-1)/(2b^2)-approximation of the optimal solution, where b is the base of the cover tree. Our experimental results show that our approach produces highly diversified results at reasonable time.

### Related Publications:

- Marina Drosou and Evaggelia Pitoura, Diversity over Continuous Data, in IEEE Data Engineering Bulletin, vol. 32, no. 4, pp. 49-56, 2009, IEEE Computer Society
- Marina Drosou and Evaggelia Pitoura, Dynamic Diversification of Continuous Data, in Proc. of the 15th International Conference on Extending Database Technology (EDBT 2012), March 27-30, 2012, Berlin, Germany