Saturday, August 12, 2017

Advanced SQL: Group Overlapping Events Using Analytic Window Functions

Problem: Group overlapping events

Given a list of events with start and end times, group them by their overlap. The problem is easier to understand with the following graph.

Solution 1: Use cross self-join

I found the following forums with a similar solution.
Here is my own implementation of the solution.
Even though the code is surprisingly short, it is very hard to understand or explain. I am intrigued to develop an alternative solution.

Solution 2: Use analytic window functions

Sort events by their start and end times. Then loop through the event sequence 3 times.
  1. In the 1st loop, at each row (event), calculate the max value of its previous events' end times (prev_max_end). This can be done using a running total calculation, an analytic function with a window clause (ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING). Compare prev_max_end against the current row's start time. If the start time > prev_max_end, then the current event starts a new group. In other words, a group boundary is detected at the current row. In the above graph, the detection is positive at Event 4. Event 1 automatically starts a new group because its pre_max_end is NULL.
  2. In the 2nd loop, at each row (event), count the running total of group boundaries. Again, magic of window clause (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). Every event within the same start and end boundaries will have the same count. For example, events 4, 5 and 6 all get running totals of 2, because they all see the boundaries at events 1 and 4. The running totals, therefore, can be used as group IDs.
  3. Finally in the 3rd loop, find min(start time) and max(end time) within each group as the group's start and end times.
Here is an implementation.
It is very easy to verify and understand the solution step-by-step on a demo execution, as shown below.


Performance comparison

In addition to easiness of understanding and explaining, solution 2 also has a much better performance. Solution 1 has 3 cross self joins, which result in O(N4) total row counts. On the other hand, solution 2 has no joins. It just loops through the same number of rows 3 times, O(N). On a 2000-events data set, solution 1 runs for 12 seconds, and solution 2 less than 1 second.

No comments: