Problem: Group overlapping events
Given a list of events with start and end times, group them by their overlap. The problem is easier to understand with the following graph.Solution 1: Use cross self-join
I found the following forums with a similar solution.
- https://stackoverflow.com/questions/2561130/merge-overlapping-date-intervals
- https://www.sqlservercentral.com/Forums/Topic826031-8-1.aspx
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT | |
s1.StartTime, | |
MIN(t1.EndTime) AS EndTime | |
FROM Event_Demo s1 | |
INNER JOIN Event_Demo t1 ON s1.StartTime <= t1.EndTime | |
AND NOT EXISTS(SELECT * FROM Event_Demo t2 | |
WHERE t1.EndTime >= t2.StartTime AND t1.EndTime < t2.EndTime) | |
WHERE NOT EXISTS(SELECT * FROM Event_Demo s2 | |
WHERE s1.StartTime > s2.StartTime AND s1.StartTime <= s2.EndTime) | |
GROUP BY s1.StartTime | |
ORDER BY s1.StartTime |
Solution 2: Use analytic window functions
Sort events by their start and end times. Then loop through the event sequence 3 times.- In the 1st loop, at each row (event), calculate the max value of its previous events' end times (prev_max_end). This can be done using a running total calculation, an analytic function with a window clause (ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING). Compare prev_max_end against the current row's start time. If the start time > prev_max_end, then the current event starts a new group. In other words, a group boundary is detected at the current row. In the above graph, the detection is positive at Event 4. Event 1 automatically starts a new group because its pre_max_end is NULL.
- In the 2nd loop, at each row (event), count the running total of group boundaries. Again, magic of window clause (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). Every event within the same start and end boundaries will have the same count. For example, events 4, 5 and 6 all get running totals of 2, because they all see the boundaries at events 1 and 4. The running totals, therefore, can be used as group IDs.
- Finally in the 3rd loop, find min(start time) and max(end time) within each group as the group's start and end times.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
with detect_group_boundary as ( | |
select | |
EventID, | |
StartTime, | |
EndTime, | |
max(EndTime) over (order by StartTime, EndTime rows between unbounded preceding and 1 preceding) as prev_max_end, | |
case when StartTime > isnull(max(EndTime) over (order by StartTime, EndTime rows between unbounded preceding and 1 preceding), StartTime - 1) then 'Y' end as new_group | |
from Event_Demo | |
), count_running_group_boundary as ( | |
select | |
*, | |
count(new_group) over (order by StartTime, EndTime rows between unbounded preceding and current row) as group_id | |
from detect_group_boundary | |
) | |
select | |
*, | |
min(StartTime) over (partition by group_id) as group_start, | |
max(EndTime) over (partition by group_id) as group_end | |
from count_running_group_boundary | |
order by StartTime, EndTime |
It is very easy to verify and understand the solution step-by-step on a demo execution, as shown below.
Performance comparison
In addition to easiness of understanding and explaining, solution 2 also has a much better performance. Solution 1 has 3 cross self joins, which result in O(N4) total row counts. On the other hand, solution 2 has no joins. It just loops through the same number of rows 3 times, O(N). On a 2000-events data set, solution 1 runs for 12 seconds, and solution 2 less than 1 second.