August 14, 2013 at 4:33 am
The following can be pasted into SSMS so that it is both more readable, and easier get your teeth into the data.
/* System: SQL 2005
Problem: I have a table of data representing when people stayed in certain locations.
Its fine for multiple people to stay in the one location at the same time. People can arrive
and leave independently of one another. I need to find the "gaps" in the data where there was
nobody at a given location. A person can't be in two locations at once. Here's
some sample data, but the real data will be over a million rows long and the query will run
many times a day.
*/
--======= Create a test table
IF OBJECT_ID('tempdb..#stays') IS NOT NULL
DROP TABLE #stays
;
CREATE TABLE #stays(transaction_id int identity(1,1)
,location_id int not null
,person_id int not null
,start_dt datetime not null
,end_dt datetime null
CONSTRAINT [PK_#stay] PRIMARY KEY NONCLUSTERED (transaction_id ASC)
)
;
GO
--======= Add a unique constraint
CREATE UNIQUE CLUSTERED INDEX ix_person_id_start_dt ON #stays (person_id,start_dt);
--======= Add some sample data
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (1,1,'20130701 12:38','20130703 14:19');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (1,2,'20130702 01:32','20130704 05:20');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (1,7,'20130703 04:32','20130705 08:23');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (1,3,'20130707 14:35','20130710 09:39');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (1,4,'20130708 11:30','20130709 19:13');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (2,5,'20130701 16:31','20130702 14:31');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (2,6,'20130703 13:37','20130705 22:27');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (2,1,'20130708 19:32','20130711 04:56');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (2,2,'20130715 18:38','20130718 14:44');
INSERT #stays(location_id,person_id,start_dt,end_dt) VALUES (3,3,'20130712 01:32','20130712 21:19');
GO
/* The activity could be displayed graphically thus:
Date |-1--|-2--|-3--|-4-|-5--|-6--|-7--|-8--|-9--|-10-|-11-|-12-|-13-|-14-|-15-|-16-|-17-|-18-|
Loc 1 Psn 1 ***|---------| ********** ******************************************
Loc 1 Psn 2 *** |----------| ********** ******************************************
Loc 1 Psn 7 *** |---------|********** ******************************************
Loc 1 Psn 3 *** **********|--------------|******************************************
Loc 1 Psn 4 *** ********** |-----| ******************************************
______________________________________________________________________________________________________
Loc 2 Psn 5 **|-----|*** *************** ********************* **
Loc 2 Psn 6 ** ***|---------|*************** ********************* **
Loc 2 Psn 1 ** *** |-----------|********************* **
Loc 2 Psn 2 ** *** *************** *********************|--------------|**
______________________________________________________________________________________________________
Loc 3 Psn 3 *******************************************************|--|*******************************
*/
--======= The asterisks above represent the gaps at each location (times when it was unoccupied).
--======= I want a final dataset as follows:
/*
location_idunoccupied_start_dtunoccupied_end_dttransaction_type
1 NULL 2013-07-01 12:38 unoccupied
1 2013-07-05 08:232013-07-07 14:35unoccupied
1 2013-07-10 09:39NULL unoccupied
2 NULL 2013-07-01 16:31 unoccupied
2 2013-07-02 14:312013-07-03 13:37unoccupied
2 2013-07-05 22:272013-07-08 19:32unoccupied
2 2013-07-11 04:562013-07-15 18:38unoccupied
2 2013-07-18 14:44NULL unoccupied
3 NULL 2013-07-12 01:32 unoccupied
3 2013-07-12 21:19NULL unoccupied
*/
--======= This is no good when you get overlapping people at one location.
WITH ordered_starts as
(
SELECT transaction_id
,stay.location_id
,stay.person_id
,stay.start_dt
,stay.end_dt
,ROW_NUMBER() OVER(PARTITION BY stay.location_id
ORDER BY stay.start_dt) as seq
FROM #stays stay
)
SELECT prev.location_id
,prev.end_dt as unoccupied_start_dt
,nxt.start_dt as unoccupied_end_dt
,'unoccupied' as transaction_type
FROM ordered_starts prev
LEFT JOIN ordered_starts nxt ON prev.location_id = nxt.location_id
and prev.seq = nxt.seq - 1
ORDER BY prev.location_id
,prev.start_dt
;
So the obvious question is "How do I identify the gaps where the locations are unoccupied?
...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell
August 14, 2013 at 9:42 am
There's a little date arithmetic left for you in this one:
SELECT
location_id,
unoccupied_start_dt = CASE WHEN seq = 1 THEN NULL ELSE unoccupied_start_dt END,
unoccupied_end_dt = CASE WHEN seq = [Rows] THEN NULL ELSE unoccupied_end_dt END
FROM (
SELECT
location_id,
unoccupied_start_dt = MIN(Timespot),
unoccupied_end_dt = MAX(Timespot),
seq = ROW_NUMBER() OVER(PARTITION BY location_id ORDER BY TimeGroup),
[Rows] = COUNT(*) OVER(PARTITION BY location_id)
FROM (
SELECT
s.location_id, s.MIN_start_dt, s.MAX_end_dt,
x.Timespot,
TimeGroup = DATEADD(minute,1-ROW_NUMBER() OVER(PARTITION BY s.location_id ORDER BY x.Timespot), x.Timespot)
FROM (SELECT location_id, MIN_start_dt = MIN(start_dt), MAX_end_dt = MAX(end_dt) FROM #stays GROUP BY location_id) s
CROSS APPLY (
SELECT TOP (DATEDIFF(minute,s.MIN_start_dt,s.MAX_end_dt)+3)
TimeSpot = DATEADD(minute,-2 + ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),s.MIN_start_dt)
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
) x
WHERE NOT EXISTS (SELECT 1 FROM #stays l WHERE l.location_id = s.location_id
AND x.Timespot BETWEEN l.start_dt AND l.end_dt)
) d
GROUP BY location_id, TimeGroup
) o
I'm willing to bet DwainC has a better solution than this, he's just had a Gaps'n'Islands paper published over on Simple Talk.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 14, 2013 at 3:34 pm
Thanks Chris. I'll go away and digest it. The VALUES... syntax doesn't work in SQL 2005, but it looks suspiciously like a tally table (which I already have) so I might be able to work around it.
...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell
August 14, 2013 at 4:11 pm
OK. So I'm trying to understand table value constructors and cross apply. If I do this (on a SQL 2008 machine):
SELECT *
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
...I get 100,000 rows and 5 columns. All the columns are called n and all the values are zero. What does this do? Why do I need 5 identical columns?
...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell
August 14, 2013 at 7:17 pm
ChrisM@Work (8/14/2013)
I'm willing to bet DwainC has a better solution than this, he's just had a Gaps'n'Islands paper published over on Simple Talk.
You just had to call me out on this one didn't you?
I must rise to the challenge.
-- Islands method by Itzik Ben-Gan
-- http://www.solidq.com/sqj/Pages/2011-March-Issue/Packing-Intervals.aspx
WITH C1 AS (
SELECT location_id, ts, Type
,e=CASE Type WHEN 1 THEN NULL ELSE ROW_NUMBER() OVER (PARTITION BY location_id, Type ORDER BY end_dt) END
,s=CASE Type WHEN -1 THEN NULL ELSE ROW_NUMBER() OVER (PARTITION BY location_id, Type ORDER BY start_dt) END
FROM -- #stays
--CROSS APPLY (
-- VALUES (1, start_dt), (-1, end_dt)) a(Type, ts)
(
SELECT location_id, [Type]=1, ts=start_dt, start_dt, end_dt
FROM #stays
UNION ALL
SELECT location_id, -1, end_dt, start_dt, end_dt
FROM #stays
) a
),
C2 AS (
SELECT C1.*
,se=ROW_NUMBER() OVER (PARTITION BY location_id ORDER BY ts, Type DESC)
FROM C1),
C3 AS (
SELECT location_id, ts
,grpnm=FLOOR((ROW_NUMBER() OVER (PARTITION BY location_id ORDER BY ts)-1) / 2 + 1)
FROM C2
WHERE COALESCE(s-(se-s)-1, (se-e)-e) = 0),
C4 AS (
SELECT location_id, StartDate=MIN(ts), EndDate=MAX(ts)
FROM C3
GROUP BY location_id, grpnm)
-- Convert the Islands from C4 into gaps (method by Dwain.C)
-- https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/
SELECT location_id, StartDate=MIN([date]), EndDate=MAX([date])
FROM (
SELECT location_id, [date]
,rn=ROW_NUMBER() OVER (PARTITION BY location_id ORDER BY [date])/2
FROM (
-- Normally this would be a CROSS APPLY VALUES but this works in SQL 2005
SELECT location_id, [date]=StartDate
FROM C4
UNION ALL
SELECT location_id, [date]=EndDate
FROM C4
) a
) a
GROUP BY location_id, rn
HAVING COUNT(*) = 2
;
Edit: Ooops! Forgot to change Mr. Ben-Gan's CROSS APPLY VALUES to something compatible with SQL 2005.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
August 14, 2013 at 7:27 pm
GPO (8/14/2013)
OK. So I'm trying to understand table value constructors and cross apply. If I do this (on a SQL 2008 machine):
SELECT *
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
...I get 100,000 rows and 5 columns. All the columns are called n and all the values are zero. What does this do? Why do I need 5 identical columns?
Try it like this:
SELECT n=ROW_NUMBER() OVER (ORDER BY (SELECT $))
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
And you find it generates a 100,000 row tally table.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
August 14, 2013 at 7:51 pm
GPO (8/14/2013)
OK. So I'm trying to understand table value constructors and cross apply. If I do this (on a SQL 2008 machine):
SELECT *
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
...I get 100,000 rows and 5 columns. All the columns are called n and all the values are zero. What does this do? Why do I need 5 identical columns?
Going just a little deeper...
You're not actually supposed to use any data from that. You're only supposed to use the presence of rows instead of writing a loop (and it will blow the doors off of any explicit loop or recursive CTE). R. Barry Young coined a phase for such a thing as "Pseudo Cursor".
If you modify the SELECT list just a bit, you can see that it's a readless Tally Table. Adding TOP(some#) (which I didn't do below) to the SELECT will allow you to control the limit of the 1 to N domain.
SELECT N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n)
CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n)
Try the same thing with a While Loop or Recursive CTE and compare the resource usage and duration to the above.
--Jeff Moden
Change is inevitable... Change for the better is not.
August 14, 2013 at 8:24 pm
Jeff doesn't like it when I use
SELECT $
In my tally tables. :hehe: Just having a little fun there.
BTW. I noticed you also wanted the open-ended gaps before start and after end. From your comments in the code you posted:
--======= I want a final dataset as follows:
/*
location_idunoccupied_start_dtunoccupied_end_dttransaction_type
1 NULL 2013-07-01 12:38 unoccupied
1 2013-07-05 08:232013-07-07 14:35unoccupied
1 2013-07-10 09:39NULL unoccupied
2 NULL 2013-07-01 16:31 unoccupied
2 2013-07-02 14:312013-07-03 13:37unoccupied
2 2013-07-05 22:272013-07-08 19:32unoccupied
2 2013-07-11 04:562013-07-15 18:38unoccupied
2 2013-07-18 14:44NULL unoccupied
3 NULL 2013-07-12 01:32 unoccupied
3 2013-07-12 21:19NULL unoccupied
Those are easy enough to add by tacking the following onto the end of the code I posted (after the HAVING clause).
UNION ALL
SELECT location_id, NULL, MIN(start_dt)
FROM #stays
GROUP BY location_id
UNION ALL
SELECT location_id, MAX(end_dt), NULL
FROM #stays
GROUP BY location_id
ORDER BY location_id, StartDate, EndDate
I got so focused on the gaps thingy that I missed that little wrinkle on first scan.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
August 15, 2013 at 12:26 am
Awesome to have the heavy hitters on the case! Jeff, where you say:
You're not actually supposed to use any data from that. You're only supposed to use the presence of rows instead of writing a loop
I understand that, but if you run that it returns 5 columns of zeros. All called [n]. And I can't work out why that's necessary. Surely one column of zeros would be enough.
Edit: I think I get it now! The cross joins mean that 10^5 rows are generated, and you're saying it's irrelevant that 5 meaningless columns just happen to be generated. Doing an extra CROSS JOIN would presumably result in 6 meaningless columns and a million rows. Does that sound right?
Edit 2:
a readless Tally Table
So that means a tally table that doesn't have to be read from the disk, and is presumably therefore faster than the permanent tally table I usually use. How am I going so far?
...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell
August 15, 2013 at 12:39 am
GPO (8/15/2013)
Awesome to have the heavy hitters on the case! Jeff, where you say:You're not actually supposed to use any data from that. You're only supposed to use the presence of rows instead of writing a loop
I understand that, but if you run that it returns 5 columns of zeros. All called [n]. And I can't work out why that's necessary. Surely one column of zeros would be enough.
Edit: I think I get it now! The cross joins mean that 10^5 rows are generated, and you're saying it's irrelevant that 5 meaningless columns just happen to be generated. Doing an extra CROSS JOIN would presumably result in 6 meaningless columns and a million rows. Does that sound right?
I'll try to answer.
The CROSS JOINs are building a Cartesian product of rows 10 at a time. 10x10x10x10x10
Each CROSS JOIN adds an additional column to the result set because the table returned by the table row constructor must have a named column in it. They could be named anything - Chris just chose to name each one n.
However, since all of the values in each of the 5 columns is zero, that isn't what you need. What you need is the ROW_NUMBER, which is your Tally table.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
August 15, 2013 at 12:42 am
GPO (8/15/2013)
Edit: I think I get it now! The cross joins mean that 10^5 rows are generated, and you're saying it's irrelevant that 5 meaningless columns just happen to be generated. Doing an extra CROSS JOIN would presumably result in 6 meaningless columns and a million rows. Does that sound right?Edit 2:
a readless Tally Table
So that means a tally table that doesn't have to be read from the disk, and is presumably therefore faster than the permanent tally table I usually use. How am I going so far?
Seems that your edit was on the mark.
Here's a link you can look at that is comparing in-line vs. disk tally tables. The result I'd say is that it depends.
http://www.sqlservercentral.com/scripts/tally/100338/
I believe the comparison code is in an attached resource file.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
August 15, 2013 at 12:48 am
Hi Dwain
I'm enormously grateful for the code you've put up. I'll test yours and Chris's and see what I can learn from them. I'll post back my observations after some time for reflection...(he said clinging for dear life to the learning curve)
...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell
August 15, 2013 at 12:58 am
GPO (8/15/2013)
Hi DwainI'm enormously grateful for the code you've put up. I'll test yours and Chris's and see what I can learn from them. I'll post back my observations after some time for reflection...(he said clinging for dear life to the learning curve)
I'd be interested to know how the 2 compare on performance. I'm not so sure mine will stand up, primarily because of the adaptations I had to make to be SQL 2005 compatible.
Hopefully we'll see.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
August 15, 2013 at 1:23 am
GPO (8/15/2013)
Hi DwainI'm enormously grateful for the code you've put up. I'll test yours and Chris's and see what I can learn from them. I'll post back my observations after some time for reflection...(he said clinging for dear life to the learning curve)
Dwain's quite capable of providing an American English description of how his code works. Here's an English description of mine ๐
For each LocationID, find the earliest and latest date in the set. Subtract one interval from the earliest and add one interval to the latest. An interval for this exercise is defined as one minute.
Generate a row for each interval between these two dates - a set of dates incrementing by one minute from the start date (minus a minute) to the end date (plus a minute).
Remove rows from the list which are between the start date and end date of any visit for the locationID. This will leave a date range with gaps in it, where the gaps correspond to visits.
Divine the start and end date of each contiguous date range remaining.
Finally, process the start and end date to generate the NULLs shown in your example.
Nice easy query to finish off a busy day with ๐
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
August 15, 2013 at 1:51 am
Chris how would you change yours to work on SQL 2005 (no cross apply and table value constructors)?
...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell
Viewing 15 posts - 1 through 15 (of 24 total)
You must be logged in to reply to this topic. Login to reply