November 15, 2010 at 3:24 pm
How can I query for records that have gaps in a date field?
id  mydate    name
1   01/15/10  test
2   01/16/10  test2
3   01/18/10  test3
4   01/18/10  test4
5   01/19/10  test5
Query results:
id    mydate    name
null  01/17/10  null
November 15, 2010 at 3:41 pm
Easiest way would be with a Tally table (see my sig) to build out a quick calendar table between the start and end date, and then look for nulls in the target table (your table).
If you'd like more tested or finished code, please check another link in my sig, the first one. It'll help you present the data in something consumable for us.
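For illustration, here's a minimal sketch of that approach. It assumes a Tally table (dbo.Tally, with a sequential integer column N starting at 1), and dbo.MyTable is just a stand-in name for your table:
DECLARE @StartDate DATETIME, @EndDate DATETIME;
SELECT @StartDate = MIN(mydate), @EndDate = MAX(mydate) FROM dbo.MyTable;
-- Build a calendar of every date in the range, then keep the dates
-- that have no matching row in the target table.
SELECT cal.CalDate AS mydate
FROM (SELECT DATEADD(dd, t.N - 1, @StartDate) AS CalDate
      FROM dbo.Tally t
      WHERE t.N <= DATEDIFF(dd, @StartDate, @EndDate) + 1) AS cal
LEFT JOIN dbo.MyTable mt
  ON cal.CalDate = mt.mydate
WHERE mt.mydate IS NULL;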
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 15, 2010 at 6:09 pm
Craig Farrell (11/15/2010)
Easiest way would be with a Tally table
The easiest way might be with a tally table... but it's not the fastest way.
-- See how you start off by actually creating a table and then
-- inserting the data into it? Your doing this makes it a lot easier
-- for all of us volunteers to help you. So, help us help you.
-- See http://www.sqlservercentral.com/articles/Best+Practices/61537/
-- for more details on how to do all of this.
DECLARE @dates TABLE (id INT,
                      mydate DATETIME,
                      name VARCHAR(5),
                      PRIMARY KEY CLUSTERED (mydate, id));
INSERT INTO @dates
SELECT 1, '01/15/10', 'test' UNION ALL
SELECT 2, '01/16/10', 'test2' UNION ALL
SELECT 3, '01/18/10', 'test3' UNION ALL
SELECT 4, '01/18/10', 'test4' UNION ALL
SELECT 5, '01/19/10', 'test5';
SELECT DISTINCT
       GapStart = ISNULL((SELECT MAX(lo.mydate + 1)
                            FROM @dates lo
                           WHERE lo.mydate < hi.mydate), 1),
       GapEnd   = DATEADD(day, -1, hi.mydate)
  FROM @dates hi
 WHERE hi.mydate NOT IN (SELECT DATEADD(day, 1, mydate) FROM @dates)
   AND hi.mydate > (SELECT MIN(mydate) FROM @dates);
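For the sample data above, this returns a single row: GapStart = 2010-01-17, GapEnd = 2010-01-17 (the duplicate 01/18/10 rows collapse under the DISTINCT).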
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 16, 2010 at 3:34 am
I will test this out on a 100+ million record table that I have.
Thanks
November 16, 2010 at 8:55 am
Then I assume you will need to partition this by some other column (acct #?). Just ensure you have an appropriate index on the partitioning column(s) and the date column.
You'll probably have to add the partitioning column(s) to the two correlated subqueries.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 16, 2010 at 11:34 am
Hey Wayne, I have to disagree regarding speed, especially with a secondary filter column like an AccountNumber. I've generated a harness in a simplified real-use format: primary access is via an account identifier (number, what have you), and secondary access is via the date. I've kept the date as part of the clustered PK, though I usually would leave it off of one.
First, my test harness. I adapted your code to handle the account number logic. Please double-check that I haven't made it inaccurate - I'm getting the same answers from the two methods in a spot check.
USE tempdb
GO
IF OBJECT_ID('tempdb..#DateTest') IS NOT NULL
    DROP TABLE #DateTest
-- Generate our test table...
CREATE TABLE #DateTest (ID            INT IDENTITY(1, 1) NOT NULL,
                        AccountNumber INT         NOT NULL,
                        TestDate      DATETIME    NOT NULL,
                        Attribute1    VARCHAR(50) NULL,
                        PRIMARY KEY CLUSTERED (AccountNumber, TestDate)
                        )
GO
TRUNCATE TABLE #DateTest
DECLARE @NumAccounts INT,
        @DateSpan    INT,
        @NumRecs     BIGINT
SELECT @NumAccounts = 10,
       @DateSpan    = 400
-- generate some gapped test data
INSERT INTO #DateTest
(AccountNumber, TestDate, Attribute1)
SELECT
t1.N,
DATEADD( dd, t2.N, 0) AS TestDate,
CAST( NEWID() AS VARCHAR(200)) AS Attribute1
--select count(*)
FROM
Tally AS t1,
Tally AS t2
WHERE
t1.N < @NumAccounts
AND t2.N < @DateSpan
AND t2.N%10 <= 7
SELECT @NumRecs = COUNT(*) FROM #DateTest
PRINT 'Test done for ' + CONVERT( VARCHAR(100), @NumAccounts) + ' unique accounts with ' + CONVERT( VARCHAR(100), @DateSpan) + ' days of data, with 4/10 days skipped.'
PRINT 'Testing for ' + CONVERT( VARCHAR(50), @NumRecs) + ' records.'
DECLARE @MissingDateRanges TABLE ( AccountNumber INT, GapStart DATETIME, GapEnd DATETIME)
DECLARE @MissingDates TABLE ( AccountNumber INT, DateMissing DATETIME)
SET STATISTICS IO, TIME ON
PRINT '***** Wayne''s Method ****'
INSERT INTO @MissingDateRanges
SELECT DISTINCT
       AccountNumber,
       GapStart = ISNULL((SELECT MAX(lo.TestDate + 1)
                            FROM #DateTest lo
                           WHERE lo.TestDate < hi.TestDate
                             AND lo.AccountNumber = hi.AccountNumber), 1),
       GapEnd   = DATEADD(day, -1, hi.TestDate)
FROM #DateTest hi
WHERE hi.TestDate NOT IN (SELECT DATEADD(day, 1, TestDate) FROM #DateTest WHERE AccountNumber = hi.AccountNumber)
  AND hi.TestDate > (SELECT MIN(TestDate) FROM #DateTest WHERE AccountNumber = hi.AccountNumber);
PRINT '***** Craig''s Method ****'
;WITH cte AS
(SELECT AccountNumber,
MIN( TestDate) AS MinDate,
MAX( TestDate) AS MaxDate
FROM #DateTest
GROUP BY
AccountNumber)
, cte2 AS
(SELECT
c.AccountNumber,
DATEADD( dd, 0, t.N) AS MissingDay
FROM
cte AS c,
tempdb..Tally AS t
WHERE
t.N BETWEEN DATEDIFF( dd, 0, c.MinDate) AND DATEDIFF( dd, 0, c.MaxDate)
)
INSERT INTO @MissingDates
SELECT
       c2.AccountNumber,
       c2.MissingDay
FROM
       cte2 AS c2
LEFT JOIN
       #DateTest AS dt
       ON  c2.AccountNumber = dt.AccountNumber
       AND c2.MissingDay = dt.TestDate
WHERE
       dt.AccountNumber IS NULL
SET STATISTICS IO, TIME OFF
SELECT * FROM @MissingDateRanges
SELECT * FROM @MissingDates
Now, first run: 10 accounts / 400 days:
Test done for 10 unique accounts with 400 days of data, with 4/10 days skipped.
Testing for 2871 records.
***** Wayne's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#40F9A68C'. Scan count 0, logical reads 352, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000013'. Scan count 362, logical reads 864, physical reads 0, read-ahead reads 6, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 397 ms.
***** Craig's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#41EDCAC5'. Scan count 0, logical reads 703, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 9, logical reads 27, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000013'. Scan count 2, logical reads 52, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 37 ms.
It scales really badly too if you start expanding the # of days (the # of accounts scales mostly linearly).
Test done for 10 unique accounts with 800 days of data, with 4/10 days skipped.
Testing for 5751 records.
***** Wayne's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#0880433F'. Scan count 0, logical reads 713, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000016'. Scan count 722, logical reads 1699, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 328 ms, elapsed time = 1598 ms.
***** Craig's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#09746778'. Scan count 0, logical reads 1425, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 9, logical reads 36, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000016'. Scan count 2, logical reads 98, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 70 ms.
For the massive # of rows Grasshopper needs to do, the Tally will probably provide a faster solution.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 16, 2010 at 12:43 pm
Craig - I'm testing... I'll get back later.
I'm using:
SELECT @NumAccounts = 10000,
@DateSpan = 4000
Test table total: 31,986,801 rows
FYI: I got this code from Jeff - about 2 months ago he got me on the same thing (I had done a tally table to get the gaps, the code he posted blew it away, so I now utilize his).
I know that this is faster; just gotta find out how to get it to that point.
In my testing on your initial test rig (1000/400), your code was just barely beating mine/Jeff's.
Edit: BTW, I'm glad to see this being tested instead of just being accepted!
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 16, 2010 at 1:17 pm
Some larger scale tests:
10k accounts, 1 year (400 days): roughly a 10x time difference, 31312 ms vs. 341944 ms.
Test done for 10000 unique accounts with 400 days of data, with 4/10 days skipped.
Testing for 3189681 records.
***** Wayne's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#1F63A897'. Scan count 0, logical reads 391358, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000017'. Scan count 389964, logical reads 1320638, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 84297 ms, elapsed time = 341944 ms.
***** Craig's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#2057CCD0'. Scan count 0, logical reads 781947, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 9999, logical reads 31263, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000017'. Scan count 2, logical reads 51741, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 7703 ms, elapsed time = 31312 ms.
1k accounts, 1 year (400 days): 1k ms vs. 60k ms.
Test done for 1000 unique accounts with 400 days of data, with 4/10 days skipped.
Testing for 318681 records.
***** Wayne's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#43A1090D'. Scan count 0, logical reads 39100, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 999, logical reads 1602395, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000019'. Scan count 38964, logical reads 131976, physical reads 0, read-ahead reads 6, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 54453 ms, elapsed time = 60617 ms.
***** Craig's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#44952D46'. Scan count 0, logical reads 78124, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 999, logical reads 3139, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________000000000019'. Scan count 2, logical reads 5173, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 906 ms, elapsed time = 1009 ms.
10k accounts, 11 years (4k days)... this didn't go well on my local box. If we need it, I'll track down a more robust server I can goof off on.
Msg 9002, Level 17, State 4, Line 13
The transaction log for database 'tempdb' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases
Msg 9002, Level 17, State 4, Line 13
The transaction log for database 'tempdb' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 16, 2010 at 1:18 pm
Okay, that 31-million-row test table is making testing take too long.
Using:
SELECT @NumAccounts = 1000,
@DateSpan = 400
When doing this, I do get an interesting missing index hint on my method to use TestDate / AccountNumber. I've changed the PK to this combination.
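For reference, that change to the harness table would look like this (only the key column order changes):
CREATE TABLE #DateTest (ID            INT IDENTITY(1, 1) NOT NULL,
                        AccountNumber INT         NOT NULL,
                        TestDate      DATETIME    NOT NULL,
                        Attribute1    VARCHAR(50) NULL,
                        PRIMARY KEY CLUSTERED (TestDate, AccountNumber)
                        )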
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 16, 2010 at 1:29 pm
WayneS (11/16/2010)
When doing this, I do get an interesting missing index hint on my method to use TestDate / AccountNumber. I've changed the PK to this combination.
My concern with this is that you would rarely come in on a single date across multiple accounts for a table like this. You would be more likely to come in with a single account for a date span.
Inverting the clustered index would shatter the majority of other queries just to speed up this one. While interesting as a technical exercise, it's not an index I would usually apply in the real world.
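If this one query really needed that key order, a less disruptive option would be a nonclustered index rather than flipping the clustered one. A sketch, with a made-up index name:
CREATE NONCLUSTERED INDEX IX_DateTest_TestDate_AcctNum
    ON #DateTest (TestDate, AccountNumber);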
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 16, 2010 at 2:18 pm
As an update:
This:
SELECT @NumAccounts = 1000,
       @DateSpan    = 4000
is at an hour and climbing. Not sure which version is the problem child.
EDIT: Finally completed
Death to the lack of the Tally!
Test done for 1000 unique accounts with 4000 days of data, with 4/10 days skipped.
Testing for 3195801 records.
***** Wayne's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#57A801BA'. Scan count 0, logical reads 400029, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 999, logical reads 64044890, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________00000000001A'. Scan count 398604, logical reads 1348329, physical reads 0, read-ahead reads 4, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 5267937 ms, elapsed time = 5859709 ms.
***** Craig's Method ****
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
Table '#589C25F3'. Scan count 0, logical reads 799272, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#DateTest___________________________________________________________________________________________________________00000000001A'. Scan count 2, logical reads 51841, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 999, logical reads 9133, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 10640 ms, elapsed time = 11921 ms.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 16, 2010 at 3:24 pm
FYI, I've asked Jeff to poke his head in here.
Something that is really interesting in an odd way: if you build the test data first, and then separately run the two methods and view the execution plan... it shows my solution to be 2%, and yours 98%. Based on what we're seeing so far, that looks reversed! 🙁
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 16, 2010 at 3:49 pm
WayneS (11/16/2010)
FYI, I've asked Jeff to poke his head in here. Something that is really interesting in an odd way: if you build the test data first, and then separately run the two methods and view the execution plan... it shows my solution to be 2%, and yours 98%. Based on what we're seeing so far, that looks reversed! 🙁
These numbers are from a 100 acct/400 day test.
Hm, I thought we were building the test data first - it's sitting there waiting to be used before we start our IO tests. Just an FYI, I flipped the order of our methods to make sure there wasn't a cache issue going on here, and mine is still 10x faster even going first. Though you're right, the Tally/correlated cost comparison is showing 70%/30%. 2x the work for 10x the speed? Odd, but hey, I'll roll with it.
I've seen that strangeness before, though. The cost in the Tally method is primarily in the outer hash match / inner loop join check - 55% and 21% of it right there. The primary cost in the correlated method is in the 3 index scans it's doing atop the index seek later, and then it still gets killed in its own anti-semi join. The correlated method even carries only about 10% of the data through the process compared to the Tally (31k rows in the correlated vs. 222k in the Tally version).
I'm at a loss here from a time perspective. One extra scan should not be *that* expensive. The only nasty difference is the true Left Anti Semi Join that occurs as an operator. I didn't think those were THAT expensive compared to an outer join followed by a filter.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 16, 2010 at 6:10 pm
Craig Farrell (11/16/2010)
Hm, I thought we were building the test data first.
Okay, I see that I didn't state that clearly enough.
Your script builds the test data and then executes the two methods. If you split it into separate parts (first comment out the two methods and build the test data; then comment out the data build and run only the two methods), so that when testing you are running only the two methods, the execution plan shows this fascinatingly odd data.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 16, 2010 at 6:21 pm
WayneS (11/16/2010)
FYI, I've asked Jeff to poke his head in here. Something that is really interesting in an odd way: if you build the test data first, and then separately run the two methods and view the execution plan... it shows my solution to be 2%, and yours 98%. Based on what we're seeing so far, that looks reversed! 🙁
Sorry... just got home from work and got Wayne's PM. I've gotta eat dinner and a couple of other things and then I'll try to get to this to see what's going on.
--Jeff Moden
Change is inevitable... Change for the better is not.