January 15, 2012 at 11:17 pm
Step-by-step guide to the method I used:
DECLARE @Trading TABLE
(
ID integer NOT NULL PRIMARY KEY,
[Date] date NOT NULL,
[Time] time NOT NULL,
[Price] smallmoney NOT NULL,
[AskClient] integer NOT NULL,
[BidClient] integer NOT NULL
)
INSERT @Trading
(ID, Date, Time, Price, AskClient, BidClient)
VALUES
(01, '2000-01-01', '08:00:00', $6.50, 1, 1),
(02, '2000-01-01', '09:00:00', $6.50, 1, 2),
(03, '2000-01-01', '10:00:00', $6.50, 2, 3), -- different Ask
(04, '2000-01-01', '11:00:00', $6.50, 1, 4),
(05, '2000-01-01', '12:00:00', $6.50, 1, 5),
(06, '2000-01-01', '13:00:00', $6.50, 1, 1),
(07, '2000-01-01', '14:00:00', $6.70, 1, 2), -- different Price
(08, '2000-01-01', '15:00:00', $6.50, 1, 3),
(09, '2000-01-01', '16:00:00', $6.50, 1, 4),
(10, '2000-01-01', '17:00:00', $6.50, 1, 5),
(11, '2000-01-02', '08:00:00', $6.50, 1, 1), -- different Date
(12, '2000-01-02', '09:00:00', $6.50, 1, 2),
(13, '2000-01-02', '10:00:00', $6.50, 1, 3),
(14, '2000-01-02', '11:00:00', $6.50, 1, 4),
(15, '2000-01-02', '12:00:00', $6.50, 1, 5);
-- Number rows in order of date and time
-- In case two rows could have the same date and time,
-- use the ID column value to break the tie:
SELECT
dt_row_number =
ROW_NUMBER() OVER (
ORDER BY t.Date, t.Time, t.ID),
*
FROM @Trading AS t;
-- Number the rows again by date, time, and ID
-- but reset the count for each new combination
-- of AskClient, Date, and Price
SELECT
ask_date_price_dt_row_number =
ROW_NUMBER() OVER (
PARTITION BY t.AskClient, t.Date, t.Price
ORDER BY t.Date, t.Time, t.ID),
dt_row_number =
ROW_NUMBER() OVER (
ORDER BY t.Date, t.Time, t.ID),
*
FROM @Trading AS t;
-- Subtract one row number from the other
-- and notice the pattern: each group of
-- sequential entries for an AskClient,
-- Date, and Price has the same value
SELECT
row_number_difference =
ROW_NUMBER() OVER (
ORDER BY t.Date, t.Time, t.ID)
-
ROW_NUMBER() OVER (
PARTITION BY t.AskClient, t.Date, t.Price
ORDER BY t.Date, t.Time, t.ID),
*
FROM @Trading AS t;
-- Group the records up and show the range of IDs
SELECT
sq1.row_number_difference,
range_start_id = MIN(sq1.ID),
range_end_id = MAX(sq1.ID),
range_count = COUNT_BIG(*)
FROM
(
SELECT
row_number_difference =
ROW_NUMBER() OVER (
ORDER BY t.Date, t.Time, t.ID)
-
ROW_NUMBER() OVER (
PARTITION BY t.AskClient, t.Date, t.Price
ORDER BY t.Date, t.Time, t.ID),
*
FROM @Trading AS t
) AS sq1
GROUP BY
sq1.row_number_difference;
-- Filter out records with a range count less than 5
-- and join back to the original table to get details
SELECT
t.Date,
t.Price,
t.AskClient,
sq2.range_start_id,
sq2.range_end_id,
sq2.range_count
FROM
(
SELECT
sq1.row_number_difference,
range_start_id = MIN(sq1.ID),
range_end_id = MAX(sq1.ID),
range_count = COUNT_BIG(*)
FROM
(
SELECT
row_number_difference =
ROW_NUMBER() OVER (
ORDER BY t.Date, t.Time, t.ID)
-
ROW_NUMBER() OVER (
PARTITION BY t.AskClient, t.Date, t.Price
ORDER BY t.Date, t.Time, t.ID),
*
FROM @Trading AS t
) AS sq1
GROUP BY
sq1.row_number_difference
HAVING
COUNT_BIG(*) >= 5
) AS sq2
JOIN @Trading AS t ON
t.ID = sq2.range_start_id;
-- Written using CTEs instead of subqueries
WITH Numbered AS
(
SELECT
GroupID =
ROW_NUMBER() OVER (ORDER BY t.Date, t.Time, t.ID) -
ROW_NUMBER() OVER (PARTITION BY t.AskClient, t.Date, t.Price ORDER BY t.Date, t.Time, t.ID),
*
FROM @Trading AS t
), Ranges AS
(
SELECT
MIN(N.ID) AS range_start_id,
MAX(N.ID) AS range_end_id,
COUNT_BIG(*) AS range_count
FROM Numbered AS N
GROUP BY N.GroupID
HAVING COUNT_BIG(*) >= 5
)
SELECT
t.Date,
t.Price,
t.AskClient,
r.range_start_id,
r.range_end_id,
r.range_count
FROM Ranges AS r
JOIN @Trading AS t ON
t.ID = r.range_start_id
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
January 18, 2012 at 8:28 pm
Hi Paul,
Sorry for not respond for a few days....
It takes a time for me to learn your solution,cause i'm new to many sql server keywords such as ROW_NUMBER and partition, but your solution works like a charm...
Thank you very much for your help...
Case close...:-)
January 20, 2012 at 2:00 pm
It was worth spending a little time on this to compare island-n-gap resolution by conventional means against a rCTE method.
Firstly, this thread isn't island-n-gaps because the sequence column is part of the island partition. You only get islands when the sequence is external to the partition. Secondly, the results are interesting - well they are to me.
Here's the standard million row testbed, based on OP's data tweaked to diminish the number of "islands". The data is sequenced and indexed to offer maximum benefit to each query:
DROP TABLE TRADING
CREATE TABLE Trading (
ID integer,
[Date] date NOT NULL,
[Time] time NOT NULL,
[Price] smallmoney NOT NULL,
[AskClient] integer NOT NULL,
[BidClient] integer NOT NULL)
IF OBJECT_ID('tempdb.dbo.#Trading','U') IS NOT NULL DROP TABLE #Trading
SELECT ID = ROW_NUMBER() OVER(ORDER BY [date], [Time], [price], AskClient, Bidclient),
[date], [time], Price, AskClient, Bidclient
INTO #Trading
FROM (
SELECT x='1', [Date] = CAST('Jun 24 2011 12:00AM' AS DATE), [Time] = CAST('Jan 14 2012 10:00AM' AS Time), [Price] = '650', [AskClient] = '100001', [BidClient] = '100002' UNION ALL
SELECT '2', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:01AM', '650', '100001', '100002' UNION ALL
SELECT '3', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:02AM', '650', '100001', '100007' UNION ALL
SELECT '4', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:03AM', '650', '100001', '100008' UNION ALL
SELECT '5', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:04AM', '650', '100001', '100009' UNION ALL
SELECT '6', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:05AM', '650', '100002', '100010' UNION ALL
SELECT '7', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:06AM', '650', '100001', '100002' UNION ALL
SELECT '8', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:07AM', '650', '100001', '100008' UNION ALL
SELECT '9', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:08AM', '650', '100001', '100009' UNION ALL
SELECT '10', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:09AM', '650', '100001', '100010' UNION ALL
SELECT '11', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:10AM', '650', '100001', '100007' UNION ALL
SELECT '12', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:11AM', '650', '100003', '100002' UNION ALL
SELECT '13', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:12AM', '650', '100001', '100008' UNION ALL
SELECT '14', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:13AM', '650', '100001', '100009' UNION ALL
SELECT '15', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:14AM', '650', '100001', '100010' UNION ALL
SELECT '16', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:15AM', '650', '100001', '100007' UNION ALL
SELECT '17', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:16AM', '650', '100001', '100002' UNION ALL
SELECT '18', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:17AM', '650', '100004', '100008' UNION ALL
SELECT '19', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:18AM', '650', '100001', '100009' UNION ALL
SELECT '20', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:19AM', '650', '100001', '100010' UNION ALL
SELECT '21', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:20AM', '650', '100001', '100007' UNION ALL
SELECT '22', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:21AM', '650', '100001', '100002' UNION ALL
SELECT '23', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:22AM', '650', '100001', '100002' UNION ALL
SELECT '24', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:23AM', '650', '100005', '100002' UNION ALL
SELECT '25', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:24AM', '650', '100001', '100002' UNION ALL
SELECT '26', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:25AM', '650', '100001', '100002' UNION ALL
SELECT '27', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:26AM', '650', '100001', '100002' UNION ALL
SELECT '28', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:27AM', '650', '100001', '100002' UNION ALL
SELECT '29', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:28AM', '650', '100001', '100002' UNION ALL
SELECT '30', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:29AM', '650', '100006', '100002' UNION ALL
SELECT '31', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:30AM', '650', '100001', '100002' UNION ALL
SELECT '32', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:31AM', '650', '100001', '100002' UNION ALL
SELECT '33', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:32AM', '650', '100001', '100002' UNION ALL
SELECT '34', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:33AM', '650', '100001', '100002' UNION ALL
SELECT '35', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:34AM', '650', '100001', '100002' UNION ALL
SELECT '36', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:35AM', '650', '100002', '100002' UNION ALL
SELECT '37', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:36AM', '650', '100001', '100002' UNION ALL
SELECT '38', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:37AM', '650', '100001', '100002' UNION ALL
SELECT '39', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:38AM', '650', '100001', '100002' UNION ALL
SELECT '40', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:39AM', '650', '100001', '100002' UNION ALL
SELECT '41', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:40AM', '650', '100001', '100002' UNION ALL
SELECT '42', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:41AM', '650', '100002', '100002' UNION ALL
SELECT '43', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:42AM', '650', '100001', '100002' UNION ALL
SELECT '44', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:43AM', '650', '100004', '100002' UNION ALL
SELECT '45', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:44AM', '650', '100001', '100002' UNION ALL
SELECT '46', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:45AM', '650', '100001', '100002' UNION ALL
SELECT '47', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:46AM', '650', '100001', '100002' UNION ALL
SELECT '48', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:47AM', '650', '100001', '100002' UNION ALL
SELECT '49', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:48AM', '650', '100001', '100002' UNION ALL
SELECT '50', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:49AM', '650', '100001', '100002') d
ORDER BY [date], [time], [price], AskClient
INSERT INTO Trading ([ID], [date], [time], Price, AskClient, Bidclient)
SELECT ID = ROW_NUMBER() OVER(ORDER BY [date], [Time], [price], AskClient, Bidclient),
[date], [time], Price, AskClient, Bidclient
FROM (
SELECT [date] = DATEADD(dd,9*x.rn,[date]), [time], Price, AskClient, Bidclient
FROM #Trading
CROSS JOIN (SELECT TOP 20000 rn = ROW_NUMBER() OVER (ORDER BY @@SPID) -1 from sys.columns a, sys.columns b) x
) d
ORDER BY CAST([date] AS DATETIME), CAST([Time] AS Time), [price], AskClient, Bidclient
CREATE UNIQUE CLUSTERED INDEX [CXI_ID] ON Trading ([ID])
CREATE NONCLUSTERED INDEX [IX_dti] ON Trading ([date], [time], [ID])
CREATE NONCLUSTERED INDEX [IX_dpa] ON Trading ([date], [Price], [AskClient])
Here's the queries.
Query1 is Paul's, the now well-known dual window function method:
;WITH Numbered AS
(
SELECT
GroupID =
ROW_NUMBER() OVER (ORDER BY t.Date, t.Time, t.ID) -
ROW_NUMBER() OVER (PARTITION BY t.AskClient, t.Date, t.Price ORDER BY t.Date, t.Time, t.ID),
*
FROM Trading AS t
), Ranges AS
(
SELECT
MIN(N.ID) AS range_start_id,
MAX(N.ID) AS range_end_id,
COUNT_BIG(*) AS range_count
FROM Numbered AS N
GROUP BY N.GroupID
HAVING COUNT_BIG(*) >= 5
)
SELECT
t.Date,
t.Price,
t.AskClient,
r.range_start_id,
r.range_end_id,
r.range_count
FROM Ranges AS r
JOIN Trading AS t ON
t.ID = r.range_start_id
ORDER BY t.ID
-- 60,000 / 00:00:08
Query2 is a rCTE running through the whole table row by ag.. row, incrementing the number of each new island by comparing values to the last row. It's not efficient but it's thorough and reliable, and quick to set up as a testbed for the validation of more efficient methods:
PRINT 'Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------'
;WITH rCTE AS (
SELECT a.ID, a.[Date], a.Price, a.AskClient,
RangeGroup = CAST(1 AS BIGINT)
FROM Trading a
WHERE ID = 1
UNION ALL
SELECT tr.ID, tr.[Date], tr.Price, tr.AskClient,
RangeGroup = CASE WHEN tr.[Date] = lr.[Date] AND tr.Price = lr.Price AND tr.AskClient = lr.AskClient THEN lr.RangeGroup ELSE lr.RangeGroup+1 END
FROM Trading tr
INNER JOIN rCTE lr ON lr.ID +1 = tr.ID
)
SELECT t.[Date], t.price, t.askclient,
range_start_rn,
range_end_rn,
range_count
FROM (
SELECT
range_start_rn = MIN(t.ID),
range_end_rn = MAX(t.ID),
range_count = COUNT(*)
FROM rCTE t
GROUP BY RangeGroup
HAVING COUNT(*) > 4
) [Range]
INNER JOIN Trading t ON t.ID = [Range].range_start_rn
ORDER BY t.ID
OPTION (MAXRECURSION 0)
-- 60,000 / 00:00:27
Query 3 is a rCTE tweaked a bit, shamelessly taking hints from PW's super-fast distinct:
PRINT 'Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------'
;WITH rCTE AS (
SELECT a.ID, a.[Date], a.Price, a.AskClient,
[Lastrow] = CAST(0 AS INT)
FROM Trading a
WHERE a.[ID] = 1
UNION ALL
SELECT a.ID, a.[Date], a.Price, a.AskClient,
a.[Lastrow]
FROM ( -- skip to the first row of the next Date/Price/AskClient partition
SELECT tr.*, [Lastrow] = lr.ID, rn2 = ROW_NUMBER() OVER (ORDER BY tr.ID)
FROM Trading tr
INNER JOIN rCTE lr ON tr.ID > lr.ID
WHERE NOT (tr.[Date] = lr.[Date] AND tr.Price = lr.Price AND tr.AskClient = lr.AskClient)
) a
WHERE rn2 = 1
)
SELECT t.[Date], t.Price, t.AskClient,
range_start_rn = r.[lastrow],
range_end_rn = r.ID-1,
range_count = r.ID - r.[Lastrow]
FROM rCTE r
INNER JOIN Trading t ON t.ID = r.[lastrow]
WHERE r.ID - r.[Lastrow] > 4
OPTION (MAXRECURSION 0)
-- 60,000 / 00:00:13
Query4 is a plain aggregate - because the solution to the OP's problem isn't island-n-gaps.
PRINT 'Query 4 straight aggregate -----------------------------------------------------------------------------------------------------------'
SELECT [date], [price], AskClient,
range_start_rn = MIN(t.ID),
range_end_rn = MAX(t.ID),
range_count = COUNT(*)
FROM Trading t
GROUP BY [date], [price], AskClient
HAVING COUNT(*) > 4
ORDER BY MIN(t.ID)
-- 60,000 / 00:00:01
Here are the results:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(60000 row(s) affected)
Table 'Trading'. Scan count 3, logical reads 187895, physical reads 0, read-ahead reads 9, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 7535 ms, elapsed time = 9044 ms.
Query 1 ===========================================================================================================
Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(60000 row(s) affected)
Table 'Trading'. Scan count 0, logical reads 3180003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 23291 ms, elapsed time = 27005 ms.
Query 2 ===========================================================================================================
Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(60000 row(s) affected)
Table 'Trading'. Scan count 500000, logical reads 1684085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 3000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 12340 ms, elapsed time = 13360 ms.
Query 3 ===========================================================================================================
Query 4 straight aggregate -----------------------------------------------------------------------------------------------------------
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(60000 row(s) affected)
Table 'Trading'. Scan count 1, logical reads 2610, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 764 ms, elapsed time = 1860 ms.
Query 4 ===========================================================================================================
So what can you conclude from this?
- don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.
- rCTE's ain't too bad, 13s for Q3 vs 9s for Q1
-and I'm not yet convinced that the rCTE can't be further tweaked for performance.
Cheers
ChrisM
Edit: updated the sample data generation.
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
January 20, 2012 at 2:18 pm
ChrisM@home (1/20/2012)
- don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.
Chris,
The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
January 20, 2012 at 2:28 pm
Evil Kraig F (1/20/2012)
ChrisM@home (1/20/2012)
- don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.Chris,
The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.
Hi Craig
Dupes ordered by [date] - which is one of the 3 columns to be evaluated together as a dupe. That means no islands, right?
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
January 20, 2012 at 2:36 pm
ChrisM@home (1/20/2012)
Evil Kraig F (1/20/2012)
ChrisM@home (1/20/2012)
- don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.Chris,
The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.
Hi Craig
Dupes ordered by [date] - which is one of the 3 columns to be evaluated together as a dupe. That means no islands, right?
If it was built off the TIME column described above instead of DATE, I'd agree. In this particular case it was by insertion order (since his examples was based off the identity field), but TIME would remove false positives from within the same day (at least as far as I'd consider the issue, but I don't know the original requirements and how over-simplified this is). The DATE field is not specific enough in the sample data to avoid false positives.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
January 20, 2012 at 4:22 pm
Evil Kraig F (1/20/2012)
ChrisM@home (1/20/2012)
Evil Kraig F (1/20/2012)
ChrisM@home (1/20/2012)
- don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.Chris,
The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.
Hi Craig
Dupes ordered by [date] - which is one of the 3 columns to be evaluated together as a dupe. That means no islands, right?
If it was built off the TIME column described above instead of DATE, I'd agree. In this particular case it was by insertion order (since his examples was based off the identity field), but TIME would remove false positives from within the same day (at least as far as I'd consider the issue, but I don't know the original requirements and how over-simplified this is). The DATE field is not specific enough in the sample data to avoid false positives.
Since the OP's concluded that Paul's code works, and Paul's code sorts on date then time, then you are of course correct, Craig. I changed the first 5 rows of the sample set by inserting a new row after row 1, as follows:
SELECT x = '0', [Date] = 'Jun 24 2011 12:00AM', [Time] = 'Jan 14 2012 10:40AM', [Price] = '650', [AskClient] = '100001', [BidClient] = '100002' UNION ALL
SELECT '1', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:41AM', '650', '100002', '100002' UNION ALL
SELECT '2', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:42AM', '650', '100001', '100007' UNION ALL
SELECT '3', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:43AM', '650', '100001', '100008' UNION ALL
SELECT '4', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:44AM', '650', '100001', '100009' UNION ALL
SELECT '5', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:45AM', '650', '100001', '100010' UNION ALL
It's row x='1' above, the old row is renamed to x='0'.
This breaks a 5-row set based on date then time, leaving 2 remaining valid 5-row sets in the 50-row seed data. The 3 islands-n-gaps queries yield 40k rows as you'd expect, the aggregate yields 60k rows. Sorted, thanks.
Point 2 remains, I'll just have to wait until an islands-n-gaps problem comes up, with longer runs in the islands 😉
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
January 20, 2012 at 4:46 pm
ChrisM@home (1/20/2012)
This breaks a 5-row set based on date then time, leaving 2 remaining valid 5-row sets in the 50-row seed data. The 3 islands-n-gaps queries yield 40k rows as you'd expect, the aggregate yields 60k rows. Sorted, thanks.Point 2 remains, I'll just have to wait until an islands-n-gaps problem comes up, with longer runs in the islands 😉
Sorry to have been the fly in your ointment, but I did like your testing. 🙂
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
January 21, 2012 at 12:44 am
Evil Kraig F (1/20/2012)
ChrisM@home (1/20/2012)
This breaks a 5-row set based on date then time, leaving 2 remaining valid 5-row sets in the 50-row seed data. The 3 islands-n-gaps queries yield 40k rows as you'd expect, the aggregate yields 60k rows. Sorted, thanks.Point 2 remains, I'll just have to wait until an islands-n-gaps problem comes up, with longer runs in the islands 😉
Sorry to have been the fly in your ointment, but I did like your testing. 🙂
Not at all :blush: thanks for the catch!
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
January 23, 2012 at 7:22 am
Here's some results of testing the usual method against a rCTE
Tests were performed against a million rows, consisting of a frame of 50 seed rows repeated with non-overlapping dates.
With 5 islands in the 50-row frame:
DatePriceAskClientrange_start_idrange_end_idrange_count
2011-06-24650.0010000111111
2011-06-24650.00100001132311
2011-06-24650.0010000125295
2011-06-24650.0010000131355
2011-06-24650.0010000137415
Query 1 Standard dual window-function method-----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 3, logical reads 310395, physical reads 0, read-ahead reads 4, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6694 ms, elapsed time = 9206 ms.
Query 1 ===========================================================================================================
Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 0, logical reads 3300003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 23400 ms, elapsed time = 28736 ms.
Query 2 ===========================================================================================================
Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 280000, logical reads 1144085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 1680001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 7738 ms, elapsed time = 9638 ms.
Query 3 ===========================================================================================================
With 4 islands in the frame (joined islands 3 & 4), the result looks like this:
DatePriceAskClientrange_start_idrange_end_idrange_count
2011-06-24650.0010000111111
2011-06-24650.00100001132311
2011-06-24650.00100001253511
2011-06-24650.0010000137415
And the performance is like this:
Query 1 Standard dual window-function method-----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 3, logical reads 249145, physical reads 0, read-ahead reads 4, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 5865 ms, elapsed time = 8414 ms.
Query 1 ===========================================================================================================
Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 0, logical reads 3240003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 23494 ms, elapsed time = 28361 ms.
Query 2 ===========================================================================================================
Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 240000, logical reads 964085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 1440001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6801 ms, elapsed time = 8056 ms.
Query 3 ===========================================================================================================
Last run, two large islands, results:
DatePriceAskClientrange_start_idrange_end_idrange_count
2011-06-24650.0010000112323
2011-06-24650.00100001254117
and performance:
Query 1 Standard dual window-function method-----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 3, logical reads 126645, physical reads 0, read-ahead reads 6, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6085 ms, elapsed time = 8593 ms.
Query 1 ===========================================================================================================
Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 0, logical reads 3120003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 23447 ms, elapsed time = 27771 ms.
Query 2 ===========================================================================================================
Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------
Table 'Trading'. Scan count 160000, logical reads 604085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 960001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 4758 ms, elapsed time = 5538 ms.
Query 3 ===========================================================================================================
If the proportion of rows comprising islands is significantly greater than 50%, then a rCTE is the way to go.
For better assistance in answering your questions, please read this[/url].
Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
Viewing 10 posts - 16 through 24 (of 24 total)
You must be logged in to reply to this topic. Login to reply