Comparing data row by row....

  • Step-by-step guide to the method I used:

    DECLARE @Trading TABLE

    (

    ID integer NOT NULL PRIMARY KEY,

    [Date] date NOT NULL,

    [Time] time NOT NULL,

    [Price] smallmoney NOT NULL,

    [AskClient] integer NOT NULL,

    [BidClient] integer NOT NULL

    )

    INSERT @Trading

    (ID, Date, Time, Price, AskClient, BidClient)

    VALUES

    (01, '2000-01-01', '08:00:00', $6.50, 1, 1),

    (02, '2000-01-01', '09:00:00', $6.50, 1, 2),

    (03, '2000-01-01', '10:00:00', $6.50, 2, 3), -- different Ask

    (04, '2000-01-01', '11:00:00', $6.50, 1, 4),

    (05, '2000-01-01', '12:00:00', $6.50, 1, 5),

    (06, '2000-01-01', '13:00:00', $6.50, 1, 1),

    (07, '2000-01-01', '14:00:00', $6.70, 1, 2), -- different Price

    (08, '2000-01-01', '15:00:00', $6.50, 1, 3),

    (09, '2000-01-01', '16:00:00', $6.50, 1, 4),

    (10, '2000-01-01', '17:00:00', $6.50, 1, 5),

    (11, '2000-01-02', '08:00:00', $6.50, 1, 1), -- different Date

    (12, '2000-01-02', '09:00:00', $6.50, 1, 2),

    (13, '2000-01-02', '10:00:00', $6.50, 1, 3),

    (14, '2000-01-02', '11:00:00', $6.50, 1, 4),

    (15, '2000-01-02', '12:00:00', $6.50, 1, 5);

    -- Number rows in order of date and time

    -- In case two rows could have the same date and time,

    -- use the ID column value to break the tie:

    SELECT

    dt_row_number =

    ROW_NUMBER() OVER (

    ORDER BY t.Date, t.Time, t.ID),

    *

    FROM @Trading AS t;

    -- Number the rows again by date, time, and ID

    -- but reset the count for each new combination

    -- of AskClient, Date, and Price

    SELECT

    ask_date_price_dt_row_number =

    ROW_NUMBER() OVER (

    PARTITION BY t.AskClient, t.Date, t.Price

    ORDER BY t.Date, t.Time, t.ID),

    dt_row_number =

    ROW_NUMBER() OVER (

    ORDER BY t.Date, t.Time, t.ID),

    *

    FROM @Trading AS t;

    -- Subtract one row number from the other

    -- and notice the pattern: each group of

    -- sequential entries for an AskClient,

    -- Date, and Price has the same value

    SELECT

    row_number_difference =

    ROW_NUMBER() OVER (

    ORDER BY t.Date, t.Time, t.ID)

    -

    ROW_NUMBER() OVER (

    PARTITION BY t.AskClient, t.Date, t.Price

    ORDER BY t.Date, t.Time, t.ID),

    *

    FROM @Trading AS t;

    -- Group the records up and show the range of IDs

    SELECT

    sq1.row_number_difference,

    range_start_id = MIN(sq1.ID),

    range_end_id = MAX(sq1.ID),

    range_count = COUNT_BIG(*)

    FROM

    (

    SELECT

    row_number_difference =

    ROW_NUMBER() OVER (

    ORDER BY t.Date, t.Time, t.ID)

    -

    ROW_NUMBER() OVER (

    PARTITION BY t.AskClient, t.Date, t.Price

    ORDER BY t.Date, t.Time, t.ID),

    *

    FROM @Trading AS t

    ) AS sq1

    GROUP BY

    sq1.row_number_difference;

    -- Filter out records with a range count less than 5

    -- and join back to the original table to get details

    SELECT

    t.Date,

    t.Price,

    t.AskClient,

    sq2.range_start_id,

    sq2.range_end_id,

    sq2.range_count

    FROM

    (

    SELECT

    sq1.row_number_difference,

    range_start_id = MIN(sq1.ID),

    range_end_id = MAX(sq1.ID),

    range_count = COUNT_BIG(*)

    FROM

    (

    SELECT

    row_number_difference =

    ROW_NUMBER() OVER (

    ORDER BY t.Date, t.Time, t.ID)

    -

    ROW_NUMBER() OVER (

    PARTITION BY t.AskClient, t.Date, t.Price

    ORDER BY t.Date, t.Time, t.ID),

    *

    FROM @Trading AS t

    ) AS sq1

    GROUP BY

    sq1.row_number_difference

    HAVING

    COUNT_BIG(*) >= 5

    ) AS sq2

    JOIN @Trading AS t ON

    t.ID = sq2.range_start_id;

    -- Written using CTEs instead of subqueries

    WITH Numbered AS

    (

    SELECT

    GroupID =

    ROW_NUMBER() OVER (ORDER BY t.Date, t.Time, t.ID) -

    ROW_NUMBER() OVER (PARTITION BY t.AskClient, t.Date, t.Price ORDER BY t.Date, t.Time, t.ID),

    *

    FROM @Trading AS t

    ), Ranges AS

    (

    SELECT

    MIN(N.ID) AS range_start_id,

    MAX(N.ID) AS range_end_id,

    COUNT_BIG(*) AS range_count

    FROM Numbered AS N

    GROUP BY N.GroupID

    HAVING COUNT_BIG(*) >= 5

    )

    SELECT

    t.Date,

    t.Price,

    t.AskClient,

    r.range_start_id,

    r.range_end_id,

    r.range_count

    FROM Ranges AS r

    JOIN @Trading AS t ON

    t.ID = r.range_start_id

  • Hi Paul,

    Sorry for not respond for a few days....

    It takes a time for me to learn your solution,cause i'm new to many sql server keywords such as ROW_NUMBER and partition, but your solution works like a charm...

    Thank you very much for your help...

    Case close...:-)

  • It was worth spending a little time on this to compare island-n-gap resolution by conventional means against a rCTE method.

    Firstly, this thread isn't island-n-gaps because the sequence column is part of the island partition. You only get islands when the sequence is external to the partition. Secondly, the results are interesting - well they are to me.

    Here's the standard million row testbed, based on OP's data tweaked to diminish the number of "islands". The data is sequenced and indexed to offer maximum benefit to each query:

    DROP TABLE TRADING

    CREATE TABLE Trading (

    ID integer,

    [Date] date NOT NULL,

    [Time] time NOT NULL,

    [Price] smallmoney NOT NULL,

    [AskClient] integer NOT NULL,

    [BidClient] integer NOT NULL)

    IF OBJECT_ID('tempdb.dbo.#Trading','U') IS NOT NULL DROP TABLE #Trading

    SELECT ID = ROW_NUMBER() OVER(ORDER BY [date], [Time], [price], AskClient, Bidclient),

    [date], [time], Price, AskClient, Bidclient

    INTO #Trading

    FROM (

    SELECT x='1', [Date] = CAST('Jun 24 2011 12:00AM' AS DATE), [Time] = CAST('Jan 14 2012 10:00AM' AS Time), [Price] = '650', [AskClient] = '100001', [BidClient] = '100002' UNION ALL

    SELECT '2', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:01AM', '650', '100001', '100002' UNION ALL

    SELECT '3', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:02AM', '650', '100001', '100007' UNION ALL

    SELECT '4', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:03AM', '650', '100001', '100008' UNION ALL

    SELECT '5', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:04AM', '650', '100001', '100009' UNION ALL

    SELECT '6', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:05AM', '650', '100002', '100010' UNION ALL

    SELECT '7', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:06AM', '650', '100001', '100002' UNION ALL

    SELECT '8', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:07AM', '650', '100001', '100008' UNION ALL

    SELECT '9', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:08AM', '650', '100001', '100009' UNION ALL

    SELECT '10', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:09AM', '650', '100001', '100010' UNION ALL

    SELECT '11', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:10AM', '650', '100001', '100007' UNION ALL

    SELECT '12', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:11AM', '650', '100003', '100002' UNION ALL

    SELECT '13', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:12AM', '650', '100001', '100008' UNION ALL

    SELECT '14', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:13AM', '650', '100001', '100009' UNION ALL

    SELECT '15', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:14AM', '650', '100001', '100010' UNION ALL

    SELECT '16', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:15AM', '650', '100001', '100007' UNION ALL

    SELECT '17', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:16AM', '650', '100001', '100002' UNION ALL

    SELECT '18', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:17AM', '650', '100004', '100008' UNION ALL

    SELECT '19', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:18AM', '650', '100001', '100009' UNION ALL

    SELECT '20', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:19AM', '650', '100001', '100010' UNION ALL

    SELECT '21', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:20AM', '650', '100001', '100007' UNION ALL

    SELECT '22', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:21AM', '650', '100001', '100002' UNION ALL

    SELECT '23', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:22AM', '650', '100001', '100002' UNION ALL

    SELECT '24', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:23AM', '650', '100005', '100002' UNION ALL

    SELECT '25', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:24AM', '650', '100001', '100002' UNION ALL

    SELECT '26', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:25AM', '650', '100001', '100002' UNION ALL

    SELECT '27', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:26AM', '650', '100001', '100002' UNION ALL

    SELECT '28', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:27AM', '650', '100001', '100002' UNION ALL

    SELECT '29', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:28AM', '650', '100001', '100002' UNION ALL

    SELECT '30', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:29AM', '650', '100006', '100002' UNION ALL

    SELECT '31', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:30AM', '650', '100001', '100002' UNION ALL

    SELECT '32', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:31AM', '650', '100001', '100002' UNION ALL

    SELECT '33', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:32AM', '650', '100001', '100002' UNION ALL

    SELECT '34', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:33AM', '650', '100001', '100002' UNION ALL

    SELECT '35', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:34AM', '650', '100001', '100002' UNION ALL

    SELECT '36', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:35AM', '650', '100002', '100002' UNION ALL

    SELECT '37', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:36AM', '650', '100001', '100002' UNION ALL

    SELECT '38', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:37AM', '650', '100001', '100002' UNION ALL

    SELECT '39', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:38AM', '650', '100001', '100002' UNION ALL

    SELECT '40', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:39AM', '650', '100001', '100002' UNION ALL

    SELECT '41', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:40AM', '650', '100001', '100002' UNION ALL

    SELECT '42', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:41AM', '650', '100002', '100002' UNION ALL

    SELECT '43', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:42AM', '650', '100001', '100002' UNION ALL

    SELECT '44', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:43AM', '650', '100004', '100002' UNION ALL

    SELECT '45', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:44AM', '650', '100001', '100002' UNION ALL

    SELECT '46', 'Jun 30 2011 12:00AM', 'Jan 14 2012 10:45AM', '650', '100001', '100002' UNION ALL

    SELECT '47', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:46AM', '650', '100001', '100002' UNION ALL

    SELECT '48', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:47AM', '650', '100001', '100002' UNION ALL

    SELECT '49', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:48AM', '650', '100001', '100002' UNION ALL

    SELECT '50', 'Jul 1 2011 12:00AM', 'Jan 14 2012 10:49AM', '650', '100001', '100002') d

    ORDER BY [date], [time], [price], AskClient

    INSERT INTO Trading ([ID], [date], [time], Price, AskClient, Bidclient)

    SELECT ID = ROW_NUMBER() OVER(ORDER BY [date], [Time], [price], AskClient, Bidclient),

    [date], [time], Price, AskClient, Bidclient

    FROM (

    SELECT [date] = DATEADD(dd,9*x.rn,[date]), [time], Price, AskClient, Bidclient

    FROM #Trading

    CROSS JOIN (SELECT TOP 20000 rn = ROW_NUMBER() OVER (ORDER BY @@SPID) -1 from sys.columns a, sys.columns b) x

    ) d

    ORDER BY CAST([date] AS DATETIME), CAST([Time] AS Time), [price], AskClient, Bidclient

    CREATE UNIQUE CLUSTERED INDEX [CXI_ID] ON Trading ([ID])

    CREATE NONCLUSTERED INDEX [IX_dti] ON Trading ([date], [time], [ID])

    CREATE NONCLUSTERED INDEX [IX_dpa] ON Trading ([date], [Price], [AskClient])

    Here's the queries.

    Query1 is Paul's, the now well-known dual window function method:

    ;WITH Numbered AS

    (

    SELECT

    GroupID =

    ROW_NUMBER() OVER (ORDER BY t.Date, t.Time, t.ID) -

    ROW_NUMBER() OVER (PARTITION BY t.AskClient, t.Date, t.Price ORDER BY t.Date, t.Time, t.ID),

    *

    FROM Trading AS t

    ), Ranges AS

    (

    SELECT

    MIN(N.ID) AS range_start_id,

    MAX(N.ID) AS range_end_id,

    COUNT_BIG(*) AS range_count

    FROM Numbered AS N

    GROUP BY N.GroupID

    HAVING COUNT_BIG(*) >= 5

    )

    SELECT

    t.Date,

    t.Price,

    t.AskClient,

    r.range_start_id,

    r.range_end_id,

    r.range_count

    FROM Ranges AS r

    JOIN Trading AS t ON

    t.ID = r.range_start_id

    ORDER BY t.ID

    -- 60,000 / 00:00:08

    Query2 is a rCTE running through the whole table row by ag.. row, incrementing the number of each new island by comparing values to the last row. It's not efficient but it's thorough and reliable, and quick to set up as a testbed for the validation of more efficient methods:

    PRINT 'Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------'

    ;WITH rCTE AS (

    SELECT a.ID, a.[Date], a.Price, a.AskClient,

    RangeGroup = CAST(1 AS BIGINT)

    FROM Trading a

    WHERE ID = 1

    UNION ALL

    SELECT tr.ID, tr.[Date], tr.Price, tr.AskClient,

    RangeGroup = CASE WHEN tr.[Date] = lr.[Date] AND tr.Price = lr.Price AND tr.AskClient = lr.AskClient THEN lr.RangeGroup ELSE lr.RangeGroup+1 END

    FROM Trading tr

    INNER JOIN rCTE lr ON lr.ID +1 = tr.ID

    )

    SELECT t.[Date], t.price, t.askclient,

    range_start_rn,

    range_end_rn,

    range_count

    FROM (

    SELECT

    range_start_rn = MIN(t.ID),

    range_end_rn = MAX(t.ID),

    range_count = COUNT(*)

    FROM rCTE t

    GROUP BY RangeGroup

    HAVING COUNT(*) > 4

    ) [Range]

    INNER JOIN Trading t ON t.ID = [Range].range_start_rn

    ORDER BY t.ID

    OPTION (MAXRECURSION 0)

    -- 60,000 / 00:00:27

    Query 3 is a rCTE tweaked a bit, shamelessly taking hints from PW's super-fast distinct:

    PRINT 'Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------'

    ;WITH rCTE AS (

    SELECT a.ID, a.[Date], a.Price, a.AskClient,

    [Lastrow] = CAST(0 AS INT)

    FROM Trading a

    WHERE a.[ID] = 1

    UNION ALL

    SELECT a.ID, a.[Date], a.Price, a.AskClient,

    a.[Lastrow]

    FROM ( -- skip to the first row of the next Date/Price/AskClient partition

    SELECT tr.*, [Lastrow] = lr.ID, rn2 = ROW_NUMBER() OVER (ORDER BY tr.ID)

    FROM Trading tr

    INNER JOIN rCTE lr ON tr.ID > lr.ID

    WHERE NOT (tr.[Date] = lr.[Date] AND tr.Price = lr.Price AND tr.AskClient = lr.AskClient)

    ) a

    WHERE rn2 = 1

    )

    SELECT t.[Date], t.Price, t.AskClient,

    range_start_rn = r.[lastrow],

    range_end_rn = r.ID-1,

    range_count = r.ID - r.[Lastrow]

    FROM rCTE r

    INNER JOIN Trading t ON t.ID = r.[lastrow]

    WHERE r.ID - r.[Lastrow] > 4

    OPTION (MAXRECURSION 0)

    -- 60,000 / 00:00:13

    Query4 is a plain aggregate - because the solution to the OP's problem isn't island-n-gaps.

    PRINT 'Query 4 straight aggregate -----------------------------------------------------------------------------------------------------------'

    SELECT [date], [price], AskClient,

    range_start_rn = MIN(t.ID),

    range_end_rn = MAX(t.ID),

    range_count = COUNT(*)

    FROM Trading t

    GROUP BY [date], [price], AskClient

    HAVING COUNT(*) > 4

    ORDER BY MIN(t.ID)

    -- 60,000 / 00:00:01

    Here are the results:

      Query 1 Standard dual window-function method-----------------------------------------------------------------------------------------------------------

      SQL Server Execution Times:

      CPU time = 0 ms, elapsed time = 0 ms.

      (60000 row(s) affected)

      Table 'Trading'. Scan count 3, logical reads 187895, physical reads 0, read-ahead reads 9, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:

      CPU time = 7535 ms, elapsed time = 9044 ms.

      Query 1 ===========================================================================================================

      Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------

      SQL Server Execution Times:

      CPU time = 0 ms, elapsed time = 0 ms.

      (60000 row(s) affected)

      Table 'Trading'. Scan count 0, logical reads 3180003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:

      CPU time = 23291 ms, elapsed time = 27005 ms.

      Query 2 ===========================================================================================================

      Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------

      SQL Server Execution Times:

      CPU time = 0 ms, elapsed time = 0 ms.

      (60000 row(s) affected)

      Table 'Trading'. Scan count 500000, logical reads 1684085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      Table 'Worktable'. Scan count 2, logical reads 3000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:

      CPU time = 12340 ms, elapsed time = 13360 ms.

      Query 3 ===========================================================================================================

      Query 4 straight aggregate -----------------------------------------------------------------------------------------------------------

      SQL Server Execution Times:

      CPU time = 0 ms, elapsed time = 0 ms.

      (60000 row(s) affected)

      Table 'Trading'. Scan count 1, logical reads 2610, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

      SQL Server Execution Times:

      CPU time = 764 ms, elapsed time = 1860 ms.

      Query 4 ===========================================================================================================

    So what can you conclude from this?

    - don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.

    - rCTE's ain't too bad, 13s for Q3 vs 9s for Q1

    -and I'm not yet convinced that the rCTE can't be further tweaked for performance.

    Cheers

    ChrisM

    Edit: updated the sample data generation.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (1/20/2012)


    - don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.

    Chris,

    The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Evil Kraig F (1/20/2012)


    ChrisM@home (1/20/2012)


    - don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.

    Chris,

    The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.

    Hi Craig

    Dupes ordered by [date] - which is one of the 3 columns to be evaluated together as a dupe. That means no islands, right?


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (1/20/2012)


    Evil Kraig F (1/20/2012)


    ChrisM@home (1/20/2012)


    - don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.

    Chris,

    The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.

    Hi Craig

    Dupes ordered by [date] - which is one of the 3 columns to be evaluated together as a dupe. That means no islands, right?

    If it was built off the TIME column described above instead of DATE, I'd agree. In this particular case it was by insertion order (since his examples was based off the identity field), but TIME would remove false positives from within the same day (at least as far as I'd consider the issue, but I don't know the original requirements and how over-simplified this is). The DATE field is not specific enough in the sample data to avoid false positives.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Evil Kraig F (1/20/2012)


    ChrisM@home (1/20/2012)


    Evil Kraig F (1/20/2012)


    ChrisM@home (1/20/2012)


    - don't treat a problem as islands-n-gaps if it isn't, because whatever method you use to resolve it, it's going to cost you.

    Chris,

    The only problem with this statement is he needed only sequential dupes to be counted together, not just any repeats. It's not a straight dupe check.

    Hi Craig

    Dupes ordered by [date] - which is one of the 3 columns to be evaluated together as a dupe. That means no islands, right?

    If it was built off the TIME column described above instead of DATE, I'd agree. In this particular case it was by insertion order (since his examples was based off the identity field), but TIME would remove false positives from within the same day (at least as far as I'd consider the issue, but I don't know the original requirements and how over-simplified this is). The DATE field is not specific enough in the sample data to avoid false positives.

    Since the OP's concluded that Paul's code works, and Paul's code sorts on date then time, then you are of course correct, Craig. I changed the first 5 rows of the sample set by inserting a new row after row 1, as follows:

    SELECT x = '0', [Date] = 'Jun 24 2011 12:00AM', [Time] = 'Jan 14 2012 10:40AM', [Price] = '650', [AskClient] = '100001', [BidClient] = '100002' UNION ALL

    SELECT '1', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:41AM', '650', '100002', '100002' UNION ALL

    SELECT '2', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:42AM', '650', '100001', '100007' UNION ALL

    SELECT '3', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:43AM', '650', '100001', '100008' UNION ALL

    SELECT '4', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:44AM', '650', '100001', '100009' UNION ALL

    SELECT '5', 'Jun 24 2011 12:00AM', 'Jan 14 2012 10:45AM', '650', '100001', '100010' UNION ALL

    It's row x='1' above, the old row is renamed to x='0'.

    This breaks a 5-row set based on date then time, leaving 2 remaining valid 5-row sets in the 50-row seed data. The 3 islands-n-gaps queries yield 40k rows as you'd expect, the aggregate yields 60k rows. Sorted, thanks.

    Point 2 remains, I'll just have to wait until an islands-n-gaps problem comes up, with longer runs in the islands 😉


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • ChrisM@home (1/20/2012)


    This breaks a 5-row set based on date then time, leaving 2 remaining valid 5-row sets in the 50-row seed data. The 3 islands-n-gaps queries yield 40k rows as you'd expect, the aggregate yields 60k rows. Sorted, thanks.

    Point 2 remains, I'll just have to wait until an islands-n-gaps problem comes up, with longer runs in the islands 😉

    Sorry to have been the fly in your ointment, but I did like your testing. 🙂


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Evil Kraig F (1/20/2012)


    ChrisM@home (1/20/2012)


    This breaks a 5-row set based on date then time, leaving 2 remaining valid 5-row sets in the 50-row seed data. The 3 islands-n-gaps queries yield 40k rows as you'd expect, the aggregate yields 60k rows. Sorted, thanks.

    Point 2 remains, I'll just have to wait until an islands-n-gaps problem comes up, with longer runs in the islands 😉

    Sorry to have been the fly in your ointment, but I did like your testing. 🙂

    Not at all :blush: thanks for the catch!


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Here's some results of testing the usual method against a rCTE

    Tests were performed against a million rows, consisting of a frame of 50 seed rows repeated with non-overlapping dates.

    With 5 islands in the 50-row frame:

    DatePriceAskClientrange_start_idrange_end_idrange_count

    2011-06-24650.0010000111111

    2011-06-24650.00100001132311

    2011-06-24650.0010000125295

    2011-06-24650.0010000131355

    2011-06-24650.0010000137415

    Query 1 Standard dual window-function method-----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 3, logical reads 310395, physical reads 0, read-ahead reads 4, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 6694 ms, elapsed time = 9206 ms.

    Query 1 ===========================================================================================================

    Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 0, logical reads 3300003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 23400 ms, elapsed time = 28736 ms.

    Query 2 ===========================================================================================================

    Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 280000, logical reads 1144085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 1680001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 7738 ms, elapsed time = 9638 ms.

    Query 3 ===========================================================================================================

    With 4 islands in the frame (joined islands 3 & 4), the result looks like this:

    DatePriceAskClientrange_start_idrange_end_idrange_count

    2011-06-24650.0010000111111

    2011-06-24650.00100001132311

    2011-06-24650.00100001253511

    2011-06-24650.0010000137415

    And the performance is like this:

    Query 1 Standard dual window-function method-----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 3, logical reads 249145, physical reads 0, read-ahead reads 4, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 5865 ms, elapsed time = 8414 ms.

    Query 1 ===========================================================================================================

    Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 0, logical reads 3240003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 23494 ms, elapsed time = 28361 ms.

    Query 2 ===========================================================================================================

    Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 240000, logical reads 964085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 1440001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 6801 ms, elapsed time = 8056 ms.

    Query 3 ===========================================================================================================

    Last run, two large islands, results:

    DatePriceAskClientrange_start_idrange_end_idrange_count

    2011-06-24650.0010000112323

    2011-06-24650.00100001254117

    and performance:

    Query 1 Standard dual window-function method-----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 3, logical reads 126645, physical reads 0, read-ahead reads 6, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 6085 ms, elapsed time = 8593 ms.

    Query 1 ===========================================================================================================

    Query 2 bogstandard RB(a)R rCTE-----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 0, logical reads 3120003, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 6000001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 23447 ms, elapsed time = 27771 ms.

    Query 2 ===========================================================================================================

    Query 3 tweaked rCTE -----------------------------------------------------------------------------------------------------------

    Table 'Trading'. Scan count 160000, logical reads 604085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 960001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 4758 ms, elapsed time = 5538 ms.

    Query 3 ===========================================================================================================

    If the proportion of rows comprising islands is significantly greater than 50%, then a rCTE is the way to go.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply