Grouping by min/max in a series

  • Good Afternoon;

    Any chance I can do the following in T-SQL. I've done something similar, and as usual, can't find the source.

    Table:

    Code    Status

    1       S

    2       T

    3       T

    4       T

    5       S

    6       S

    7       S

    10      A

    11      T

    12      T

    .

    .

    .

    Results desired:

    min     max     status

    1       1       S

    2       4       T

    5       7       S

    10      10      A

    11      14      T

    .

    .

    appreciate any thoughts...

    Randy.

     

  • Does column Code contain unique ascending sequence of values ?

  • Yes. There may be skips (1,2,3,5,6,7,11,12,13), but they are unique, ascending.

     

     

  • This one is a little ugly but it should work.  Depending on your real purpose this might be a job for a cursor.  Otherwise, if you are using SQL 2005 you can do it with a PARTITION statement (Google for "SQL count consecutive rows same value").  Otherwise, here is a shot:

    SELECT DISTINCT 

    ISNULL(

     (SELECT TOP 1 PriorTable.Code AS PriorCodeStatus

      FROM MyTable AS PriorTable

      WHERE PriorTable.Code <= MyTable.Code

       AND PriorTable.Status = MyTable.Status

       AND PriorTable.Code >=

        ISNULL(

        (SELECT TOP 1 PrevTable.Code FROM MyTable AS PrevTable

        WHERE PrevTable.Code < MyTable.Code

        AND PrevTable.Status <> MyTable.Status

        ORDER BY Code Desc),0)

      ORDER BY Code),0)

    AS MinCode,

    ISNULL(

     (SELECT TOP 1 MaxTable.Code AS MaxCodeStatus

      FROM MyTable AS MaxTable

      WHERE MaxTable.Code >= MyTable.Code

       AND MaxTable.Status = MyTable.Status

       AND MaxTable.Code <=

        ISNULL(

        (SELECT TOP 1 NxtTable.Code FROM MyTable AS NxtTable

        WHERE NxtTable.Code > MyTable.Code

        AND NxtTable.Status <> MyTable.Status

        ORDER BY Code),Code)

      ORDER BY Code DESC),0)

    AS MaxCode, Status

    FROM MyTable

    ORDER BY MinCode

    Basically it is a series of nested TOP 1 queries which determine the first, last, previous and next Code for a given Status grouping.

    Hope that helps.

    Brian

  • Brian -

    Sorry for the long pause, but thanks, that was the trick! I'm breaking it down now to 'burn -in' the logic.

     

     

  • The following is similar to Brian's code but should be more efficient. If you have more than 10000 rows you may be better off using cursors/temp tables.

    SELECT MIN(D.MinCode) AS MinCode

        ,D.MaxCode

        ,D.status

    FROM (

            SELECT T1.code AS MinCode

                ,MAX(T2.code) AS MaxCode

                ,T1.status

            FROM YourTable T1

                JOIN YourTable T2

                    ON T1.status = T2.status

                        AND T1.code <= T2.code

            WHERE T1.status = ALL (

                    SELECT T3.status

                    FROM YourTable T3

                    WHERE T3.code BETWEEN T1.code AND T2.code

                )

            GROUP BY T1.code, T1.status

        ) D

    GROUP BY D.MaxCode, D.status

    ORDER BY MinCode

  • The logic for working out the above is:

    1. Self join the table on status where one code is less than or equal to the other code:

    SELECT T1.code AS MinCode

        ,T2.code AS MaxCode

        ,T1.status

    FROM YourTable T1

        JOIN YourTable T2

            ON T1.status = T2.status

                AND T1.code <= T2.code

    2. Filter out rows where intermediate statuses are not the same.

    SELECT T1.code AS MinCode

        ,T2.code AS MaxCode

        ,T1.status

    FROM YourTable T1

        JOIN YourTable T2

            ON T1.status = T2.status

                AND T1.code <= T2.code

    WHERE T1.status = ALL (

            SELECT T3.status

            FROM YourTable T3

            WHERE T3.code BETWEEN T1.code AND T2.code

        )

    3. Then do MIN and MAX to get the start and end codes of the range.

     

  • Ken, Brian-

     

    Thanks for the replys. Now off to do validation as both answers returned different #rows. Let me compare the two and I'll get back.

     

     

  • Thanks to David le Quesne for showing me the above approach. Here is the thread:

    http://www.sqlservercentral.com/forums/shwmessage.aspx?forumid=8&messageid=303210&p=2

  • After review -

    Original table contained non-unique entries, in error. Obtained correct list and refreshed. Table contains 848 rows. 

    Both solutions now retrive identical count & valued rows (157). Only difference is time. Brians solution returns in less than 1 second , while Kens took 11 seconds.

     

     

     

  • In case someone wanted Brian's solution broken down into its process and components I went back through and did that.

    Process:

    1. Select all the rows (MyTable)

    2. SubQuery:  Get the highest Code (PrevTable) where the Status doesn't match the current row and Code < current row (Top 1 DESC)

    3. SubQuery:  Get the lowest Code (PriorTable) where the Status is equal to the the current row and Code > #2 result. (Top 1 ASC) That should be the first occurrence of this Status in the current group.

    4. Wrap #3 result in an IsNull->0 and that is the Minimum Code for the group

    5. SubQuery:  Get the lowest Code (NxtTable) where the Status doesn't match the current row and Code is greater than current row, return current code for Null (Top 1 ASC)

    6. SubQuery:  Get the highest Code (MaxTable) where the Status is equal to the the current row and Code <= #5 result. (Top 1 DESC) That should be the last occurrence of this Status in the current group.

    7. I think there is an unnecessary IsNull->0 there.

    8. Return only the DISTINCT records to eliminate the dupes since the recordset is not grouped.

    #1 and 8:

    SELECT DISTINCT 

     (#2-4)   AS MinCode,

     (5-7)    AS MaxCode,

     Status

     FROM MyTable ORDER BY MinCode

     

     

    #2:

    ISNULL(

        (SELECT TOP 1 PrevTable.Code FROM MyTable AS PrevTable

        WHERE PrevTable.Code < MyTable.Code

        AND PrevTable.Status <> MyTable.Status

        ORDER BY Code Desc),0)

    #3-4:

    ISNULL(

     (SELECT TOP 1 PriorTable.Code AS PriorCodeStatus

      FROM MyTable AS PriorTable

      WHERE PriorTable.Code <= MyTable.Code

       AND PriorTable.Status = MyTable.Status

       AND PriorTable.Code >=

        //***** #1 Subquery Here

      ORDER BY Code),0)

    #5:

    ISNULL(

        (SELECT TOP 1 NxtTable.Code FROM MyTable AS NxtTable

        WHERE NxtTable.Code > MyTable.Code

        AND NxtTable.Status <> MyTable.Status

        ORDER BY Code),Code)

       

    #6-7:

    ISNULL(

     (SELECT TOP 1 MaxTable.Code AS MaxCodeStatus

      FROM MyTable AS MaxTable

      WHERE MaxTable.Code >= MyTable.Code

       AND MaxTable.Status = MyTable.Status

       AND MaxTable.Code <=

        //***** #5 Subquery Here

      ORDER BY Code DESC),0)

    I hope that is helpful.  I think you can eliminate the IsNull in #6-7 and also the one that wraps #3.  Beyond that it is just a brute force solution that avoids the restrictions that Group By imposes on subqueries.  To really do this properly I would probably attempt to do this with a cursor and with a temp table or a function that returns a table and compare the execution plans and statistics based on the actual data.  Of course, like lots of us, once I have a solution it is on to the next crisis...

  • Thanks for this.

    As a matter of interest, how quick is my style of solution if there is an index on status,code?

     

  • Try this too!

    -- Prepare sample data

    DECLARE @Sample TABLE (ID INT, Status VARCHAR(9))

    INSERT @Sample

    SELECT 1, 'S' UNION ALL

    SELECT 2, 'T' UNION ALL

    SELECT 3, 'T' UNION ALL

    SELECT 5, 'T' UNION ALL

    SELECT 10, 'S' UNION ALL

    SELECT 16, 'S' UNION ALL

    SELECT 17, 'S' UNION ALL

    SELECT 20, 'A' UNION ALL

    SELECT 31, 'T' UNION ALL

    SELECT 32, 'T'

    -- Stage the data

    DECLARE @Stage TABLE (RecID INT IDENTITY(1, 1), ID INT, Status VARCHAR(9), First INT, Last INT)

    INSERT  @Stage (ID, Status)

    SELECT  ID,

      Status

    FROM  @Sample

    ORDER BY ID

    UPDATE s

    SET s.First = (SELECT MAX(x.RecID) FROM @Stage AS x WHERE x.Status <> s.Status AND x.RecID < s.RecID) + 1,

     s.Last = (SELECT MIN(x.RecID) FROM @Stage AS x WHERE x.Status <> s.Status AND x.RecID > s.RecID) - 1

    FROM @Stage AS s

    UPDATE @Stage

    SET First = 1

    WHERE First IS NULL

    DECLARE @Max INT

    SELECT @Max = MAX(RecID)

    FROM @Stage

    WHERE Last IS NULL

    UPDATE @Stage

    SET Last = @max-2

    WHERE Last IS NULL

    -- Show the expected result

    SELECT DISTINCT t1.ID AS [Min],

      t2.ID AS [Max],

      s.Status

    FROM  @Stage AS s

    INNER JOIN @Stage AS t1 ON t1.RecID = s.First

    INNER JOIN @Stage AS t2 ON t2.RecID = s.Last

    ORDER BY t1.ID


    N 56°04'39.16"
    E 12°55'05.25"

  •  

    As some have suggested, both methods get pretty slow, even in the presence of proper indexing, simply because of the triangular joins involved. As is typical of large triangular joins, they spawn so many "internal" rows trying to come up with an answer that they become pretty much unusable for anything more than about 10,000 rows. Here's a chart of the performance of Brian's, Ken's, and a "new" query to do this…  All times are in seconds...

    Rows

    500

    884

    1000

    2000

    4000

    8000

    16000

    100000

    1000000

    Jeff

    0.123

    0.190

    0.250

    0.436

    0.483

    0.843

    1.390

    6.716

    41.860

    Brian

    0.576

    1.483

    1.830

    7.080

    29.373

    118.720

    466.576

    Ken

    0.376

    1.296

    1.640

    6.483

    30.283

    145.938

    762.016

    Before we get to the "Jeff" query, lemme just say that the idea of using either a CURSOR or a WHILE Loop would probably get the job done but I'm not so sure it would get it done as quickly.  Yeah, a cursor would probably do it... but not as fast as what I'm going to show you... and, don't balk at the temp table... where do you think a cursor is stored?  Thats right... TempDB... just like the internals for triangular joins or a temp table.

    So, if you have a million row table that looks like this...

        SET NOCOUNT ON

        USE TempDB

    --===== If the test table already exists, drop it

         IF OBJECT_ID('MyTable') IS NOT NULL

            DROP TABLE MyTable

    --===== Create the test table

     CREATE TABLE MyTable (Code INT IDENTITY(1,1),Status CHAR(1) NOT NULL)

    --===== Populate the test table with required data

     INSERT INTO MyTable (Status)

     SELECT TOP 1000000

            CAST(RAND(CAST(NEWID() AS VARBINARY))*3.0 AS INT)+1 AS Status

       FROM Master.dbo.SYSCOLUMNS sc1,

            Master.dbo.SYSCOLUMNS sc2

     UPDATE MyTable

        SET Status = CASE STATUS WHEN '1' THEN 'A' WHEN '2' THEN 'S' WHEN '3' THEN 'T' END

    --===== Add some indexes just to be fair

     ALTER TABLE MyTable

       ADD CONSTRAINT PK_MyTable_Code_Status PRIMARY KEY CLUSTERED (Code ASC,Status ASC)

     CREATE INDEX IX_MyTable2 ON MyTable (Code DESC,Status DESC)

    --===== Display the first 100 rows

     SELECT top 100 * FROM MyTable ORDER BY Code

    ... the following code will work faster than a cursor (just can't bring myself to use one ) and just about any other method...

    --===== Variable for speed measurement only

    DECLARE @StartTime DATETIME

        SET @StartTime = GETDATE()

    --===== Move the data into a temp table where we can work on it

         -- and add a couple of extra columns (takes about 3 seconds_

     SELECT Code,Status,CAST(NULL AS INT) AS MyMin, CAST(NULL AS INT) AS MyMax

      INTO #MyHead

    FROM MyTable

    --===== Add some required indexing (takes about 5 seconds)

     ALTER TABLE #MyHead

       ADD CONSTRAINT PK_MyHead_Code_Status PRIMARY KEY CLUSTERED (Code ASC,Status ASC)

     CREATE INDEX IX_MyHead2 ON #MyHead (Code DESC,Status DESC)

    --===== Create an update statement using dynamic SQL so the new indexes can be used

         -- in this same batch without an error (takes 0 seconds)

    DECLARE @sql VARCHAR(8000)

        SET @sql = '

    DECLARE @PrevStatus CHAR(1)

    DECLARE @PrevCode INT

        SET @PrevStatus = '' ''

        SET @PrevCode = 0

    --===== "Smears" the Mins across all rows

     UPDATE #MyHead

       SET  @PrevCode = MyMin = CASE WHEN Status = @PrevStatus THEN @PrevCode ELSE Code END,

            @PrevStatus = Status

       FROM #MyHead WITH (INDEX(PK_MyHead_Code_Status))

    SET @PrevStatus = '' ''

    SET @PrevCode = 0

    --===== "Smears" the Maxs across all rows

     UPDATE #MyHead

       SET  @PrevCode = MyMax = CASE WHEN Status = @PrevStatus THEN @PrevCode ELSE Code END,

            @PrevStatus = Status

       FROM #MyHead WITH (INDEX(IX_MyHead2))

    '

    --===== Execute the dynamic SQL to do the min/max "smears" (takes about 19 seconds)

       EXEC (@SQL)

    --===== Display the desired result (takes about 10 seconds)

     SELECT DISTINCT

            MyMin AS Min, MyMax AS Max, Status

       FROM #MyHead

      ORDER BY Min

    DROP TABLE #MyHead

    PRINT DATEDIFF(ms,@StartTime,GETDATE())

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Peter, not sure what your's would do if you added a primary key to the table but it currently take about 15 seconds to do 4,000 rows, 60 seconds to do 8,000 rows, and 240 seconds to do 16,000 rows (double the rows takes 4 times longer).  Like I said, those nasty little triangular joins just eat your face off

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply