Need to get consecutive duplicate rows

  • I need a query that can get me (the ID of) consecutive duplicate rows, where I am grouping on some of the columns, don't care about rest.
    I know I have to use row  over to look for consecutive, and group by certain columns that make the dupes, but I am having a hard time with the grouping, and where to put the grouping (sub query, main query).

    The following sets up my data in a temp table :
    ---------------------------------------------------------------
    IF OBJECT_ID('TempDB..#Temp') IS NOT NULL
    DROP TABLE #Temp;

    CREATE TABLE #Temp
    ( ColA INT
    ,ColB INT
    ,ColC CHAR(1)
    ,ColD CHAR(1)
    ,ColE DECIMAL(3,2)
    )

    INSERT INTO #Temp ( ColA , ColB , ColC, ColD, ColE)
        SELECT 0, 1, 'A', 'A',1.1
    UNION ALL SELECT 1, 1, 'A', 'A',1.2
    UNION ALL SELECT 2, 1, 'A', 'A',1.3
    UNION ALL SELECT 3, 1, 'D', 'X',1.4
    UNION ALL SELECT 4, 1, 'A', 'A',1.5
    UNION ALL SELECT 5, 1, 'A', 'A',1.6
    UNION ALL SELECT 6, 1, 'A', 'A',1.7
    UNION ALL SELECT 7, 1, 'A', 'A',1.8
    UNION ALL SELECT 8, 2, 'A', 'B',2.1
    UNION ALL SELECT 9, 2, 'A', 'B',2.2
    UNION ALL SELECT 10, 2, 'D', 'X',2.3
    UNION ALL SELECT 11, 2, 'A', 'B',2.4
    UNION ALL SELECT 12, 2, 'A', 'B',2.5
    UNION ALL SELECT 13, 2, 'A', 'B',2.6
    UNION ALL SELECT 14, 2, 'A', 'B',1.1

    SELECT * FROM #Temp

    -A dupe is only considering ColB, C and D (highlighted below). I Need the IDs (ColA)

    ColA    ColB    ColC    ColD    ColE

    0    1    A    A    1.10
    1    1    A    A    1.20
    2    1    A    A    1.30
    3    1    D    X    1.40
    4    1    A    A    1.50
    5    1    A    A    1.60
    6    1    A    A    1.70
    7    1    A    A    1.80
    8    2    A    B    2.10
    9    2    A    B    2.20
    10    2    D    X    2.30
    11    2    A    B    2.40
    12    2    A    B    2.50
    13    2    A    B    2.60
    14    2    A    B    1.10

    ColA
    -----
     1 
    2

     5
     6
     7
     9
    12
    13
    14

    Any help would be appreciated!

  • Here is one way:

    WITH AllVals
    AS
    (
      SELECT
       *
      ,  PrevColB = LAG(t.ColB, 1) OVER (ORDER BY t.ColA)
      ,  PrevColC = LAG(t.ColC, 1) OVER (ORDER BY t.ColA)
      ,  PrevColD = LAG(t.ColD, 1) OVER (ORDER BY t.ColA)
      FROM #Temp t
    )
    SELECT v.ColA
    FROM AllVals v
    WHERE
       v.ColB  = v.PrevColB
       AND v.ColC = v.PrevColC
       AND v.ColD = v.PrevColD;

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • I know the post is in a 2012 forum, but just in case, here's another way that will work on versions pre-2012:

    WITH
    grouped AS
    (
    SELECT *, group_id=ColA-ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA ASC)
    FROM #temp
    )
    ,numbered AS
    (
    SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD,group_id ORDER BY ColA ASC)
    FROM grouped
    )

    SELECT ColA
    FROM numbered
    WHERE rn>1
    ORDER BY ColA ASC;

    Also, just for kicks and giggles, here's another way (I wouldn't actually use this one; I'm including it only because I tried my hand at finding the most concise query that satisfied the requirements, and this was the best I ended with):

    SELECT ColA=y
    FROM #Temp
       CROSS APPLY
         (VALUES(ColA),(ColA+1))x(y)
    GROUP BY y,ColB,ColC,ColD
    HAVING COUNT(*)=2;

    Cheers!

  • SSCoach:
    Awesome, that did it. I had to convert your WITHs to use temp tables, but thank you x 100 !!

  • I'm glad you got it sorted out!

    I'm curious, though. What exactly was the change you had to make?

    Cheers!

    EDIT: Fixed some wording I didn't like.

  • I mean I didn't use With...

    SELECT *, group_id=ColA-ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA)
    INTO #Temp2
    FROM #Temp

    SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD,group_id ORDER BY ColA)
    INTO #temp3
    FROM #temp2
    SELECT ColA FROM #temp3 WHERE  rn > 1 ORDER BY ColA;
  • Thanks!

    Continuation of my curiosity: why was it that that method had to be used instead of the CTEs? Performance differences, or something else? 

    Cheers!

  • Jacob Wilkins - Friday, February 16, 2018 2:49 PM

    I know the post is in a 2012 forum, but just in case, here's another way that will work on versions pre-2012:

    WITH
    grouped AS
    (
    SELECT *, group_id=ColA-ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA ASC)
    FROM #temp
    )
    ,numbered AS
    (
    SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD,group_id ORDER BY ColA ASC)
    FROM grouped
    )

    SELECT ColA
    FROM numbered
    WHERE rn>1
    ORDER BY ColA ASC;

    Also, just for kicks and giggles, here's another way (I wouldn't actually use this one; I'm including it only because I tried my hand at finding the most concise query that satisfied the requirements, and this was the best I ended with):

    SELECT ColA=y
    FROM #Temp
       CROSS APPLY
         (VALUES(ColA),(ColA+1))x(y)
    GROUP BY y,ColB,ColC,ColD
    HAVING COUNT(*)=2;

    Cheers!

    Careful, Jacob.  There's no guarantee that ColA won't have gaps in it.  Although it certainly does require another calculation, you might want to use the following to counter the eventuality of ColA having a gap in it.

      Grp_ID = ROW_NUMBER() OVER (ORDER BY ColA)
             - ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case,  then yes, a change on those lines is needed. 

    Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.

    The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂

  • Jacob Wilkins - Saturday, February 17, 2018 2:33 PM

    Jeff,

    Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case,  then yes, a change on those lines is needed. 

    Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.

    The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂

    Thanks. The consecutiveness is after order by Col2, Col1, Then Cols3-5 are same, and consecutive. I hope that helps.

  • Please describe why the solution i provided is unsatisfatory. It worked ok for me.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Jacob Wilkins - Saturday, February 17, 2018 2:33 PM

    Jeff,

    Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case,  then yes, a change on those lines is needed. 

    Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.

    The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂

    Understood but where did the OP state that the sequence numbers in ColA were guaranteed to not contain any gaps in the numeric sequence?  My suggestion is to always assume the worst when such things aren't specified.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Saturday, February 17, 2018 4:11 PM

    Jacob Wilkins - Saturday, February 17, 2018 2:33 PM

    Jeff,

    Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case,  then yes, a change on those lines is needed. 

    Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.

    The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂

    Understood but where did the OP state that the sequence numbers in ColA were guaranteed to not contain any gaps in the numeric sequence?  My suggestion is to always assume the worst when such things aren't specified.

    ColA would be Identity, but there could still be gaps (rollback on error, etc).

    Let me explain the original problem:
    Users could click many times, not realizing they were updating the item over and over (bad latency protection, I know). Each one of those updates caused a history snapshot in a history table (simulated in the table in my original post). Since the users are paid by the click, you can see the problem. I need to remove repetitive rows, indicated by exact ColC-D-E rows, in sequence one after another, with no changes in between, grouped by ColB. Column A is the IDENTITY pk.

    I believe the solution put forth by Jacob Wilkins satisfies the goal.

  • Nowhere. That's my point. In some common query scenarios you just want the "next" value, wherever that is, and that's when you need to be careful not to assume there are no gaps.

    My reading of "consecutive" is stronger than that, i.e., to be consecutive the sequence must be unbroken.

    On such a reading, it would actually be a mistake to treat 1 and 3 as consecutive just because there was no 2.

    It's not that I assumed there are no gaps; it's that I read gaps as not allowing the values on either side to be "consecutive".

    It's not the case that adding the row number and using that is something that works in all cases while mine fails in the "worst case".

    They're completely different answers to different requirements. Clarifying whether the OP meant consecutive in the sense in which I took it, or just "next" as you took it is well worth it, so thanks for mentioning it 🙂

    Cheers!

    EDIT: Fixed a typo.

  • Understood and appreciated.  But you're only focusing on what the OP asked.  While that certainly sounds like the right thing to do, even the OP has made no guarantee that all of the integers in ColA will be consecutive.  And it's not just the OP that I'm thinking of here.  Someone else might read this post for a similar problem and not understand that the key to all of this working is that the values of ColA must actually be consecutive (i.e. no "gaps") in order for this to work without the extra ROW_NUMBER(), which would make your good solution virtually bullet proof both for the OP and anyone else with a similar problem to solve.

    To be sure, I'm not bad mouthing you or your solution... I'm just suggesting a method to make it bullet-proof.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 29 total)

You must be logged in to reply to this topic. Login to reply