February 16, 2018 at 1:57 pm
I need a query that returns the IDs of consecutive duplicate rows, where a duplicate is defined by some of the columns; the rest don't matter.
I know I have to use a window function (ROW_NUMBER/LAG ... OVER) to detect consecutive rows, and to group on the columns that define the dupes, but I'm having a hard time with the grouping and with where to put it (subquery vs. main query).
The following sets up my data in a temp table:
---------------------------------------------------------------
IF OBJECT_ID('TempDB..#Temp') IS NOT NULL
DROP TABLE #Temp;
CREATE TABLE #Temp
( ColA INT
,ColB INT
,ColC CHAR(1)
,ColD CHAR(1)
,ColE DECIMAL(3,2)
)
INSERT INTO #Temp ( ColA , ColB , ColC, ColD, ColE)
SELECT 0, 1, 'A', 'A',1.1
UNION ALL SELECT 1, 1, 'A', 'A',1.2
UNION ALL SELECT 2, 1, 'A', 'A',1.3
UNION ALL SELECT 3, 1, 'D', 'X',1.4
UNION ALL SELECT 4, 1, 'A', 'A',1.5
UNION ALL SELECT 5, 1, 'A', 'A',1.6
UNION ALL SELECT 6, 1, 'A', 'A',1.7
UNION ALL SELECT 7, 1, 'A', 'A',1.8
UNION ALL SELECT 8, 2, 'A', 'B',2.1
UNION ALL SELECT 9, 2, 'A', 'B',2.2
UNION ALL SELECT 10, 2, 'D', 'X',2.3
UNION ALL SELECT 11, 2, 'A', 'B',2.4
UNION ALL SELECT 12, 2, 'A', 'B',2.5
UNION ALL SELECT 13, 2, 'A', 'B',2.6
UNION ALL SELECT 14, 2, 'A', 'B',1.1
SELECT * FROM #Temp
A dupe considers only ColB, ColC, and ColD; ColA and ColE are ignored. I need the IDs (ColA) of the duplicate rows.
ColA ColB ColC ColD ColE
0 1 A A 1.10
1 1 A A 1.20
2 1 A A 1.30
3 1 D X 1.40
4 1 A A 1.50
5 1 A A 1.60
6 1 A A 1.70
7 1 A A 1.80
8 2 A B 2.10
9 2 A B 2.20
10 2 D X 2.30
11 2 A B 2.40
12 2 A B 2.50
13 2 A B 2.60
14 2 A B 1.10
Expected output:
ColA
-----
1
2
5
6
7
9
12
13
14
Any help would be appreciated!
February 16, 2018 at 2:12 pm
Here is one way:
WITH AllVals
AS
(
SELECT
*
, PrevColB = LAG(t.ColB, 1) OVER (ORDER BY t.ColA)
, PrevColC = LAG(t.ColC, 1) OVER (ORDER BY t.ColA)
, PrevColD = LAG(t.ColD, 1) OVER (ORDER BY t.ColA)
FROM #Temp t
)
SELECT v.ColA
FROM AllVals v
WHERE
v.ColB = v.PrevColB
AND v.ColC = v.PrevColC
AND v.ColD = v.PrevColD;
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
February 16, 2018 at 2:49 pm
I know the post is in a 2012 forum, but just in case, here's another way that will work on versions pre-2012:
WITH
grouped AS
(
SELECT *, group_id=ColA-ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA ASC)
FROM #Temp
)
,numbered AS
(
SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD,group_id ORDER BY ColA ASC)
FROM grouped
)
SELECT ColA
FROM numbered
WHERE rn>1
ORDER BY ColA ASC;
Also, just for kicks and giggles, here's another way (I wouldn't actually use this one; I'm including it only because I tried my hand at finding the most concise query that satisfies the requirements, and this is the best I ended up with):
SELECT ColA=y
FROM #Temp
CROSS APPLY
(VALUES(ColA),(ColA+1))x(y)
GROUP BY y,ColB,ColC,ColD
HAVING COUNT(*)=2;
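In case the trick isn't obvious, here's the same query annotated with how I read it (note that, like the other queries here, it assumes ColA has no gaps):

```sql
SELECT ColA = y
FROM #Temp
CROSS APPLY
    -- Each row with id n contributes two candidate ids: n and n+1.
    (VALUES (ColA), (ColA + 1)) x (y)
-- A given y therefore receives one contribution from row y itself (as ColA)
-- and one from row y-1 (as ColA+1), each tagged with that row's ColB/ColC/ColD.
GROUP BY y, ColB, ColC, ColD
-- COUNT(*) = 2 means rows y-1 and y both exist with identical
-- ColB/ColC/ColD, i.e. y duplicates its predecessor.
HAVING COUNT(*) = 2;
```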
Cheers!
February 16, 2018 at 4:07 pm
SSCoach:
Awesome, that did it. I had to convert your WITHs to use temp tables, but thank you x 100 !!
February 16, 2018 at 4:56 pm
I'm glad you got it sorted out!
I'm curious, though. What exactly was the change you had to make?
Cheers!
EDIT: Fixed some wording I didn't like.
February 17, 2018 at 6:58 am
SELECT *, group_id=ColA-ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA)
INTO #Temp2
FROM #Temp
February 17, 2018 at 10:40 am
Thanks!
Continuation of my curiosity: why was it that that method had to be used instead of the CTEs? Performance differences, or something else?
Cheers!
February 17, 2018 at 1:44 pm
Jacob Wilkins - Friday, February 16, 2018 2:49 PM
I know the post is in a 2012 forum, but just in case, here's another way that will work on versions pre-2012:
WITH
grouped AS
(
SELECT *, group_id=ColA-ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA ASC)
FROM #temp
)
,numbered AS
(
SELECT *, rn=ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD,group_id ORDER BY ColA ASC)
FROM grouped
)
SELECT ColA
FROM numbered
WHERE rn>1
ORDER BY ColA ASC;
Also, just for kicks and giggles, here's another way (I wouldn't actually use this one; I'm including it only because I tried my hand at finding the most concise query that satisfies the requirements, and this is the best I ended up with):
SELECT ColA=y
FROM #Temp
CROSS APPLY
(VALUES(ColA),(ColA+1))x(y)
GROUP BY y,ColB,ColC,ColD
HAVING COUNT(*)=2;
Cheers!
Careful, Jacob. There's no guarantee that ColA won't have gaps in it. Although it certainly does require another calculation, you might want to use the following to counter the eventuality of ColA having a gap in it.
Grp_ID = ROW_NUMBER() OVER (ORDER BY ColA)
- ROW_NUMBER() OVER (PARTITION BY ColB,ColC,ColD ORDER BY ColA)
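Folded into the earlier two-CTE query, Jeff's suggestion would look something like this (a sketch; Grp_ID simply replaces the ColA-based group_id, since the difference of two ROW_NUMBER()s stays stable even when ColA has gaps):

```sql
WITH grouped AS
(
    SELECT *,
           -- Gap-tolerant island id: both ROW_NUMBER()s advance together
           -- within a run of identical ColB/ColC/ColD values.
           Grp_ID = ROW_NUMBER() OVER (ORDER BY ColA)
                  - ROW_NUMBER() OVER (PARTITION BY ColB, ColC, ColD ORDER BY ColA)
    FROM #Temp
),
numbered AS
(
    SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY ColB, ColC, ColD, Grp_ID ORDER BY ColA)
    FROM grouped
)
SELECT ColA
FROM numbered
WHERE rn > 1        -- keep the first row of each run, report the repeats
ORDER BY ColA;
```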
--Jeff Moden
Change is inevitable... Change for the better is not.
February 17, 2018 at 2:33 pm
Jeff,
Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case, then yes, a change on those lines is needed.
Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.
The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂
February 17, 2018 at 2:37 pm
Jacob Wilkins - Saturday, February 17, 2018 2:33 PM
Jeff,
Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case, then yes, a change on those lines is needed.
Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.
The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂
Thanks. The consecutiveness is determined after ordering by ColB, then ColA; rows are consecutive duplicates when ColC through ColE are the same on adjacent rows. I hope that helps.
February 17, 2018 at 3:18 pm
Please describe why the solution I provided is unsatisfactory. It worked OK for me.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
February 17, 2018 at 4:11 pm
Jacob Wilkins - Saturday, February 17, 2018 2:33 PM
Jeff,
Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case, then yes, a change on those lines is needed.
Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.
The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂
Understood but where did the OP state that the sequence numbers in ColA were guaranteed to not contain any gaps in the numeric sequence? My suggestion is to always assume the worst when such things aren't specified.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 17, 2018 at 4:24 pm
Jeff Moden - Saturday, February 17, 2018 4:11 PM
Jacob Wilkins - Saturday, February 17, 2018 2:33 PM
Jeff,
Well aware of that. It was not specified that an id of, say, 7 counts as consecutive with an id of 5 so long as there are no rows with an id of 6. If that's the case, then yes, a change on those lines is needed.
Then again, maybe the requirement is that 5 and 7 are not to be counted as consecutive in the scenario above.
The queries were provided to meet the specifications provided. If the specifications are different, then yes, the queries will have to be different 🙂
Understood but where did the OP state that the sequence numbers in ColA were guaranteed to not contain any gaps in the numeric sequence? My suggestion is to always assume the worst when such things aren't specified.
ColA would be Identity, but there could still be gaps (rollback on error, etc).
Let me explain the original problem:
Users could click many times, not realizing they were updating the item over and over (poor latency protection, I know). Each of those updates created a snapshot in a history table (simulated by the table in my original post). Since the users are paid by the click, you can see the problem. I need to remove the repetitive rows: rows with identical ColC, ColD, and ColE values, in sequence one after another with no changes in between, grouped by ColB. ColA is the IDENTITY primary key.
I believe the solution put forth by Jacob Wilkins satisfies the goal.
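For what it's worth, since the end goal is deleting the repeated history rows (same ColC/ColD/ColE as the immediately preceding row within a ColB group, per the clarified spec), the LAG approach can drive the delete directly. A sketch, assuming SQL Server 2012+ and #Temp standing in for the real history table:

```sql
-- Delete every row whose ColC/ColD/ColE match the immediately preceding
-- row (by ColA) within the same ColB group, keeping the first of each run.
WITH Ordered AS
(
    SELECT ColA, ColC, ColD, ColE,
           PrevColC = LAG(ColC) OVER (PARTITION BY ColB ORDER BY ColA),
           PrevColD = LAG(ColD) OVER (PARTITION BY ColB ORDER BY ColA),
           PrevColE = LAG(ColE) OVER (PARTITION BY ColB ORDER BY ColA)
    FROM #Temp
)
DELETE FROM Ordered   -- deleting through the CTE deletes from #Temp
WHERE ColC = PrevColC
  AND ColD = PrevColD
  AND ColE = PrevColE;
```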
February 17, 2018 at 4:36 pm
Nowhere. That's my point. In some common query scenarios you just want the "next" value, wherever that is, and that's when you need to be careful not to assume there are no gaps.
My reading of "consecutive" is stronger than that, i.e., to be consecutive the sequence must be unbroken.
On such a reading, it would actually be a mistake to treat 1 and 3 as consecutive just because there was no 2.
It's not that I assumed there are no gaps; it's that I read gaps as not allowing the values on either side to be "consecutive".
It's not the case that adding the row number and using that is something that works in all cases while mine fails in the "worst case".
They're completely different answers to different requirements. Clarifying whether the OP meant consecutive in the sense in which I took it, or just "next" as you took it is well worth it, so thanks for mentioning it 🙂
Cheers!
EDIT: Fixed a typo.
February 18, 2018 at 8:24 am
Understood and appreciated. But you're focusing only on what the OP asked. While that certainly sounds like the right thing to do, even the OP has made no guarantee that all of the integers in ColA will be consecutive. And it's not just the OP I'm thinking of here. Someone else might read this post with a similar problem and not understand that the key to all of this working without the extra ROW_NUMBER() is that the values of ColA must actually be consecutive (i.e. no gaps). Adding it would make your good solution virtually bulletproof, both for the OP and for anyone else with a similar problem to solve.
To be sure, I'm not bad-mouthing you or your solution... I'm just suggesting a method to make it bulletproof.
--Jeff Moden
Change is inevitable... Change for the better is not.