Problem in finding duplicate records using CTE

Question

Post reply

Problem in finding duplicate records using CTE

ralm

Hall of Fame

Points: 3156
More actions
October 8, 2010 at 2:37 am

#229459

Hi All,
I have a situation to find the duplicate records in a table using the comparing columns. The comparing columns can have NULL values.
Sctipt to create the table
Create Table DupTest
(
OID bigint null,
Test VARCHAR(46) null,
TestColumn int null,
TestString VARCHAR(50) null,
IFFRN Bigint Identity(1,1)
)
Insert Script
Insert DupTest values
(1,'sdfsd',10,NULL),
(2,NULL,11,'test2'),
(1,'sdfsd',NULL,''),
(1,'sdfsd',NULL,''),
(1,'sdfsd',10,NULL),
(1,'sdfsd',NULL,NULL)
Select * from DupTest
OIDTestTestColumnTestStringIFFRN
[highlight="#FF99FF"]
1sdfsd10 NULL 1[/highlight]
2NULL11 test2 2[highlight="#FFFF00"]
1sdfsdNULL 3
1sdfsdNULL 4[/highlight][highlight="#FF99FF"]
1sdfsd10 NULL 5[/highlight]
1sdfsdNULL NULL 6
The Rows highlighted are the duplicate rows.
I have used the CTE with Row_Number() rank function to find the duplicate rows.
;WITH myCTE AS (
SELECT
oid,Test,TestColumn,TestString,IFFRN,
ROW_NUMBER() OVER (PARTITION BY oid,Test,TestColumn,TestString ORDER BY IFFRN) RowID
FROM DupTest
)
SELECT * FROM myCTE
oidTestTestColumnTestStringIFFRNRowID
1sdfsdNULL NULL 6 1[highlight="#FFFF00"]
1sdfsdNULL 3 1
1sdfsdNULL 4 2[/highlight][highlight="#FF99FF"]
1sdfsd10 NULL 1 1
1sdfsd10 NULL 5 2[/highlight]
2NULL11 test2 2 1
In the above result set I want only the highlighted set of Rows.
I need the result as below.. I need only the duplicated set of rows.
Please help me with the respective query.
IFFRNRowID
31
42
11
52
^{[font="Verdana"]Regards,
Rals[/font].}

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Dave Ballantyne SSC-Dedicated Points: 33667 More actions · Answer 1

try using "count(*) over (partition by...."

Clear Sky SQL
My Blog[/url]

ralm Hall of Fame Points: 3156 More actions · Answer 2

Thanks dave... its working but in my real time scenario where i have a millon of records to compare count(*) over (paritition...

is taking very long time to return the result compare to ROW_NUMBER() function..

^{[font="Verdana"]Regards,
Rals[/font].}

Dave Ballantyne SSC-Dedicated Points: 33667 More actions · Answer 3

With proper indexing , it shouldnt be to bad

Alternatively comparing the min and max values can give you the same results

Create table BigTable

(

id integer identity,

RId integer not null

)

go

create index idxBigTable on BigTable (Rid) include(id)

go

insert into bigtable(rid)

Select ABS(checksum(newid())) %10000

from sys.columns a cross join sys.columns b

go

with cteCount

as(

select id,COUNT(*) over (partition by rid) as counter

from bigtable

)

select *

from cteCount

where counter> 1

go

select rid

from bigtable

group by rid

having MIN(id) <> MAX(id)

Clear Sky SQL
My Blog[/url]

tommyh SSCertifiable Points: 6302 More actions · Answer 4

Maybe this could work (with the right index)... though i doubt it. Should probably be worse then the count version. But your the one with the data to test 🙂

;WITH myCTE AS (

SELECT

oid,Test,TestColumn,TestString,IFFRN,

RANK() OVER (PARTITION BY oid,Test,TestColumn,TestString ORDER BY IFFRN) RowID

FROM DupTest

)

select dt.*

from DupTest dt

join (select * from myCTE where RowId > 1) c

on c.oid = dt.oid

and (c.Test = dt.Test or (c.Test is null and dt.Test is null))

and (c.TestColumn = dt.TestColumn or (c.TestColumn is null and dt.TestColumn is null))

and (c.TestString = dt.TestString or (c.TestString is null and dt.TestString is null))

/T

G.Madhuri SSC Journeyman Points: 78 More actions · Answer 5

Hi,

May be you can try this.

WITH myCTE AS (

SELECT

oid,Test,TestColumn,TestString,IFFRN,

ROW_NUMBER() OVER (PARTITION BY oid,ISNULL(Test,'X'),ISNULL(TestColumn,-12),ISNULL(TestString,'X') ORDER BY IFFRN) RowID

FROM DupTest

)

SELECT * FROM myCTE

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 6

One alternative is the old school way: Do a summary query.

GROUP BY all the columns used to determine a duplicate . Store all results HAVING COUNT(*) > 1 in a table variable or #temp table whose clustered index consists of the "duplicate" columns. Make a second pass through your detail table joining it to the summary results table on all the "duplicate" columns.

What column(s) are used for your clustered index on the production table?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

steve-893342 SSCarpal Tunnel Points: 4439 More actions · Answer 7

Another one to try:-)

;WITH cte AS

(

SELECT ROW_NUMBER() OVER (PARTITION BY oid, Test, TestColumn, TestString ORDER BY IFFRN) AS RowAsc,

ROW_NUMBER() OVER (PARTITION BY oid, Test, TestColumn, TestString ORDER BY IFFRN DESC) AS RowDesc,

oid, Test, TestColumn, TestString, IFFRN

FROM DupTest

)

SELECT oid, Test, TestColumn, TestString, IFFRN

FROM cte

WHERE NOT (RowAsc = RowDesc AND RowAsc = 1 AND RowDesc = 1)

pb-916752 Old Hand Points: 323 More actions · Answer 8

Thanks for all your ideas and Suggestions.

Hi Dixie,

In the production the query is running on a view which has again joined using many other views. It has 4 views and 12 tables.

We are currently reviewing the design approach and possiblities of tuning the joins used in the views.

Thanks

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 9

If some of the views reference the same tables you may want to consider rewriting the query entirely to run without views.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills