October 8, 2010 at 2:37 am
Hi All,
I have a situation to find the duplicate records in a table using the comparing columns. The comparing columns can have NULL values.
Sctipt to create the table
Create Table DupTest
(
OID bigint null,
Test VARCHAR(46) null,
TestColumn int null,
TestString VARCHAR(50) null,
IFFRN Bigint Identity(1,1)
)
Insert Script
Insert DupTest values
(1,'sdfsd',10,NULL),
(2,NULL,11,'test2'),
(1,'sdfsd',NULL,''),
(1,'sdfsd',NULL,''),
(1,'sdfsd',10,NULL),
(1,'sdfsd',NULL,NULL)
Select * from DupTest
OIDTestTestColumnTestStringIFFRN
[highlight="#FF99FF"]
1sdfsd10 NULL 1[/highlight]
2NULL11 test2 2[highlight="#FFFF00"]
1sdfsdNULL 3
1sdfsdNULL 4[/highlight][highlight="#FF99FF"]
1sdfsd10 NULL 5[/highlight]
1sdfsdNULL NULL 6
The Rows highlighted are the duplicate rows.
I have used the CTE with Row_Number() rank function to find the duplicate rows.
;WITH myCTE AS (
SELECT
oid,Test,TestColumn,TestString,IFFRN,
ROW_NUMBER() OVER (PARTITION BY oid,Test,TestColumn,TestString ORDER BY IFFRN) RowID
FROM DupTest
)
SELECT * FROM myCTE
oidTestTestColumnTestStringIFFRNRowID
1sdfsdNULL NULL 6 1[highlight="#FFFF00"]
1sdfsdNULL 3 1
1sdfsdNULL 4 2[/highlight][highlight="#FF99FF"]
1sdfsd10 NULL 1 1
1sdfsd10 NULL 5 2[/highlight]
2NULL11 test2 2 1
In the above result set I want only the highlighted set of Rows.
I need the result as below.. I need only the duplicated set of rows.
Please help me with the respective query.
IFFRNRowID
31
42
11
52
October 8, 2010 at 2:41 am
try using "count(*) over (partition by...."
October 8, 2010 at 2:55 am
Thanks dave... its working but in my real time scenario where i have a millon of records to compare count(*) over (paritition...
is taking very long time to return the result compare to ROW_NUMBER() function..
October 8, 2010 at 3:07 am
With proper indexing , it shouldnt be to bad
Alternatively comparing the min and max values can give you the same results
Create table BigTable
(
id integer identity,
RId integer not null
)
go
create index idxBigTable on BigTable (Rid) include(id)
go
insert into bigtable(rid)
Select ABS(checksum(newid())) %10000
from sys.columns a cross join sys.columns b
go
with cteCount
as(
select id,COUNT(*) over (partition by rid) as counter
from bigtable
)
select *
from cteCount
where counter> 1
go
go
select rid
from bigtable
group by rid
having MIN(id) <> MAX(id)
October 8, 2010 at 3:19 am
Maybe this could work (with the right index)... though i doubt it. Should probably be worse then the count version. But your the one with the data to test 🙂
;WITH myCTE AS (
SELECT
oid,Test,TestColumn,TestString,IFFRN,
RANK() OVER (PARTITION BY oid,Test,TestColumn,TestString ORDER BY IFFRN) RowID
FROM DupTest
)
select dt.*
from DupTest dt
join (select * from myCTE where RowId > 1) c
on c.oid = dt.oid
and (c.Test = dt.Test or (c.Test is null and dt.Test is null))
and (c.TestColumn = dt.TestColumn or (c.TestColumn is null and dt.TestColumn is null))
and (c.TestString = dt.TestString or (c.TestString is null and dt.TestString is null))
/T
October 8, 2010 at 3:52 am
Hi,
May be you can try this.
WITH myCTE AS (
SELECT
oid,Test,TestColumn,TestString,IFFRN,
ROW_NUMBER() OVER (PARTITION BY oid,ISNULL(Test,'X'),ISNULL(TestColumn,-12),ISNULL(TestString,'X') ORDER BY IFFRN) RowID
FROM DupTest
)
SELECT * FROM myCTE
October 9, 2010 at 7:55 pm
One alternative is the old school way: Do a summary query.
GROUP BY all the columns used to determine a duplicate . Store all results HAVING COUNT(*) > 1 in a table variable or #temp table whose clustered index consists of the "duplicate" columns. Make a second pass through your detail table joining it to the summary results table on all the "duplicate" columns.
What column(s) are used for your clustered index on the production table?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 10, 2010 at 6:05 am
Another one to try:-)
;WITH cte AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY oid, Test, TestColumn, TestString ORDER BY IFFRN) AS RowAsc,
ROW_NUMBER() OVER (PARTITION BY oid, Test, TestColumn, TestString ORDER BY IFFRN DESC) AS RowDesc,
oid, Test, TestColumn, TestString, IFFRN
FROM DupTest
)
SELECT oid, Test, TestColumn, TestString, IFFRN
FROM cte
WHERE NOT (RowAsc = RowDesc AND RowAsc = 1 AND RowDesc = 1)
October 13, 2010 at 12:22 am
Thanks for all your ideas and Suggestions.
Hi Dixie,
In the production the query is running on a view which has again joined using many other views. It has 4 views and 12 tables.
We are currently reviewing the design approach and possiblities of tuning the joins used in the views.
Thanks
October 13, 2010 at 7:32 am
If some of the views reference the same tables you may want to consider rewriting the query entirely to run without views.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply