How to identify duplicate data with out Primary Key

Question

How to identify duplicate data with out Primary Key

Viewing 5 posts - 16 through 19 (of 19 total)

You must be logged in to reply to this topic. Login to reply

Steve Jones - SSC Editor SSC Guru Points: 736162 More actions · Answer 1

it's helpful if you use a fixed font, so the Code=other or code=plain format to the left, when you post.

How are rows 4 and 5 dupes? Unless you mean that row 4 is a dupe of row 1.

If that's the case, then are there columns that you need to examine for potential different data? In other words, are these rows still dupes?

Header Col1 Col2 Col3 Col4 ... Coln IsDuplicate

Row 1 123 ABC XYZA 12/1/2015 ... qaz No

Row 5 123 ABCX XYZ 12/1/2015 ... qaz1 Yes

monilps SSCommitted Points: 1686 More actions · Answer 2

Yes, you are correct; I meant Row 1 and Row 5 are dupes & Row 2 and Row 4 are dupes.

In your example Row 1 and Row 5 are not dupes because they have different data for other columns. In order to identify dupes there should be only one column mismatch and that also it should be additional data to original one.

Sergiy SSC Guru Points: 110208 More actions · Answer 3

Using the harness from Phil's post:

if OBJECT_ID('tempdb..#dupes') is not null

drop table #dupes;

create table #dupes (

Id int identity (1,1) primary key clustered,

Col1 int,

Col2 Varchar(10),

Col3 Varchar(10),

Col4 Date,

Col5 Varchar(10)

)

insert #dupes ( Col1, Col2, Col3, Col4, Col5 )

values ( 123, 'ABC', 'XYZ', '20150112', 'qaz' )

, ( 345, 'XCV', 'NHT', '20150212', 'wsx' )

, ( 123, 'ABX', 'XYZ', '20150212', 'DFG' )

, ( 345, 'XCV', 'NHTXV', '20150212', 'wsx' )

, ( 123, 'ABCX', 'XYZ', '20150112', 'qaz' )

SELECT *

FROM #dupes dup

INNER JOIN #dupes nondup ON dup.Id <> nondup.Id

AND dup.Col1 = nondup.Col1

AND dup.Col2 LIKE nondup.Col2 + '%'

AND dup.Col3 LIKE nondup.Col3 + '%'

AND dup.Col4 = nondup.Col4

AND dup.Col5 LIKE nondup.Col5 + '%'

SELECT dup.*,

CASE WHEN nondup.id IS NULL THEN 'No' ELSE 'Yes' END [IsDuplicate]

FROM #dupes dup

LEFT JOIN #dupes nondup ON dup.Id <> nondup.Id

AND dup.Col1 = nondup.Col1

AND dup.Col2 LIKE nondup.Col2 + '%'

AND dup.Col3 LIKE nondup.Col3 + '%'

AND dup.Col4 = nondup.Col4

AND dup.Col5 LIKE nondup.Col5 + '%'

_____________
Code for TallyGenerator

monilps SSCommitted Points: 1686 More actions · Answer 4

monilps

SSCommitted

Points: 1686

January 13, 2016 at 8:27 pm

#1851797

Thanks, this is exactly what I was looking for.

Phil Parkin SSC Guru Points: 246959 More actions · Answer 5

monilps (1/13/2016)
Thanks, this is exactly what I was looking for.

I disagree. Well, it may be what you are looking for, but it does not match your original requirement.

In order to flag as duplicate all the data should be same except one column and that column should have same data plus additional data.

Try adding this row to the values() statement and re-running:

, ( 123, 'ABCZ', 'XYZZ', '20150112', 'qazz' )

This row will be flagged as a duplicate, even though 3 or more of its columns differ from any one of the other rows.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin