How to identify duplicate data with out Primary Key

Question

How to identify duplicate data with out Primary Key

monilps

SSCommitted

Points: 1686
More actions
January 13, 2016 at 8:52 am

#310801

Hi,
I need to identify possible duplicate (not exactly duplicate row) data without primary key:
Table A:
Col1 Col2 Col3 Col4 Col5 Col6
123 ABC XYZ 12-1-2015 76 76.78
123 ABCX XYZ 12-1-2015 76 76.78
So, row 2 is kind of duplicate and need to flag it.
Possible scenario, Any of the column might have some extra characters and need to flag as duplicate rows. In above case Row 2 might be duplicate since Col2 has "ABCX" compare to Col1 "ABC".
Please let me know how to achieve this.
Thanks in advance.

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

Phil Parkin SSC Guru Points: 246959 More actions · Answer 1

monilps (1/13/2016)
Hi,
I need to identify possible duplicate (not exactly duplicate row) data without primary key:
Table A:
Col1 Col2 Col3 Col4 Col5 Col6
123 ABC XYZ 12-1-2015 76 76.78
123 ABCX XYZ 12-1-2015 76 76.78
So, row 2 is kind of duplicate and need to flag it.
Possible scenario, Any of the column might have some extra characters and need to flag as duplicate rows. In above case Row 2 might be duplicate since Col2 has "ABCX" compare to Col1 "ABC".
Please let me know how to achieve this.
Thanks in advance.

So ... you need to tell us the rule for finding these duplicates. Is it:

Cols 1,3,4,5,6 exactly the same and first three characters of col2 the same?

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

monilps SSCommitted Points: 1686 More actions · Answer 2

Yes, So For Row 1 "ABC" is the data and for Row 2 "ABCX" is the data where "X" is additional.

NOTE: For Row 2 to flag as duplicate it should have Row 1 data plus few additional characters.

Steve Jones - SSC Editor SSC Guru Points: 736162 More actions · Answer 3

Phil is right. If this is any column, you might have a lot of work.

In general, you can use a count() on a column, with that column as a group by, to find duplicates.

CREATE TABLE a (col1 INT, col2 VARCHAR(20), col3 VARCHAR(20))

;

GO

INSERT a

VALUES

(123, 'ABC', 'XYZ'),

(123, 'ABCX', 'XYZ')

GO

SELECT col2, COUNT(col2)

FROM a

GROUP BY col2

HAVING COUNT(col2) > 1

GO

SELECT col3, COUNT(col3)

FROM a

GROUP BY col3

HAVING COUNT(col3) > 1

GO

DROP TABLE dbo.a

However, if you're looking for the beginning values in a column, that's tricky. Unless you have some idea of how many characters are duplicate, you are asking for a lot of checking.

monilps SSCommitted Points: 1686 More actions · Answer 4

monilps

SSCommitted

Points: 1686

January 13, 2016 at 9:24 am

#1851627

Yeah, this is for any column.

Phil Parkin SSC Guru Points: 246959 More actions · Answer 5

monilps (1/13/2016)
Yeah, this is for any column.

OK, this answer contradicts the earlier one.

So ... you want to identify rows where another row exists which is 'similar'.

'Similar' here means that all of the columns bar 1 are identical. The column which is not identical matches on the first three characters. But this could be any of the columns.

Is that more accurate? If that's what you want, this gets very ugly if the number of columns is more than a handful.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

monilps SSCommitted Points: 1686 More actions · Answer 6

For above example Col3 needs to check with Col3 only.

Phil Parkin SSC Guru Points: 246959 More actions · Answer 7

monilps (1/13/2016)
For above example Col3 needs to check with Col3 only.

I'm sorry, but your vague single-sentence responses are not providing the level of clarity required to answer such a complex requirement.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

monilps SSCommitted Points: 1686 More actions · Answer 8

I know it is complex:

Please see below, hope it is more clear:

Headers Col1 Col2 Col3 Col4 ... Coln

Row1 123 ABC XYZ 12/1/2015 ... qaz

Row2 345 XCV NHT 12/2/2015 ... wsx

Row3 123 ABX XYZ 12/2/2015 ... DFG

Row4 345 XCV NHTXV 12/2/2015 ... wsx

Row5 123 ABCX XYZ 12/1/2015 ... qaz

NOTE: I need to flag Row 4 and Row 5 as possible duplicate.

In order to flag as duplicate all the data should be same except one column and that column should have same data plus additional data.

1. Need to find rows which have similar data

2. Each column data should compare with same column data.

Thanks.

drew.allen SSC Guru Points: 76985 More actions · Answer 9

Even though your data doesn't have a primary key, it should have one or more natural keys. You need to identify what those natural key might be and use those in identifying possible duplicates. It's impossible to identify potential natural keys given your vague data.

If a record has multiple natural keys, you might consider weighting the matches on each of those natural keys and then reviewing pairs of records with the highest total weighted matches.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

ScottPletcher SSC Guru Points: 101092 More actions · Answer 10

Is the "base length" of the columns at least the same? That is, when you say the value is the same except chars are added, is the initial value always the same length, such a 3 bytes?

That is, could this be sample data:

col2

abc --all initial values are 3 chars?

def

ghi

defXY

But NOT this:

col2

jklm --4 chars and not 3

jklmXY

Although each column is separate, of course, so this would be fine:

col3

jklm --initial value always 4 chars for this column

opqr

stuv

jklmXY

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Phil Parkin SSC Guru Points: 246959 More actions · Answer 11

This fits your example data. But is it too specific for your real-world data?

if OBJECT_ID('tempdb..#dupes') is not null

drop table #dupes;

create table #dupes ( Id int identity (1,1) primary key clustered

, Col1 int

, Col2 Varchar(10)

, Col3 Varchar(10)

, Col4 Date

, Col5 Varchar(10) )

insert #dupes ( Col1, Col2, Col3, Col4, Col5 )

values ( 123, 'ABC', 'XYZ', '20150112', 'qaz' )

, ( 345, 'XCV', 'NHT', '20150212', 'wsx' )

, ( 123, 'ABX', 'XYZ', '20150212', 'DFG' )

, ( 345, 'XCV', 'NHTXV', '20150212', 'wsx' )

, ( 123, 'ABCX', 'XYZ', '20150112', 'qaz' )

select Col1

, Col2 = left(Col2,3)

, Col3

, Col4

, Col5

from #dupes

group by Col1

, left(Col2,3)

, Col3

, Col4

, Col5

having count(1) > 1

union all

select Col1

, Col2

, Col3 = left(Col3,3)

, Col4

, Col5

from #dupes

group by Col1

, Col2

, left(Col3,3)

, Col4

, Col5

having count(1) > 1

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

monilps SSCommitted Points: 1686 More actions · Answer 12

Hey Phil,

The query is close by what I am looking for but two things:

- Query is only resulting back original rows which has similar duplicate data.

- What if data is more than 3 characters in the real world which we need to compare.

Thanks for your help.

Phil Parkin SSC Guru Points: 246959 More actions · Answer 13

monilps (1/13/2016)
Hey Phil,
The query is close by what I am looking for but two things:
- Query is only resulting back original rows which has similar duplicate data.
- What if data is more than 3 characters in the real world which we need to compare.
Thanks for your help.

- What do you want to see?

- No problem. Modify the LEFT(string, [number of characters]) construction, where [number of characters] is the number you wish to compare.

The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin

monilps SSCommitted Points: 1686 More actions · Answer 14

Header Col1 Col2 Col3 Col4 ... Coln IsDuplicate

Row 1 123 ABC XYZ 12/1/2015 ... qaz No

Row 2 345 XCVE NHTX 12/2/2015 ... wsx No

Row 3 123 ABX XYZ 12/2/2015 ... DFG No

Row 4 345 XCVE NHTXVX 12/2/2015 ... wsx Yes

Row 5 123 ABCX XYZ 12/1/2015 ... qaz Yes