How to identify duplicate data with out Primary Key

  • it's helpful if you use a fixed font, so the Code=other or code=plain format to the left, when you post.

    How are rows 4 and 5 dupes? Unless you mean that row 4 is a dupe of row 1.

    If that's the case, then are there columns that you need to examine for potential different data? In other words, are these rows still dupes?

    Header Col1 Col2 Col3 Col4 ... Coln IsDuplicate

    Row 1 123 ABC XYZA 12/1/2015 ... qaz No

    Row 5 123 ABCX XYZ 12/1/2015 ... qaz1 Yes

  • Yes, you are correct; I meant Row 1 and Row 5 are dupes & Row 2 and Row 4 are dupes.

    In your example Row 1 and Row 5 are not dupes because they have different data for other columns. In order to identify dupes there should be only one column mismatch and that also it should be additional data to original one.

  • Using the harness from Phil's post:

    if OBJECT_ID('tempdb..#dupes') is not null

    drop table #dupes;

    create table #dupes (

    Id int identity (1,1) primary key clustered,

    Col1 int,

    Col2 Varchar(10),

    Col3 Varchar(10),

    Col4 Date,

    Col5 Varchar(10)

    )

    insert #dupes ( Col1, Col2, Col3, Col4, Col5 )

    values ( 123, 'ABC', 'XYZ', '20150112', 'qaz' )

    , ( 345, 'XCV', 'NHT', '20150212', 'wsx' )

    , ( 123, 'ABX', 'XYZ', '20150212', 'DFG' )

    , ( 345, 'XCV', 'NHTXV', '20150212', 'wsx' )

    , ( 123, 'ABCX', 'XYZ', '20150112', 'qaz' )

    SELECT *

    FROM #dupes dup

    INNER JOIN #dupes nondup ON dup.Id <> nondup.Id

    AND dup.Col1 = nondup.Col1

    AND dup.Col2 LIKE nondup.Col2 + '%'

    AND dup.Col3 LIKE nondup.Col3 + '%'

    AND dup.Col4 = nondup.Col4

    AND dup.Col5 LIKE nondup.Col5 + '%'

    SELECT dup.*,

    CASE WHEN nondup.id IS NULL THEN 'No' ELSE 'Yes' END [IsDuplicate]

    FROM #dupes dup

    LEFT JOIN #dupes nondup ON dup.Id <> nondup.Id

    AND dup.Col1 = nondup.Col1

    AND dup.Col2 LIKE nondup.Col2 + '%'

    AND dup.Col3 LIKE nondup.Col3 + '%'

    AND dup.Col4 = nondup.Col4

    AND dup.Col5 LIKE nondup.Col5 + '%'

    _____________
    Code for TallyGenerator

  • Thanks, this is exactly what I was looking for.

  • monilps (1/13/2016)


    Thanks, this is exactly what I was looking for.

    I disagree. Well, it may be what you are looking for, but it does not match your original requirement.

    In order to flag as duplicate all the data should be same except one column and that column should have same data plus additional data.

    Try adding this row to the values() statement and re-running:

    , ( 123, 'ABCZ', 'XYZZ', '20150112', 'qazz' )

    This row will be flagged as a duplicate, even though 3 or more of its columns differ from any one of the other rows.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

Viewing 5 posts - 16 through 19 (of 19 total)

You must be logged in to reply to this topic. Login to reply