How to identify duplicate data with out Primary Key

  • Hi,

    I need to identify possible duplicate (not exactly duplicate row) data without primary key:

    Table A:

    Col1 Col2 Col3 Col4 Col5 Col6

    123 ABC XYZ 12-1-2015 76 76.78

    123 ABCX XYZ 12-1-2015 76 76.78

    So, row 2 is kind of duplicate and need to flag it.

    Possible scenario, Any of the column might have some extra characters and need to flag as duplicate rows. In above case Row 2 might be duplicate since Col2 has "ABCX" compare to Col1 "ABC".

    Please let me know how to achieve this.

    Thanks in advance.

  • monilps (1/13/2016)


    Hi,

    I need to identify possible duplicate (not exactly duplicate row) data without primary key:

    Table A:

    Col1 Col2 Col3 Col4 Col5 Col6

    123 ABC XYZ 12-1-2015 76 76.78

    123 ABCX XYZ 12-1-2015 76 76.78

    So, row 2 is kind of duplicate and need to flag it.

    Possible scenario, Any of the column might have some extra characters and need to flag as duplicate rows. In above case Row 2 might be duplicate since Col2 has "ABCX" compare to Col1 "ABC".

    Please let me know how to achieve this.

    Thanks in advance.

    So ... you need to tell us the rule for finding these duplicates. Is it:

    Cols 1,3,4,5,6 exactly the same and first three characters of col2 the same?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Yes, So For Row 1 "ABC" is the data and for Row 2 "ABCX" is the data where "X" is additional.

    NOTE: For Row 2 to flag as duplicate it should have Row 1 data plus few additional characters.

  • Phil is right. If this is any column, you might have a lot of work.

    In general, you can use a count() on a column, with that column as a group by, to find duplicates.

    CREATE TABLE a (col1 INT, col2 VARCHAR(20), col3 VARCHAR(20))

    ;

    GO

    INSERT a

    VALUES

    (123, 'ABC', 'XYZ'),

    (123, 'ABCX', 'XYZ')

    GO

    SELECT col2, COUNT(col2)

    FROM a

    GROUP BY col2

    HAVING COUNT(col2) > 1

    GO

    SELECT col3, COUNT(col3)

    FROM a

    GROUP BY col3

    HAVING COUNT(col3) > 1

    GO

    DROP TABLE dbo.a

    However, if you're looking for the beginning values in a column, that's tricky. Unless you have some idea of how many characters are duplicate, you are asking for a lot of checking.

  • Yeah, this is for any column.

  • monilps (1/13/2016)


    Yeah, this is for any column.

    OK, this answer contradicts the earlier one.

    So ... you want to identify rows where another row exists which is 'similar'.

    'Similar' here means that all of the columns bar 1 are identical. The column which is not identical matches on the first three characters. But this could be any of the columns.

    Is that more accurate? If that's what you want, this gets very ugly if the number of columns is more than a handful.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • For above example Col3 needs to check with Col3 only.

  • monilps (1/13/2016)


    For above example Col3 needs to check with Col3 only.

    I'm sorry, but your vague single-sentence responses are not providing the level of clarity required to answer such a complex requirement.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • I know it is complex:

    Please see below, hope it is more clear:

    Headers Col1 Col2 Col3 Col4 ... Coln

    Row1 123 ABC XYZ 12/1/2015 ... qaz

    Row2 345 XCV NHT 12/2/2015 ... wsx

    Row3 123 ABX XYZ 12/2/2015 ... DFG

    Row4 345 XCV NHTXV 12/2/2015 ... wsx

    Row5 123 ABCX XYZ 12/1/2015 ... qaz

    NOTE: I need to flag Row 4 and Row 5 as possible duplicate.

    In order to flag as duplicate all the data should be same except one column and that column should have same data plus additional data.

    1. Need to find rows which have similar data

    2. Each column data should compare with same column data.

    Thanks.

  • Even though your data doesn't have a primary key, it should have one or more natural keys. You need to identify what those natural key might be and use those in identifying possible duplicates. It's impossible to identify potential natural keys given your vague data.

    If a record has multiple natural keys, you might consider weighting the matches on each of those natural keys and then reviewing pairs of records with the highest total weighted matches.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Is the "base length" of the columns at least the same? That is, when you say the value is the same except chars are added, is the initial value always the same length, such a 3 bytes?

    That is, could this be sample data:

    col2

    abc --all initial values are 3 chars?

    def

    ghi

    defXY

    But NOT this:

    col2

    jklm --4 chars and not 3

    jklmXY

    Although each column is separate, of course, so this would be fine:

    col3

    jklm --initial value always 4 chars for this column

    opqr

    stuv

    jklmXY

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • This fits your example data. But is it too specific for your real-world data?

    if OBJECT_ID('tempdb..#dupes') is not null

    drop table #dupes;

    create table #dupes ( Id int identity (1,1) primary key clustered

    , Col1 int

    , Col2 Varchar(10)

    , Col3 Varchar(10)

    , Col4 Date

    , Col5 Varchar(10) )

    insert #dupes ( Col1, Col2, Col3, Col4, Col5 )

    values ( 123, 'ABC', 'XYZ', '20150112', 'qaz' )

    , ( 345, 'XCV', 'NHT', '20150212', 'wsx' )

    , ( 123, 'ABX', 'XYZ', '20150212', 'DFG' )

    , ( 345, 'XCV', 'NHTXV', '20150212', 'wsx' )

    , ( 123, 'ABCX', 'XYZ', '20150112', 'qaz' )

    select Col1

    , Col2 = left(Col2,3)

    , Col3

    , Col4

    , Col5

    from #dupes

    group by Col1

    , left(Col2,3)

    , Col3

    , Col4

    , Col5

    having count(1) > 1

    union all

    select Col1

    , Col2

    , Col3 = left(Col3,3)

    , Col4

    , Col5

    from #dupes

    group by Col1

    , Col2

    , left(Col3,3)

    , Col4

    , Col5

    having count(1) > 1

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Hey Phil,

    The query is close by what I am looking for but two things:

    - Query is only resulting back original rows which has similar duplicate data.

    - What if data is more than 3 characters in the real world which we need to compare.

    Thanks for your help.

  • monilps (1/13/2016)


    Hey Phil,

    The query is close by what I am looking for but two things:

    - Query is only resulting back original rows which has similar duplicate data.

    - What if data is more than 3 characters in the real world which we need to compare.

    Thanks for your help.

    - What do you want to see?

    - No problem. Modify the LEFT(string, [number of characters]) construction, where [number of characters] is the number you wish to compare.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Header Col1 Col2 Col3 Col4 ... Coln IsDuplicate

    Row 1 123 ABC XYZ 12/1/2015 ... qaz No

    Row 2 345 XCVE NHTX 12/2/2015 ... wsx No

    Row 3 123 ABX XYZ 12/2/2015 ... DFG No

    Row 4 345 XCVE NHTXVX 12/2/2015 ... wsx Yes

    Row 5 123 ABCX XYZ 12/1/2015 ... qaz Yes

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply