Removing partially duplicate records in SQL

  • Hello,

    I have a table with 1 billion records and I'd like to delete some records that are partially duplicate.

    Ex) the table below has three records

    RECID COL1 COL2 COL3

    ---------------------------------------

    1 A B C

    2 A B 'blank'

    3 A 'blank' C

    As you can see, RECID 2 and RECID 3 are duplicates because RECID 1 has all the values that 2 and 3 have.

    Since I have 1 Billion records, I don't want to go through row by row.

    What would be the best way to eliminate the duplicates in this case?

    Thanks a lot in advance.

  • What are your rules for determining if a row is a duplicate?

  • Thanks Jack,

    I already removed the pure duplicates (same values in all the columns), so the basic rule is if record A has values (excluding 'blank') that all exists in another record, it's a duplicate.

    RECID COL1 COL2 COL3

    ---------------------------------------

    1 A B C

    2 A B 'blank'

    3 A 'blank' C

    So, RECID 2 and RECID 3 are duplicates because the values (A, B in RECID 2 and A, C in RECID 3 excluding 'blank') all exist in RECID 1.

    below is to create the sample table

    create table dbo.TEMP (RECID int identity(1,1), COL1 varchar(1), COL2 varchar(1), COL3 varchar(1))

    insert into TEMP select 'A','B','C'

    insert into TEMP select 'A','B',''

    insert into TEMP select 'A','','C'

    I need to find out how to select RECID 2 and RECID 3, so that I can delete them from the table.

    Once again, unfortunately, I can't do row by row comparison just because the input table is too big.

    Thanks a lot

  • Max,

    Can there be a 4th possibility or not?

    RECID COL1 COL2 COL3

    ---------------------------------------

    1 A B C

    2 A B 'blank'

    3 A 'blank' C

    4 'blank' B C

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • And if you have something like:

    1 A B C

    2 A D C

    3 A E C

    4 A B F

    Do you want to keep all of them, or...? If you don't keep them all, how do you While using real data isn't necessary, it tends to be a lot more intuitive if you supply "dummy" data, rather than just random letters.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • And if you have something like:

    1 A B C

    2 A D C

    3 A E C

    4 A B F

    Do you want to keep all of them, or...? If you don't keep them all, how do you merge the different values? If you can just merge the values and they aren't relational to the rows, why is the table set up like that in the first place? While using real data isn't necessary, it tends to be a lot more intuitive if you supply "dummy" data, rather than just random letters.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • thanks for your reply.

    To answer Jeff's question, it's yes. The table contains about 1 billion records and about 10 differect columns (COL1 to COL10).

    I'd like to delete the 4th record as well because Col2 = B and Col = C exist in the 1st record.

    To answer Seth's question, yes, i have to keep them all since each record is unique.

    For example)

    [font="Courier New"]RECID SOURCE NAME PHONE SSN USER_ID

    ---------------------------------------------------------------------------

    1 A4 JEFF 29489 98579 A535

    2 B2 JEFF 98579 A535

    3 G5 29489 98579 A535

    4 H9 JEFF 98868 98579 A535[/font]

    In the example above, there are four records for a same person (USER_ID = A535).

    I can delete RECID 2 and 4, since all the information (NAME, PHONE, SSN) that I can get from them exists in RECID 1.

    But I have to keep RECID 4 because it has new information about the person (a new phone number).

    Unfortunately, I can't aggregate / propagate to keep only the unique values since I need to keep where the record came from (SOURCE).

    Thanks a lot

  • I think you want to make multiple passes, remove those duplicates based on the combinations. Jeff might have a one pass method, but I think the simplicity might run quickly even with multiple passes.

    Is this a one time thing or is it a recurring item?

    Also, you have 1 and 2 as duplicates and 1 and 3 as duplicates, so are you saying that in your example, only row 1 remains?

  • Something I think needs to be asked, source is different in all the rows, why doesn't this matter? The data was stored for some reason originally, do you no longer care about source? If you do, why are you throwing out rows without considering the source column? If you don't why not remove the column?

  • Oh yeah, that's much better than letters =). Jeff might have been doing this for so long that he just intuitively grasps it all, but I still need my training wheels. What is the feasibility of changing this table structure to keep 1 row in this table for items that will always be unique to the employee (such as SSN and EmployeeID), and creating a second table to store phone numbers that can be a 1 to many relationship? If you could apply that type of formatting, that would likely make a huge impact on the amount of rows that table contains, and give you more normalized data to boot.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Thanks Steve,

    It should be a recurring process. Each week, I add about 3 million records to the big table but most of them are duplicates.

    Yes, in the first example, I will have only one record (RECID = 1).

    I am not quite sure what the pass method is, but since the table is big, the deduping process should be fast enough, so I can finish it over the weekend.

    Below is another example,

    create table CF_TEMP.dbo.TEMP (

    RECIDint identity(1,1),

    COL1varchar(1),

    COL2varchar(2),

    COL3varchar(3),

    USERIDint)

    insert into CF_TEMP.dbo.TEMP select 'A', 'B', 'C', 1

    insert into CF_TEMP.dbo.TEMP select 'A', 'B', '', 1

    insert into CF_TEMP.dbo.TEMP select 'A', '', 'C', 1

    insert into CF_TEMP.dbo.TEMP select 'F', '', 'C', 1

    insert into CF_TEMP.dbo.TEMP select '', 'M', '', 2

    insert into CF_TEMP.dbo.TEMP select 'T', 'M', 'O', 2

    insert into CF_TEMP.dbo.TEMP select '', 'M', 'O', 2

    insert into CF_TEMP.dbo.TEMP select 'X', 'M', 'O', 2

    USER_ID indicates they are for a same person. So I remove the duplicate within a same user_ID.

    RECIDCOL1COL2COL3USERID

    -----------------------------------------------

    1ABC1

    2AB1 -- DUPLICATE to RECID 1

    3AC1 -- DUPLICATE to RECID 1

    4FC1

    5M2 -- DUPLICATE to RECID 6

    6TMO2

    7MO2 -- DUPLICATE to RECID 7

    8XMO2

    In the end, I'd like to have only 4 records as below

    RECIDCOL1COL2COL3USERID

    -----------------------------------------------

    1ABC1

    4FC1

    6TMO2

    8XMO2

    Again, thanks a lot

  • How do you know which one to keep? The first one? Or the one that has all the fields?

    That's important otherwise you could say row 2 duplicates row 1.

    Multiple passes would be looking for a pattern like Row 2, where Col1 and 2 match, but Col3 is a null/blank. Then a second pattern would be col1 and 3 match, but 2 has a null/blank.

  • Thanks Steve,

    I need to keep a record that has more informationl like RECID 1.

    I need to keep both records in the example below.

    RECID COL1 COL2 COL3

    --------------------------------------

    1 A B (blank)

    2 (blank) B C

    It seems I can't use the multiple pass method because it will take too much time to process the big table.

    Thanks Jack and Seth,

    Unfortunately, I can't normalize the table.

    It should contain all the records. The purpose of this table is to assign the unique person ID.

    Let's say I have a record as below

    MARY JANE 123 MAIN ST 203-498-4982 498489889 1999-01-02 ... and many more

    later I receive another record

    MARY SMITH 456 SOUTH ST 245-364-4112 498489889 1999-01-02 ... and many more

    I need to find the records above are for a same person even though she changed her name, address and phone number and assign a same unique ID for both records.

    But most of cases, the records I recieve are duplicates or have lesser values as show below

    MARY JANE 123 MAIN ST 203-498-4982 1999-01-02 ... and many more

    It's not a pure duplicate since the third record doesn't have SSN populated but I have all the information from the 3rd record in the 1st record, so I don't have to keep it in the table.

    The reason why I need to keep the source column and keep every record separately is I need to do some audit and I should be able to recreate / refresh the person ID in case there is a problem with the input data.

    For exmple, the ETL group messed up a file and the first name and last name are reversed in the file received yesterday. I need to go back to the big table and should be able to correct the issue.

    thanks for your help! I really appreciate it.

  • It's doable in one step.

    See here...

    DECLARE@Sample TABLE

    (

    recID INT IDENTITY(1, 1),

    col1 VARCHAR(1),

    col2 VARCHAR(2),

    col3 VARCHAR(3),

    userID INT

    )

    INSERT@Sample

    SELECT'A', 'B', 'C', 1 UNION ALL

    SELECT'A', 'B', '', 1 UNION ALL

    SELECT'A', '', 'C', 1 UNION ALL

    SELECT'F', '', 'C', 1 UNION ALL

    SELECT'', 'M', '', 2 UNION ALL

    SELECT'T', 'M', 'O', 2 UNION ALL

    SELECT'', 'M', 'O', 2 UNION ALL

    SELECT'X', 'M', 'O', 2

    SELECT*

    FROM@Sample

    DELETEs

    FROM@Sample AS s

    INNER JOIN(

    SELECTMIN(recID) AS minID,

    col1,

    col2,

    col3,

    userID

    FROM@Sample

    WHERE'' NOT IN (col1, col2, col3)

    GROUP BYcol1,

    col2,

    col3,

    userID

    ) AS w ON w.userID = s.userID

    AND w.minID <> s.recID

    WHEREw.col1 = COALESCE(NULLIF(s.col1, ''), w.col1)

    AND w.col2 = COALESCE(NULLIF(s.col2, ''), w.col2)

    AND w.col3 = COALESCE(NULLIF(s.col3, ''), w.col3)

    SELECT*

    FROM@Sample


    N 56°04'39.16"
    E 12°55'05.25"

  • Did some work for a mailing shop some years ago, we used a method like this for the same thing:

    [font="Courier New"]DROP TABLE #Test

    CREATE TABLE #Test (RECID INT, COL1 CHAR(1), COL2 CHAR(1), COL3 CHAR(1))

    INSERT INTO #Test (RECID, COL1, COL2, COL3)

    SELECT 1, 'A', 'B', 'C' UNION ALL

    SELECT 2, 'A', 'B', ' ' UNION ALL

    SELECT 3, 'A', ' ', 'C' UNION ALL

    SELECT 4, ' ', 'B', 'C' UNION ALL

    SELECT 5, ' ', 'B', 'C'

    SELECT a.RECID AS RECIDmaster, b.RECID AS COL1BlankDupe, c.RECID AS COL2BlankDupe, d.RECID AS COL3BlankDupe

    FROM #Test a

    INNER JOIN #Test b

       ON a.RECID <> b.RECID

       AND (b.COL1 = ' ' AND a.COL1 <> ' ')

       AND b.COL2 = a.COL2

       AND b.COL3 = a.COL3

    INNER JOIN #Test c

       ON a.RECID <> c.RECID

       AND c.COL1 = a.COL1

       AND (c.COL2 = ' ' AND a.COL2 <> ' ')

       AND c.COL3 = a.COL3

    INNER JOIN #Test d

       ON a.RECID <> d.RECID

       AND d.COL1 = a.COL1

       AND d.COL2 = a.COL2

       AND (d.COL3 = ' ' AND a.COL3 <> ' ')

    ORDER BY a.RECID

    [/font]

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply