Merging duplicate records into one record

  • I have a set of emp records that contain duplicate entries. The problem with this duplicate is that they may point to information that is not contained in the other duplicated entry. For example, we keep track of United Fund donations and one record may have three donations, while the other record may have one donation. One record may contain the spouse information, where the other doesnt.

    I have flagged the duplicate entries, but I don't want to delete the duplicate, I want to merge into one record, then purge the entries that were duplicated. I want to add the spouse info, if available, and all the donation information from all records.

     

    I need help with the SQL that can do this without extensive parsing and processing since we have 100,000 records that fit into this category.

     

     

  • Can you post some example data?

  • table 1 employee contains empID, prefix, fname, mname, lname, addr1, addr2, addr3, city, state, zip, highDonation, totaldonation, first date, last date

    table 2 contains empID, donation, date, amount

    table 3 contains empID, letterkeycode, datesent

    I have replaceID as a field in table 1 which is null (will contain the new id of the merged record)

    duplicate is in table 1 on lname, addr1, city

    fname may contain duplicate, or contain a spouse name

    trick is to merge table 1, then merge table 2, and table3

  • I think the first thing you need to do is decide the order of precedence.

    For example, if you have a record in Table 1, Table 2 and Table 3 and the value for spouse name is different in each, which one should you choose???

    I would then update your records based on the order of precedence.  Eg.  If your order is like Table 1 = 1st Choice, Table 2 = 2nd Choice, Table 3 = 3rd Choice, I would then write a series of UPDATE's like:

    UPDATE Table1

    SET Table1.[Name] = (SELECT Table3.[Name] FROM Table3

                                  INNER JOIN Table1 ON Table3.[ID] = Table1.[ID])

    WHERE Table1.[Name] is NULL

    Then repeat the process replacing Table3 in the update with Table2.  After doing this you should be able to delete the tables with your duplicates.

    Someone else may have a more elegant way of doing this.

    Cheers,

    Angela

  • Actually, this is missing the question. The duplicate records are in table 1. Table 2 and 3 contain suplimentary information pointed to by Table 1. The order of precident is to select the record in Table 1 and then merge the records pointed to by table 1 into records pointed to by the newly merged record.

  • Okay what about this:

    -- First we get the all the records and any duplicates are given an id to tie them to the first record found with a matching lname, addr1 and city.

    SELECT DISTINCT(MAX(ta1.empid)) AS BadEmpid, t1.empid, t1.prefix, t1.fname, t1.mname, t1.lname, t1.addr1, t1.addr2, t1.addr3, t1.city, t1.state, t1.zip,t1.highdonation, t1.totaldonation, t1.firstdate, t1.lastdate

    INTO #TEMP from table1 ta1

    INNER JOIN Table1 T1 ON Ta1.lname = T1.Lname

    AND Ta1.addr1 = T1.addr1 AND Ta1.city = T1.city

    GROUP BY t1.empid,t1.prefix, t1.fname, t1.mname, t1.lname, t1.addr1, t1.addr2, t1.addr3, t1.city, t1.state, t1.zip,t1.highdonation, t1.totaldonation, t1.firstdate, t1.lastdate

    HAVING COUNT (t1.lname) >1

    ORDER BY t1.lname

    --  Then we update the original table.  Adding the details from the duplicate records to the first one found as we go.  This is based on a match on lname, addr1 and city.  This will not match simalar records, they must be a complete match.

    UPDATE table1

    SET table1.prefix = #temp.prefix,

    table1.fname = #temp.fname,

    table1.mname = #temp.mname,

    table1.lname =  #temp.lname,

    table1.addr1 = #temp.addr1,

    table1.addr2 = #temp.addr2,

    table1.addr3 = #temp.addr3,

    table1.city =  #temp.city,

    table1.state = #temp.state,

    table1.zip = #temp.zip,

    table1.highdonation = #temp.highdonation,

    table1.totaldonation = #temp.totaldonation,

    table1.firstdate = #temp.firstdate,

    table1.lastdate = #temp.lastdate

    FROM #temp

    INNER JOIN table1 ON (table1.lname = #temp.lname

                                  AND table1.addr1 = #temp.addr1

                                  AND table1.city = #temp.city )

    WHERE   table1.empid <> #temp.badempid

     

    Is this more useful?

    Cheers,

    Angela

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply