Assign Group Numbers to Sets of Records

  • I am working on some code that will identify duplicates in our database based off of Double Metaphone. What I am attempting to do now is assign a "Merge ID" to the records so that I can group the duplicates together. For instance, in the sample code I would like the merge_id for primary_ids 185799 and 77669 to be 1. For the remaining two I'd like the merge_id to be 2. As you can see it is numbering them 1 through 4. Can anyone point out what I am missing? (FYI - In my example I am not looking for the temp table and dropping it if it exists because when I included that it wouldn't let me post to the forum.)

    CREATE TABLE #MergeList

    (

    merge_idINT,

    primary_idINT,

    city1VARCHAR(30),

    city1_dmVARCHAR(5),

    city1_dm2VARCHAR(5),

    state1VARCHAR(30),

    firstnameVARCHAR(20),

    lastnameVARCHAR(20),

    firstname_dmVARCHAR(5),

    firstname_dm2VARCHAR(5),

    lastname_dmVARCHAR(5),

    lastname_dm2VARCHAR(5),

    birthdayDATETIME

    )

    INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,

    lastname_dm2, birthday) VALUES (185799, 'WYKOFF', 'AKF', 'FKF', 'MN', 'RANDY', 'MEYER', 'RNT', NULL, 'MR', NULL, '1959-09-25 00:00:00.000')

    INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,

    lastname_dm2, birthday) VALUES (77669, 'WYKOFF', 'AKF', 'FKF', 'MN', 'rANDY', 'mEYER', 'RNT', NULL, 'MR', NULL, '1959-09-25 00:00:00.000')

    INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,

    lastname_dm2, birthday) VALUES (120928, 'Algona', 'ALKN', NULL, 'IA', 'Jon', 'Menekee', 'JN', 'AN', 'MNK', NULL, '1982-05-20 00:00:00.000')

    INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,

    lastname_dm2, birthday) VALUES (167953, 'ALGONA', 'ALKN', NULL, 'IA', 'JON', 'MENNEKE', 'JN', 'AN', 'MNK', NULL, '1982-05-20 00:00:00.000')

    DECLARE@primary_idINT

    DECLARE@city1_dmVARCHAR(5)

    DECLARE@city1_dm2VARCHAR(5)

    DECLARE@firstname_dmVARCHAR(5)

    DECLARE@firstname_dm2VARCHAR(5)

    DECLARE@lastname_dmVARCHAR(5)

    DECLARE@lastname_dm2VARCHAR(5)

    DECLARE@birthdayDATETIME

    DECLARE @MergeGroup_IDINT

    --Initialize MergeGroup_ID

    IF (SELECT ISNULL(MAX(merge_id),0) FROM #MergeList) = 0

    SET @MergeGroup_ID = 1

    ELSE

    SET @MergeGroup_ID = (SELECT MAX(merge_id) +1 FROM #MergeList)

    SET NOCOUNT ON

    DECLARE crs_merge_group CURSOR FOR

    SELECT

    primary_id, city1_dm, city1_dm2, firstname_dm, firstname_dm2, lastname_dm, lastname_dm2, birthday

    FROM

    #MergeList

    WHERE primary_id IN (185799, 77669, 167953, 120928)

    OPEN crs_merge_group

    FETCH NEXT FROM crs_merge_group

    INTO @primary_id, @city1_dm, @city1_dm2, @firstname_dm, @firstname_dm2, @lastname_dm, @lastname_dm2, @birthday

    IF @@FETCH_STATUS <> 0 PRINT 'DEBUG:can''t open cursor'

    WHILE (@@FETCH_STATUS = 0)

    BEGIN

    IF EXISTS

    (SELECT primary_id

    FROM

    #MergeList

    WHERE

    city1_dm = @city1_dm

    AND city1_dm2 = @city1_dm2

    AND firstname_dm = @firstname_dm

    AND firstname_dm2 = @firstname_dm2

    AND lastname_dm= @lastname_dm

    AND lastname_dm2 = @lastname_dm2

    AND birthday = @birthday

    AND merge_id IS NOT NULL

    )

    BEGIN

    UPDATE #MergeList

    SET merge_id =

    (

    SELECT TOP 1

    merge_id

    FROM

    #MergeList

    WHERE

    city1_dm = @city1_dm

    AND city1_dm2 = @city1_dm2

    AND firstname_dm = @firstname_dm

    AND firstname_dm2 = @firstname_dm2

    AND lastname_dm= @lastname_dm

    AND lastname_dm2 = @lastname_dm2

    AND birthday = @birthday

    AND merge_id IS NOT NULL

    )

    WHERE CURRENT OF crs_merge_group

    END

    ELSE

    UPDATE #MergeList

    SET merge_id = @MergeGroup_ID

    WHERE CURRENT OF crs_merge_group

    SET @MergeGroup_ID = @MergeGroup_ID +1

    FETCH NEXT FROM crs_merge_group

    INTO @primary_id, @city1_dm, @city1_dm2, @firstname_dm, @firstname_dm2, @lastname_dm, @lastname_dm2, @birthday

    END

    CLOSE crs_merge_group

    DEALLOCATE crs_merge_group

    SELECT * FROM #MergeList

  • ehlinger, you don't need a cursor. You could try something like this...

    IF OBJECT_ID('TempDB..#Dupes','u') IS NOT NULL

    DROP TABLE #Dupes

    SELECT DISTINCT

    t1.city1,

    t1.city1_dm,

    t1.city1_dm2,

    t1.state1,

    t1.firstname,

    t1.lastname,

    t1.firstname_dm,

    t1.firstname_dm2,

    t1.lastname_dm,

    t1.lastname_dm2,

    t1.birthday

    INTO #Dupes

    FROM #Mergelist t1 INNER JOIN #MergeList t2

    ON ISNULL(t1.city1,'') = ISNULL(t2.city1,'')

    AND ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')

    AND ISNULL(t1.city1_dm2,'') = ISNULL(t2.city1_dm2,'')

    AND ISNULL(t1.state1,'') = ISNULL(t2.state1,'')

    AND ISNULL(t1.firstname,'') = ISNULL(t2.firstname,'')

    AND ISNULL(t1.lastname,'') = ISNULL(t2.lastname,'')

    AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')

    AND ISNULL(t1.firstname_dm2,'') = ISNULL(t2.firstname_dm2,'')

    AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')

    AND ISNULL(t1.lastname_dm2,'') = ISNULL(t2.lastname_dm2,'')

    AND ISNULL(t1.birthday,'') = ISNULL(t2.birthday,'')

    SELECT * FROM #Dupes

    ALTER TABLE #Dupes

    ADD MergeID INT IDENTITY(1,1)

    SELECT * FROM #Dupes

    Please note that your sample data has two different spellings of one of the last names, 'Menekee', and 'MENNEKE'. I changed them to be the same assuming this was a typo. However, if they really do have two different spellings, then as far as SQL is concerned, they are not dupes. You can comment out any of the statements in the join criteria to over come this problem. Let us know if you have questions.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • one more step from Greg's code is to update the #mergelist table mergeid with the identity from the dups table joined on the double metaphone fields


  • mrpolecat (12/18/2009)


    one more step from Greg's code is to update the #mergelist table mergeid with the identity from the dups table joined on the double metaphone fields

    DOH! I totally missed the point I think. Coincidentally I was just beginning to read Jeff Moden's re-write of his running total article, found here, http://www.sqlservercentral.com/articles/T-SQL/68467/, so although this is not a running total problem, it is similar in how it can be solved. So, try the below code. I strongly suggest you read the article, since I can't begin to explain how the update works better than Jeff does. And, admitedly, I have only read the first few paragraphs so far, so I am going on what I remember from the first article. Also, I only included first name last name, and birthday in the update, since it would have been a lot more typing when we did not need all the fields for your sample data. However, your real data may neccessitate your pulling in more fields, if you have, say, two John Smiths with the same birthday in two different cities, and they both have dupe records. You should be able to add fields as needed. I'm sure I am missing something since the sample data is small, so I'm going to continue to mess with it, and will let you know if I come accross any "gotcha's". Let us know if you have questions.

    IF OBJECT_ID('TempDB..#Dupes','u') IS NOT NULL

    DROP TABLE #Dupes

    SELECT

    t1.primary_id,

    t1.city1,

    t1.city1_dm,

    t1.city1_dm2,

    t1.state1,

    t1.firstname,

    t1.lastname,

    t1.firstname_dm,

    t1.firstname_dm2,

    t1.lastname_dm,

    t1.lastname_dm2,

    t1.birthday

    INTO #Dupes

    FROM #Mergelist t1 INNER JOIN #MergeList t2

    ON ISNULL(t1.city1,'') = ISNULL(t2.city1,'')

    AND ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')

    AND ISNULL(t1.city1_dm2,'') = ISNULL(t2.city1_dm2,'')

    AND ISNULL(t1.state1,'') = ISNULL(t2.state1,'')

    AND ISNULL(t1.firstname,'') = ISNULL(t2.firstname,'')

    AND ISNULL(t1.lastname,'') = ISNULL(t2.lastname,'')

    AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')

    AND ISNULL(t1.firstname_dm2,'') = ISNULL(t2.firstname_dm2,'')

    AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')

    AND ISNULL(t1.lastname_dm2,'') = ISNULL(t2.lastname_dm2,'')

    AND ISNULL(t1.birthday,'') = ISNULL(t2.birthday,'')

    AND t1.primary_id <> t2.primary_id

    ORDER BY t1.lastname,t1.firstname,t1.primary_id

    --Add a clustered index to ensure the order of the update

    --to the MergeID column. Ultimately, all columns other

    --than the primary_id should be the same for every set of

    --dupes, so you really just need to decide which ones to

    --include with the primary_id to ensure dupe rows stay

    --together, since ordering by primary_id alone would not

    --keep the dupes together. You could very well have two

    --sets of dupes where all fields are the same except for

    --birthday, or city. You will need to decide which fields

    --will cover all your bases. I left is simple for ease

    --of understanding.

    CREATE CLUSTERED INDEX IXC_primary_id_last_first

    ON #Dupes (lastname,firstname,primary_id)

    ALTER TABLE #Dupes

    ADD MergeID INT

    GO

    DECLARE @firstname VARCHAR(20)

    DECLARE @lastname VARCHAR(20)

    DECLARE @birthday DATETIME

    DECLARE @MergeID INT

    SET @firstname = ''

    SET @lastname = ''

    SET @birthday = ''

    SET @MergeID = 0

    UPDATE #Dupes

    SET @MergeID = CASE WHEN @firstname = firstname

    AND @lastname = lastname

    AND @birthday = birthday

    THEN @MergeID ELSE @MergeID + 1 END,

    @firstname = firstname,

    @lastname = lastname,

    @birthday = birthday,

    MergeID = @MergeID

    SELECT

    MergeID,

    primary_id,

    firstname,

    lastname

    FROM #Dupes

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Just checkin in here. I took your records for Jon Menneke, and duplicated them, but changed the birthday and primary_id, just to simulate what I was talking about by having sets of similar dupes with one field different. The below is extra test data I tried. For the birthday, I added 1 day, and for the primary_id, I changed the second digit to a 4.

    INSERT INTO #MergeList (primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,

    lastname_dm2, birthday) VALUES (140928, 'Algona', 'ALKN', NULL, 'IA', 'Jon', 'Menneke', 'JN', 'AN', 'MNK', NULL, '1982-05-21 00:00:00.000')

    INSERT INTO #MergeList (primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,

    lastname_dm2, birthday) VALUES (147953, 'ALGONA', 'ALKN', NULL, 'IA', 'JON', 'MENNEKE', 'JN', 'AN', 'MNK', NULL, '1982-05-21 00:00:00.000')

    And the changed index needed to make it work. I also tried it without an ORDER BY on the INSERT, and it still worked like a charm. If you have a large amount of columns in addition to what you posted as sample data, then you may be limited by the maximum size of the index.

    CREATE CLUSTERED INDEX IXC_lastname_firstname_bday_primary_id

    ON #Dupes (lastname,firstname,birthday,primary_id)

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • I don't know where the OP is but from the original request it loks like you need the index on the dm and dm2 fields. Those are is Dual Metaphone fileds that he is using to declare duplicates along with the birthday field.I looked at using Jeff's "quirky update" before you posted your original response. I think your first way is easier by selecting the unique fields used for the duping into a temp table with an identity field and then joining that back to the original table to update the merge Id with the identity field from the dup table. That will easily satisfy the requirement of

    "What I am attempting to do now is assign a "Merge ID" to the records so that I can group the duplicates together. For instance, in the sample code I would like the merge_id for primary_ids 185799 and 77669 to be 1. For the remaining two I'd like the merge_id to be 2."


  • Yeah, I was not too clear on which columns would determine a dupe, so I thought I would just include them all, and then the OP can comment out what is not needed. As for the quirky update, I only did it because my SQL has been getting rusty these past several months, so it looked like some good exercise. Anyhow, I guess we'll just have to wait and see if the problem is solved or not.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Thanks for the input guys. I was working on another issue today and just came out and looked to see if any posts had been made. (I didn't receive my alerts...) Anyway, I will look at all of this and work on it for the remainder of the day and over the weekend and will post what I come up with.

    The misspellings of the name is correct because I'm matching on the dm fields.

  • ehlinger (12/18/2009)


    The misspellings of the name is correct because I'm matching on the dm fields.

    Okay, well if that is the case, this simpler version should help. I put the mis-spelling back, and it still worked when matching only on the dm fields. What about dm2?

    IF OBJECT_ID('TempDB..#Dupes','u') IS NOT NULL

    DROP TABLE #Dupes

    SELECT DISTINCT

    t1.primary_id,

    t1.city1_dm,

    t1.firstname_dm,

    t1.lastname_dm

    INTO #Dupes

    FROM #Mergelist t1 INNER JOIN #MergeList t2

    ON ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')

    AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')

    AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')

    ORDER BY t1.lastname_dm,t1.firstname_dm,t1.city1_dm,t1.primary_id

    CREATE CLUSTERED INDEX IXC_lastnameDM_firstnameDM_city1DM_primary_id

    ON #Dupes (lastname_dm,firstname_dm,city1_dm,primary_id)

    ALTER TABLE #Dupes

    ADD MergeID INT

    GO

    DECLARE @city1_dm VARCHAR(5)

    DECLARE @lastname_dm VARCHAR(5)

    DECLARE @firstname_dm VARCHAR(5)

    DECLARE @MergeID INT

    SET @city1_dm = ''

    SET @lastname_dm = ''

    SET @firstname_dm = ''

    SET @MergeID = 0

    UPDATE #Dupes

    SET @MergeID = CASE WHEN @city1_dm = city1_dm

    AND @lastname_dm = lastname_dm

    AND @firstname_dm = firstname_dm

    THEN @MergeID ELSE @MergeID + 1 END,

    @city1_dm = city1_dm,

    @lastname_dm = lastname_dm,

    @firstname_dm = firstname_dm,

    MergeID = @MergeID

    SELECT

    MergeID,

    primary_id,

    city1_dm,

    lastname_dm,

    firstname_dm

    FROM #Dupes

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • ehlinger, I apologize for any confusion, but I just took the time to finish reading Jeff's article, and I realized I have omitted two important aspects; locking the table, and preventing paralellism from affecting the update. Use the below for the update, all the rest is the same. Notice the inclusion of the FROM clause and the MAXDOP option.

    UPDATE #Dupes

    SET @MergeID = CASE WHEN @city1_dm = city1_dm

    AND @lastname_dm = lastname_dm

    AND @firstname_dm = firstname_dm

    THEN @MergeID ELSE @MergeID + 1 END,

    @city1_dm = city1_dm,

    @lastname_dm = lastname_dm,

    @firstname_dm = firstname_dm,

    MergeID = @MergeID

    FROM #Dupes WITH (TABLOCKX)

    OPTION (MAXDOP 1)

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Greg,

    No need to apologize. Thank you for all of your help. Works like a charm with the test data. I added more fields to match on and in actuality may need to set up a more complex join on those but I'll tackle that later if I have to. Getting ready to run it on some actual data and think it should be fine. I've included the end result. Thanks again.

    Side note: Still can't post code with the word drop in it. It won't preview or post. I noticed you had no problem and just thought I'd ask if you are doing anything special.

    SELECT DISTINCT

    t1.primary_id,

    t1.city1,

    t1.city1_dm,

    t1.city1_dm2,

    t1.firstname,

    t1.firstname_dm,

    t1.firstname_dm2,

    t1.lastname,

    t1.lastname_dm,

    t1.lastname_dm2,

    t1.birthday

    INTO #Dupes

    FROM #Mergelist t1

    INNER JOIN #MergeList t2 ON ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')

    AND ISNULL(t1.city1_dm2,'') = ISNULL(t2.city1_dm2,'')

    AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')

    AND ISNULL(t1.firstname_dm2,'') = ISNULL(t2.firstname_dm2,'')

    AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')

    AND ISNULL(t1.lastname_dm2,'') = ISNULL(t2.lastname_dm2,'')

    AND t1.birthday = t2.birthday

    ORDER BY

    t1.city1_dm,

    t1.city1_dm2,

    t1.firstname_dm,

    t1.firstname_dm2,

    t1.lastname_dm,

    t1.lastname_dm2,

    t1.birthday,

    t1.primary_id

    CREATE CLUSTERED INDEX IXC_lastnameDM_firstnameDM_city1DM_primary_id

    ON #Dupes (city1_dm, city1_dm2, firstname_dm, firstname_dm2, lastname_dm, lastname_dm2, birthday, primary_id)

    ALTER TABLE #Dupes

    ADD MergeID INT

    GO

    DECLARE @city1_dmVARCHAR(5)

    DECLARE @city1_dm2VARCHAR(5)

    DECLARE @lastname_dmVARCHAR(5)

    DECLARE @lastname_dm2VARCHAR(5)

    DECLARE @firstname_dmVARCHAR(5)

    DECLARE @firstname_dm2VARCHAR(5)

    DECLARE @birthdayDATETIME

    DECLARE @MergeIDINT

    SET @city1_dm = ''

    SET @city1_dm2 = ''

    SET @firstname_dm = ''

    SET @firstname_dm2 = ''

    SET @lastname_dm = ''

    SET @lastname_dm2 = ''

    SET @birthday = ''

    SET @MergeID = 0

    UPDATE #Dupes

    SET @MergeID =

    CASE WHEN @city1_dm = ISNULL(city1_dm, '')

    AND @city1_dm2 = ISNULL(city1_dm2, '')

    AND @firstname_dm = ISNULL(firstname_dm, '')

    AND @firstname_dm2 = ISNULL(firstname_dm2, '')

    AND @lastname_dm = ISNULL(lastname_dm, '')

    AND @lastname_dm2 = ISNULL(lastname_dm2, '')

    AND @birthday = birthday

    THEN @MergeID ELSE @MergeID + 1 END,

    @city1_dm = ISNULL(city1_dm, ''),

    @city1_dm2 = ISNULL(city1_dm2, ''),

    @firstname_dm = ISNULL(firstname_dm, ''),

    @firstname_dm2 = ISNULL(firstname_dm2, ''),

    @lastname_dm = ISNULL(lastname_dm, ''),

    @lastname_dm2 = ISNULL(lastname_dm2, ''),

    @birthday = birthday,

    MergeID = @MergeID

    FROM #Dupes WITH (TABLOCKX)

    OPTION (MAXDOP 1)

    SELECT *

    FROM #Dupes

  • ehlinger (12/21/2009)


    Side note: Still can't post code with the word drop in it. It won't preview or post. I noticed you had no problem and just thought I'd ask if you are doing anything special.

    DROP this

    DROP that

    DROP the other

    I don't know what to tell you, I have never heard of your problem. It may matter which browser you are using. Also, if you are using a local temp table, you probably don't need to worry about locking it, but it may not hurt to get in the habit. Thanks for the feedback, and I'm glad I could help.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply