December 17, 2009 at 2:54 pm
I am working on some code that will identify duplicates in our database based off of Double Metaphone. What I am attempting to do now is assign a "Merge ID" to the records so that I can group the duplicates together. For instance, in the sample code I would like the merge_id for primary_ids 185799 and 77669 to be 1. For the remaining two I'd like the merge_id to be 2. As you can see it is numbering them 1 through 4. Can anyone point out what I am missing? (FYI - In my example I am not looking for the temp table and dropping it if it exists because when I included that it wouldn't let me post to the forum.)
CREATE TABLE #MergeList
(
merge_idINT,
primary_idINT,
city1VARCHAR(30),
city1_dmVARCHAR(5),
city1_dm2VARCHAR(5),
state1VARCHAR(30),
firstnameVARCHAR(20),
lastnameVARCHAR(20),
firstname_dmVARCHAR(5),
firstname_dm2VARCHAR(5),
lastname_dmVARCHAR(5),
lastname_dm2VARCHAR(5),
birthdayDATETIME
)
INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,
lastname_dm2, birthday) VALUES (185799, 'WYKOFF', 'AKF', 'FKF', 'MN', 'RANDY', 'MEYER', 'RNT', NULL, 'MR', NULL, '1959-09-25 00:00:00.000')
INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,
lastname_dm2, birthday) VALUES (77669, 'WYKOFF', 'AKF', 'FKF', 'MN', 'rANDY', 'mEYER', 'RNT', NULL, 'MR', NULL, '1959-09-25 00:00:00.000')
INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,
lastname_dm2, birthday) VALUES (120928, 'Algona', 'ALKN', NULL, 'IA', 'Jon', 'Menekee', 'JN', 'AN', 'MNK', NULL, '1982-05-20 00:00:00.000')
INSERT INTO #MergeList(primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,
lastname_dm2, birthday) VALUES (167953, 'ALGONA', 'ALKN', NULL, 'IA', 'JON', 'MENNEKE', 'JN', 'AN', 'MNK', NULL, '1982-05-20 00:00:00.000')
DECLARE@primary_idINT
DECLARE@city1_dmVARCHAR(5)
DECLARE@city1_dm2VARCHAR(5)
DECLARE@firstname_dmVARCHAR(5)
DECLARE@firstname_dm2VARCHAR(5)
DECLARE@lastname_dmVARCHAR(5)
DECLARE@lastname_dm2VARCHAR(5)
DECLARE@birthdayDATETIME
DECLARE @MergeGroup_IDINT
--Initialize MergeGroup_ID
IF (SELECT ISNULL(MAX(merge_id),0) FROM #MergeList) = 0
SET @MergeGroup_ID = 1
ELSE
SET @MergeGroup_ID = (SELECT MAX(merge_id) +1 FROM #MergeList)
SET NOCOUNT ON
DECLARE crs_merge_group CURSOR FOR
SELECT
primary_id, city1_dm, city1_dm2, firstname_dm, firstname_dm2, lastname_dm, lastname_dm2, birthday
FROM
#MergeList
WHERE primary_id IN (185799, 77669, 167953, 120928)
OPEN crs_merge_group
FETCH NEXT FROM crs_merge_group
INTO @primary_id, @city1_dm, @city1_dm2, @firstname_dm, @firstname_dm2, @lastname_dm, @lastname_dm2, @birthday
IF @@FETCH_STATUS <> 0 PRINT 'DEBUG:can''t open cursor'
WHILE (@@FETCH_STATUS = 0)
BEGIN
IF EXISTS
(SELECT primary_id
FROM
#MergeList
WHERE
city1_dm = @city1_dm
AND city1_dm2 = @city1_dm2
AND firstname_dm = @firstname_dm
AND firstname_dm2 = @firstname_dm2
AND lastname_dm= @lastname_dm
AND lastname_dm2 = @lastname_dm2
AND birthday = @birthday
AND merge_id IS NOT NULL
)
BEGIN
UPDATE #MergeList
SET merge_id =
(
SELECT TOP 1
merge_id
FROM
#MergeList
WHERE
city1_dm = @city1_dm
AND city1_dm2 = @city1_dm2
AND firstname_dm = @firstname_dm
AND firstname_dm2 = @firstname_dm2
AND lastname_dm= @lastname_dm
AND lastname_dm2 = @lastname_dm2
AND birthday = @birthday
AND merge_id IS NOT NULL
)
WHERE CURRENT OF crs_merge_group
END
ELSE
UPDATE #MergeList
SET merge_id = @MergeGroup_ID
WHERE CURRENT OF crs_merge_group
SET @MergeGroup_ID = @MergeGroup_ID +1
FETCH NEXT FROM crs_merge_group
INTO @primary_id, @city1_dm, @city1_dm2, @firstname_dm, @firstname_dm2, @lastname_dm, @lastname_dm2, @birthday
END
CLOSE crs_merge_group
DEALLOCATE crs_merge_group
SELECT * FROM #MergeList
December 18, 2009 at 10:08 am
ehlinger, you don't need a cursor. You could try something like this...
IF OBJECT_ID('TempDB..#Dupes','u') IS NOT NULL
DROP TABLE #Dupes
SELECT DISTINCT
t1.city1,
t1.city1_dm,
t1.city1_dm2,
t1.state1,
t1.firstname,
t1.lastname,
t1.firstname_dm,
t1.firstname_dm2,
t1.lastname_dm,
t1.lastname_dm2,
t1.birthday
INTO #Dupes
FROM #Mergelist t1 INNER JOIN #MergeList t2
ON ISNULL(t1.city1,'') = ISNULL(t2.city1,'')
AND ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')
AND ISNULL(t1.city1_dm2,'') = ISNULL(t2.city1_dm2,'')
AND ISNULL(t1.state1,'') = ISNULL(t2.state1,'')
AND ISNULL(t1.firstname,'') = ISNULL(t2.firstname,'')
AND ISNULL(t1.lastname,'') = ISNULL(t2.lastname,'')
AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')
AND ISNULL(t1.firstname_dm2,'') = ISNULL(t2.firstname_dm2,'')
AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')
AND ISNULL(t1.lastname_dm2,'') = ISNULL(t2.lastname_dm2,'')
AND ISNULL(t1.birthday,'') = ISNULL(t2.birthday,'')
SELECT * FROM #Dupes
ALTER TABLE #Dupes
ADD MergeID INT IDENTITY(1,1)
SELECT * FROM #Dupes
Please note that your sample data has two different spellings of one of the last names, 'Menekee', and 'MENNEKE'. I changed them to be the same assuming this was a typo. However, if they really do have two different spellings, then as far as SQL is concerned, they are not dupes. You can comment out any of the statements in the join criteria to over come this problem. Let us know if you have questions.
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
December 18, 2009 at 10:27 am
December 18, 2009 at 12:34 pm
mrpolecat (12/18/2009)
one more step from Greg's code is to update the #mergelist table mergeid with the identity from the dups table joined on the double metaphone fields
DOH! I totally missed the point I think. Coincidentally I was just beginning to read Jeff Moden's re-write of his running total article, found here, http://www.sqlservercentral.com/articles/T-SQL/68467/, so although this is not a running total problem, it is similar in how it can be solved. So, try the below code. I strongly suggest you read the article, since I can't begin to explain how the update works better than Jeff does. And, admitedly, I have only read the first few paragraphs so far, so I am going on what I remember from the first article. Also, I only included first name last name, and birthday in the update, since it would have been a lot more typing when we did not need all the fields for your sample data. However, your real data may neccessitate your pulling in more fields, if you have, say, two John Smiths with the same birthday in two different cities, and they both have dupe records. You should be able to add fields as needed. I'm sure I am missing something since the sample data is small, so I'm going to continue to mess with it, and will let you know if I come accross any "gotcha's". Let us know if you have questions.
IF OBJECT_ID('TempDB..#Dupes','u') IS NOT NULL
DROP TABLE #Dupes
SELECT
t1.primary_id,
t1.city1,
t1.city1_dm,
t1.city1_dm2,
t1.state1,
t1.firstname,
t1.lastname,
t1.firstname_dm,
t1.firstname_dm2,
t1.lastname_dm,
t1.lastname_dm2,
t1.birthday
INTO #Dupes
FROM #Mergelist t1 INNER JOIN #MergeList t2
ON ISNULL(t1.city1,'') = ISNULL(t2.city1,'')
AND ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')
AND ISNULL(t1.city1_dm2,'') = ISNULL(t2.city1_dm2,'')
AND ISNULL(t1.state1,'') = ISNULL(t2.state1,'')
AND ISNULL(t1.firstname,'') = ISNULL(t2.firstname,'')
AND ISNULL(t1.lastname,'') = ISNULL(t2.lastname,'')
AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')
AND ISNULL(t1.firstname_dm2,'') = ISNULL(t2.firstname_dm2,'')
AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')
AND ISNULL(t1.lastname_dm2,'') = ISNULL(t2.lastname_dm2,'')
AND ISNULL(t1.birthday,'') = ISNULL(t2.birthday,'')
AND t1.primary_id <> t2.primary_id
ORDER BY t1.lastname,t1.firstname,t1.primary_id
--Add a clustered index to ensure the order of the update
--to the MergeID column. Ultimately, all columns other
--than the primary_id should be the same for every set of
--dupes, so you really just need to decide which ones to
--include with the primary_id to ensure dupe rows stay
--together, since ordering by primary_id alone would not
--keep the dupes together. You could very well have two
--sets of dupes where all fields are the same except for
--birthday, or city. You will need to decide which fields
--will cover all your bases. I left is simple for ease
--of understanding.
CREATE CLUSTERED INDEX IXC_primary_id_last_first
ON #Dupes (lastname,firstname,primary_id)
ALTER TABLE #Dupes
ADD MergeID INT
GO
DECLARE @firstname VARCHAR(20)
DECLARE @lastname VARCHAR(20)
DECLARE @birthday DATETIME
DECLARE @MergeID INT
SET @firstname = ''
SET @lastname = ''
SET @birthday = ''
SET @MergeID = 0
UPDATE #Dupes
SET @MergeID = CASE WHEN @firstname = firstname
AND @lastname = lastname
AND @birthday = birthday
THEN @MergeID ELSE @MergeID + 1 END,
@firstname = firstname,
@lastname = lastname,
@birthday = birthday,
MergeID = @MergeID
SELECT
MergeID,
primary_id,
firstname,
lastname
FROM #Dupes
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
December 18, 2009 at 1:47 pm
Just checkin in here. I took your records for Jon Menneke, and duplicated them, but changed the birthday and primary_id, just to simulate what I was talking about by having sets of similar dupes with one field different. The below is extra test data I tried. For the birthday, I added 1 day, and for the primary_id, I changed the second digit to a 4.
INSERT INTO #MergeList (primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,
lastname_dm2, birthday) VALUES (140928, 'Algona', 'ALKN', NULL, 'IA', 'Jon', 'Menneke', 'JN', 'AN', 'MNK', NULL, '1982-05-21 00:00:00.000')
INSERT INTO #MergeList (primary_id, city1, city1_dm, city1_dm2, state1, firstname, lastname, firstname_dm, firstname_dm2, lastname_dm,
lastname_dm2, birthday) VALUES (147953, 'ALGONA', 'ALKN', NULL, 'IA', 'JON', 'MENNEKE', 'JN', 'AN', 'MNK', NULL, '1982-05-21 00:00:00.000')
And the changed index needed to make it work. I also tried it without an ORDER BY on the INSERT, and it still worked like a charm. If you have a large amount of columns in addition to what you posted as sample data, then you may be limited by the maximum size of the index.
CREATE CLUSTERED INDEX IXC_lastname_firstname_bday_primary_id
ON #Dupes (lastname,firstname,birthday,primary_id)
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
December 18, 2009 at 1:58 pm
I don't know where the OP is but from the original request it loks like you need the index on the dm and dm2 fields. Those are is Dual Metaphone fileds that he is using to declare duplicates along with the birthday field.I looked at using Jeff's "quirky update" before you posted your original response. I think your first way is easier by selecting the unique fields used for the duping into a temp table with an identity field and then joining that back to the original table to update the merge Id with the identity field from the dup table. That will easily satisfy the requirement of
"What I am attempting to do now is assign a "Merge ID" to the records so that I can group the duplicates together. For instance, in the sample code I would like the merge_id for primary_ids 185799 and 77669 to be 1. For the remaining two I'd like the merge_id to be 2."
December 18, 2009 at 2:11 pm
Yeah, I was not too clear on which columns would determine a dupe, so I thought I would just include them all, and then the OP can comment out what is not needed. As for the quirky update, I only did it because my SQL has been getting rusty these past several months, so it looked like some good exercise. Anyhow, I guess we'll just have to wait and see if the problem is solved or not.
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
December 18, 2009 at 3:38 pm
Thanks for the input guys. I was working on another issue today and just came out and looked to see if any posts had been made. (I didn't receive my alerts...) Anyway, I will look at all of this and work on it for the remainder of the day and over the weekend and will post what I come up with.
The misspellings of the name is correct because I'm matching on the dm fields.
December 21, 2009 at 8:22 am
ehlinger (12/18/2009)
The misspellings of the name is correct because I'm matching on the dm fields.
Okay, well if that is the case, this simpler version should help. I put the mis-spelling back, and it still worked when matching only on the dm fields. What about dm2?
IF OBJECT_ID('TempDB..#Dupes','u') IS NOT NULL
DROP TABLE #Dupes
SELECT DISTINCT
t1.primary_id,
t1.city1_dm,
t1.firstname_dm,
t1.lastname_dm
INTO #Dupes
FROM #Mergelist t1 INNER JOIN #MergeList t2
ON ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')
AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')
AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')
ORDER BY t1.lastname_dm,t1.firstname_dm,t1.city1_dm,t1.primary_id
CREATE CLUSTERED INDEX IXC_lastnameDM_firstnameDM_city1DM_primary_id
ON #Dupes (lastname_dm,firstname_dm,city1_dm,primary_id)
ALTER TABLE #Dupes
ADD MergeID INT
GO
DECLARE @city1_dm VARCHAR(5)
DECLARE @lastname_dm VARCHAR(5)
DECLARE @firstname_dm VARCHAR(5)
DECLARE @MergeID INT
SET @city1_dm = ''
SET @lastname_dm = ''
SET @firstname_dm = ''
SET @MergeID = 0
UPDATE #Dupes
SET @MergeID = CASE WHEN @city1_dm = city1_dm
AND @lastname_dm = lastname_dm
AND @firstname_dm = firstname_dm
THEN @MergeID ELSE @MergeID + 1 END,
@city1_dm = city1_dm,
@lastname_dm = lastname_dm,
@firstname_dm = firstname_dm,
MergeID = @MergeID
SELECT
MergeID,
primary_id,
city1_dm,
lastname_dm,
firstname_dm
FROM #Dupes
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
December 21, 2009 at 11:53 am
ehlinger, I apologize for any confusion, but I just took the time to finish reading Jeff's article, and I realized I have omitted two important aspects; locking the table, and preventing paralellism from affecting the update. Use the below for the update, all the rest is the same. Notice the inclusion of the FROM clause and the MAXDOP option.
UPDATE #Dupes
SET @MergeID = CASE WHEN @city1_dm = city1_dm
AND @lastname_dm = lastname_dm
AND @firstname_dm = firstname_dm
THEN @MergeID ELSE @MergeID + 1 END,
@city1_dm = city1_dm,
@lastname_dm = lastname_dm,
@firstname_dm = firstname_dm,
MergeID = @MergeID
FROM #Dupes WITH (TABLOCKX)
OPTION (MAXDOP 1)
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
December 21, 2009 at 3:34 pm
Greg,
No need to apologize. Thank you for all of your help. Works like a charm with the test data. I added more fields to match on and in actuality may need to set up a more complex join on those but I'll tackle that later if I have to. Getting ready to run it on some actual data and think it should be fine. I've included the end result. Thanks again.
Side note: Still can't post code with the word drop in it. It won't preview or post. I noticed you had no problem and just thought I'd ask if you are doing anything special.
SELECT DISTINCT
t1.primary_id,
t1.city1,
t1.city1_dm,
t1.city1_dm2,
t1.firstname,
t1.firstname_dm,
t1.firstname_dm2,
t1.lastname,
t1.lastname_dm,
t1.lastname_dm2,
t1.birthday
INTO #Dupes
FROM #Mergelist t1
INNER JOIN #MergeList t2 ON ISNULL(t1.city1_dm,'') = ISNULL(t2.city1_dm,'')
AND ISNULL(t1.city1_dm2,'') = ISNULL(t2.city1_dm2,'')
AND ISNULL(t1.firstname_dm,'') = ISNULL(t2.firstname_dm,'')
AND ISNULL(t1.firstname_dm2,'') = ISNULL(t2.firstname_dm2,'')
AND ISNULL(t1.lastname_dm,'') = ISNULL(t2.lastname_dm,'')
AND ISNULL(t1.lastname_dm2,'') = ISNULL(t2.lastname_dm2,'')
AND t1.birthday = t2.birthday
ORDER BY
t1.city1_dm,
t1.city1_dm2,
t1.firstname_dm,
t1.firstname_dm2,
t1.lastname_dm,
t1.lastname_dm2,
t1.birthday,
t1.primary_id
CREATE CLUSTERED INDEX IXC_lastnameDM_firstnameDM_city1DM_primary_id
ON #Dupes (city1_dm, city1_dm2, firstname_dm, firstname_dm2, lastname_dm, lastname_dm2, birthday, primary_id)
ALTER TABLE #Dupes
ADD MergeID INT
GO
DECLARE @city1_dmVARCHAR(5)
DECLARE @city1_dm2VARCHAR(5)
DECLARE @lastname_dmVARCHAR(5)
DECLARE @lastname_dm2VARCHAR(5)
DECLARE @firstname_dmVARCHAR(5)
DECLARE @firstname_dm2VARCHAR(5)
DECLARE @birthdayDATETIME
DECLARE @MergeIDINT
SET @city1_dm = ''
SET @city1_dm2 = ''
SET @firstname_dm = ''
SET @firstname_dm2 = ''
SET @lastname_dm = ''
SET @lastname_dm2 = ''
SET @birthday = ''
SET @MergeID = 0
UPDATE #Dupes
SET @MergeID =
CASE WHEN @city1_dm = ISNULL(city1_dm, '')
AND @city1_dm2 = ISNULL(city1_dm2, '')
AND @firstname_dm = ISNULL(firstname_dm, '')
AND @firstname_dm2 = ISNULL(firstname_dm2, '')
AND @lastname_dm = ISNULL(lastname_dm, '')
AND @lastname_dm2 = ISNULL(lastname_dm2, '')
AND @birthday = birthday
THEN @MergeID ELSE @MergeID + 1 END,
@city1_dm = ISNULL(city1_dm, ''),
@city1_dm2 = ISNULL(city1_dm2, ''),
@firstname_dm = ISNULL(firstname_dm, ''),
@firstname_dm2 = ISNULL(firstname_dm2, ''),
@lastname_dm = ISNULL(lastname_dm, ''),
@lastname_dm2 = ISNULL(lastname_dm2, ''),
@birthday = birthday,
MergeID = @MergeID
FROM #Dupes WITH (TABLOCKX)
OPTION (MAXDOP 1)
SELECT *
FROM #Dupes
December 22, 2009 at 6:48 am
ehlinger (12/21/2009)
Side note: Still can't post code with the word drop in it. It won't preview or post. I noticed you had no problem and just thought I'd ask if you are doing anything special.
DROP this
DROP that
DROP the other
I don't know what to tell you, I have never heard of your problem. It may matter which browser you are using. Also, if you are using a local temp table, you probably don't need to worry about locking it, but it may not hurt to get in the habit. Thanks for the feedback, and I'm glad I could help.
Greg
_________________________________________________________________________________________________
The glass is at one half capacity: nothing more, nothing less.
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply