February 25, 2015 at 8:39 am
I have a table with fields: FirstName Last_Name Date_of_Birth and ID that could have many "flavors" to First_Name & Last_Name.
Currently the table has an ID which is an AutoNumber, Integer.
Example:
First_Name Last_Name Date_of_Birth ID
John Doe 1/1/1984 1
John J Doe 1/1/1984 2
John James Doe 1/1/1984 3
John Doe, J 1/1/1984 4
James Smith 12/1/1965 5
James K Smith 12/1/1965 6
James Smith 8/1/1973 7
I am looking/hoping to accomplish the following.
Keep all rows; but have the ID update to the minimum ID.
First_Name Last_Name Date_of_Birth ID
John Doe 1/1/1984 1
John J Doe 1/1/1984 1
John James Doe 1/1/1984 1
John Doe, J 1/1/1984 1
James Smith 12/1/1965 5
James K Smith 12/1/1965 5
James Smith 8/1/1973 7
My SQL skills drop off to figure out how to do this.
Any help would be greatly appreciated.
February 25, 2015 at 9:07 am
its possible, but you have to have defined the matching criteria.
until you define logic for identifying the duplicates, you cannot consolidate duplicates. also, you probably need additional logic, since you might have addresses, email, phone sthat can be used to identify duplicates as well.
I've done this in a multiple phase approach:
exact matches first,
strip out whitespace and pucntuation as part of the match (so that Louis CK matches Louis C.K.)
charindex containing names
strip out + charindex.
here's a quick example of #3:
Create Table #dupes(
First_Name varchar(50),
Last_Name varchar(50),
Date_of_Birth date,
ID int
)
INSERT INTO #dupes
SELECT 'John','Doe','1/1/1984','1' UNION ALL
SELECT 'John J','Doe','1/1/1984','2' UNION ALL
SELECT 'John James','Doe','1/1/1984','3' UNION ALL
SELECT 'John','Doe, J','1/1/1984','4' UNION ALL
SELECT 'James','Smith','12/1/1965','5' UNION ALL
SELECT 'James','K Smith','12/1/1965','6' UNION ALL
SELECT 'James','Smith','8/1/1973','7'
select * from #dupes T1
inner join #dupes T2
ON T1.Date_of_Birth = T2.Date_of_Birth
AND CHARINDEX(T2.Last_Name,T1.Last_Name) >= 0
AND CHARINDEX(T2.First_Name,T1.First_Name) >= 0
WHERE T1.ID <> T2.ID
select * from #dupes T1
inner join #dupes T2
ON T1.Date_of_Birth = T2.Date_of_Birth
AND CHARINDEX(T1.Last_Name,T2.Last_Name) >= 0
AND CHARINDEX(T1.First_Name,T2.First_Name) >= 0
WHERE T1.ID <> T2.ID
Lowell
February 25, 2015 at 9:33 am
I messed up with my example.
It's more like this:
FIRST_NAME MIDDLE_INITIAL LAST_NAME DOB MEMBER_ID ID
John Doe 01/01/1984 AB123456 1
John J Doe 01/01/1984 AB123456 2
John James Doe 01/01/1984 AB123456 3
James Smith 12/01/1965 ZY987654 4
James K Smith 12/01/1965 ZY987654 5
James Smith 08/20/1973 BB754321 6
Where Member_ID is the unique key
so I would like it to end up like this:
FIRST_NAME MIDDLE_INITIAL LAST_NAME DOB MEMBER_ID ID
John Doe 01/01/1984 AB123456 1
John J Doe 01/01/1984 AB123456 1
John James Doe 01/01/1984 AB123456 1
James Smith 12/01/1965 ZY987654 4
James K Smith 12/01/1965 ZY987654 4
James Smith 08/20/1973 BB754321 6
February 25, 2015 at 9:39 am
This is a very common issue. Unfortunately there is no magic button for this. It requires human levels of intelligence to sort this out. You can leverage sql to get some but it takes eyes on the data to get this correct.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 25, 2015 at 9:53 am
If you have a Member_ID is your correct identifier for each person, then this should give you the correct information.
CREATE TABLE SampleData(
FIRST_NAME varchar(20),
MIDDLE_INITIAL varchar(20),
LAST_NAME varchar(20),
DOB date,
MEMBER_ID char(8),
ID int)
INSERT INTO SampleData
SELECT 'John','','Doe', '01/01/1984', 'AB123456', 1 UNION ALL
SELECT 'John','J','Doe', '01/01/1984', 'AB123456', 2 UNION ALL
SELECT 'John','James','Doe', '01/01/1984', 'AB123456', 3 UNION ALL
SELECT 'James','','Smith', '12/01/1965', 'ZY987654', 4 UNION ALL
SELECT 'James','K','Smith', '12/01/1965', 'ZY987654', 5 UNION ALL
SELECT 'James','','Smith', '08/20/1973', 'BB754321', 6
SELECT *, ID - ROW_NUMBER() OVER(PARTITION BY MEMBER_ID ORDER BY ID) + 1 MinID
FROM SampleData
ORDER BY MinID
GO
DROP TABLE SampleData
February 25, 2015 at 9:56 am
I didn't look at the new sample data until after you posted Luis. I assumed this was the very common "I have data that is mangled all over the place that is easy to see with human eyes that needs to be sorted". However, looking at the new sample data it is a lot cleaner than most. Well done again sir!
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
February 25, 2015 at 10:11 am
Thank you Luis! That worked GREAT!!
February 25, 2015 at 10:14 am
Sean Lange (2/25/2015)
I didn't look at the new sample data until after you posted Luis. I assumed this was the very common "I have data that is mangled all over the place that is easy to see with human eyes that needs to be sorted". However, looking at the new sample data it is a lot cleaner than most. Well done again sir!
Thank you, I thought the same thing until the new data came along and made everything easier.
The main problem was to simulate the RANK() but using the ID order.
February 27, 2015 at 11:44 am
Just to point out something for the more paranoid of us, I believe Luis' solution assumes that all rows with the same MEMBER_ID use consecutive IDs with no gaps.
If there are gaps, or rows with the same MEMBER_ID do not have consecutive IDs, then the results will not be what you expect, methinks.
For concreteness:
CREATE TABLE SampleData(
FIRST_NAME varchar(20),
MIDDLE_INITIAL varchar(20),
LAST_NAME varchar(20),
DOB date,
MEMBER_ID char(8),
ID int)
INSERT INTO SampleData
SELECT 'James','K','Smith', '12/01/1965', 'ZY987654', 1 UNION ALL
SELECT 'John','','Doe', '01/01/1984', 'AB123456', 2 UNION ALL
SELECT 'John','J','Doe', '01/01/1984', 'AB123456', 3 UNION ALL
SELECT 'John','James','Doe', '01/01/1984', 'AB123456', 4 UNION ALL
SELECT 'James','','Smith', '12/01/1965', 'ZY987654', 5 UNION ALL
SELECT 'James','','Smith', '08/20/1973', 'BB754321', 6
SELECT *, ID - ROW_NUMBER() OVER(PARTITION BY MEMBER_ID ORDER BY ID) + 1 MinID
FROM SampleData
ORDER BY MinID
GO
DROP TABLE SampleData
--The above will give different MinIDs for each James Smith with MEMBER_ID ZY987654
CREATE TABLE SampleData(
FIRST_NAME varchar(20),
MIDDLE_INITIAL varchar(20),
LAST_NAME varchar(20),
DOB date,
MEMBER_ID char(8),
ID int)
INSERT INTO SampleData
SELECT 'John','','Doe', '01/01/1984', 'AB123456', 1 UNION ALL
SELECT 'John','J','Doe', '01/01/1984', 'AB123456', 2 UNION ALL
SELECT 'John','James','Doe', '01/01/1984', 'AB123456', 4 UNION ALL
SELECT 'James','','Smith', '12/01/1965', 'ZY987654', 5 UNION ALL
SELECT 'James','K','Smith', '12/01/1965', 'ZY987654', 6 UNION ALL
SELECT 'James','','Smith', '08/20/1973', 'BB754321', 7
SELECT *, ID - ROW_NUMBER() OVER(PARTITION BY MEMBER_ID ORDER BY ID) + 1 MinID
FROM SampleData
ORDER BY MinID
GO
DROP TABLE SampleData
--This one will give a different MinID for John James Doe than for John Doe and John J Doe.
Perhaps the data is so arranged that this isn't a problem (or I'm misunderstanding the scenario entirely, which is all too possible on a day with no coffee!), but I thought I'd mention this just to be safe 🙂
February 27, 2015 at 7:30 pm
Borrowing the table and data creation
CREATE TABLE SampleData(
FIRST_NAME varchar(20),
MIDDLE_INITIAL varchar(20),
LAST_NAME varchar(20),
DOB date,
MEMBER_ID char(8),
ID int)
INSERT INTO SampleData
SELECT 'John','','Doe', '01/01/1984', 'AB123456', 1 UNION ALL
SELECT 'John','J','Doe', '01/01/1984', 'AB123456', 2 UNION ALL
SELECT 'John','James','Doe', '01/01/1984', 'AB123456', 3 UNION ALL
SELECT 'James','','Smith', '12/01/1965', 'ZY987654', 4 UNION ALL
SELECT 'James','K','Smith', '12/01/1965', 'ZY987654', 5 UNION ALL
SELECT 'James','','Smith', '08/20/1973', 'BB754321', 6
You can also do
update t
set t.id = uv.minID
FROM sampleData t Join (
SELECT year(dob) as yr, min(id) as minId
from sampleData
group by dob
) uv
ON uv.yr = year(t.dob)
----------------------------------------------------
March 3, 2015 at 5:52 am
Perhaps this reading would be useful here:
http://databases.about.com/od/specificproducts/a/normalization.htm
2nd normal form is what you need to solve the problem.
All the name variations must be removed to a separate table.
_____________
Code for TallyGenerator
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply