Merge similar records from excel

Question

Merge similar records from excel

reamades

SSC Rookie

Points: 30
More actions
May 19, 2015 at 10:24 am

#298499

I am working on a project that will require me to get a flat data file (excel spreadsheet) with hundreds of thousands of records. Each record is an Owner, and specifically what they own. There will be a field for OwnerName that I want to figure out a way to pull the data into a database like;
Table(Owners) - make sure owner is listed only once
Table(Properties) - joined to owners showing all properties that person owns
Now the tricky part, the owner names might not be exactly the same. Some records might have;
Smith, John
John Smith
Smith, John T
etc.
To make matters worse, this will be a continuous process. I will receive updated excel spreadsheets from time to time and will need to import the new records, many times overwriting the old data. For the good news, there should be an OwnerID that will be unique within the excel data. So as I am merging similar records into the Owners table, I should have a list of OwnerID's that can forever be used to link to the owner.
Any thoughts on the best process for something like this?

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Lowell SSC Guru Points: 323518 More actions · Answer 1

i think you need to provider more information.

it is certainly possible that two people have the same name...how do you know John Smith is the same or a different, new John Smith?

do you have better elements for uniqueness, like phone number or email address, for example?

you said you will your source have a uniqueness value, like the OwnerID in your Source excel, so what's wrong with changing the name to the last value received?

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

reamades SSC Rookie Points: 30 More actions · Answer 2

Sure, the flat data file has;

OwnerID

OwnerName

OwnerAddress

OwnerCity

OwnerState

OwnerZip

PropertyDescription

County

etc.

So John Smith might have OwnerID=1 in Tarrant County. And there might be Smith, John in Reno County that has OwnerID=205. I'm still wrapping my head around the best way to accomplish this, but I'm thinking I might want to find something like owner name is similar AND zip code is the same. I know it's not perfect but it should be fairly accurate in identifying the same person. I could also use owner name is similar AND address is similar. I would have to use a similar logic as I know the data will not be character for character the same.

So, any ideas on how to perform this sort of data organization?

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 3

reamades (5/19/2015)
Sure, the flat data file has;
OwnerID
OwnerName
OwnerAddress
OwnerCity
OwnerState
OwnerZip
PropertyDescription
County
etc.
So John Smith might have OwnerID=1 in Tarrant County. And there might be Smith, John in Reno County that has OwnerID=205. I'm still wrapping my head around the best way to accomplish this, but I'm thinking I might want to find something like owner name is similar AND zip code is the same. I know it's not perfect but it should be fairly accurate in identifying the same person. I could also use owner name is similar AND address is similar. I would have to use a similar logic as I know the data will not be character for character the same.
So, any ideas on how to perform this sort of data organization?

Is the "OwnerID" any good? The way your'e talking about doing it, could (and most likely will) cause you to group multiple distinct individuals.

When I've had to do this in the past, I used a point system... If name was a match, N points, if address was a match, N points... At the end, add up the points. If the total meets or exceeds a threashold, they are deemed a match. If not, they are regarded as separate individuals.

As far as how to distribute the data... Start by landing the into a single staging table and then parse it out to a normalized table structure from there.