Identifying Duplicates (Not the usual)

Question

Identifying Duplicates (Not the usual)

rozza

SSC Enthusiast

Points: 167
More actions
August 27, 2003 at 1:16 pm

#80872

Hi,
I currently use some software which identifies duplicate records for me based on Initial, Surname, Property and Postcode so that I don't have to worry about doing it. The way that it identifies duplicate records is by entering a value in a field called dup_code. The problem is that it's not that clever when you have for example the following records all at the same address:
S Andrews
Sarah Andrews
Sharon Andrews
The software will mark all these records as duplicates because the first letter of the "Forename" field is "S". The software identifies the records as duplicates in the following way:
Forename Surname Dup_Code
S Andrews 1
Sarah Andrews 1
Sharon Andrews 1
Paul Rowling 2
P Rowling 2
Ozzie Osborne 0
In this example any record that does not have a duplicate record is always given a dup_code of 0. All other records are given a unique dup_code which will match with an associated duplicate record.
Because there is no way to identify if S Andrews is a duplicate with Sarah or Sharon, I want to update their dup_code to 0, but leave all others intact.
You may ask why I want to do this? Well, the DB that I have stores customer info in a customer table and transactions (such as what the customer bought, when they bought it and how much) in a transactions table. Therefore I want to be able to identify duplicate customer records, transfer the transactions of the duplicate person I want to delete to the person I want to keep and then delete the duplicate person. In the instance of the "Andrews" example above, I don't know which person to transfer the transactions to and so because I cannot do this accurately I will just take a hit and leave them as they are.
The actual field definitions are as follows:
Customers.forename
Customers.surname
Customers.dup_code
Any help would be greatly appreciated.
Cheers
Paul

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

Lowell SSC Guru Points: 323518 More actions · Answer 1

I think you'll have to use more than just the customers first and last names to cull duplications out.

I would think that it would be very easy to have two distinct customers with the same name, but living in two different places.

Can you give us the structure of the customer table, and if it links to an address table, that too?

I'm bet that with the customer name, zipcode, and email all combined, you can find the real duplicates.

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

rozza SSC Enthusiast Points: 167 More actions · Answer 2

Hi Lowell,

The customer table actually has all the address fields in there. The software that I use finds duplicates based on the customers initial, surname, property and postcode. The reason I need to id duplicates via inital is because the data I receive from shops has numerous duplicates whereby a customer record has beeen entered twice such as P Rowling, and Paul Rowling. These can be assumed to be the same person if they are at the same address, which my software does for me. The problem I have is that the software identifies the following example as duplicates:

P Rowling

Paul Rowling

Peter Rowling

The above cannot be assumed because P Rowling could be the same person as Paul or Peter. In this case I want to say that none of the records are duplicates.

The fields of interest in the customer table are:

forename

surname

dup_code

property

street

locality

town

county

postcode

Thanks for your help

Paul

fromnaija SSCrazy Points: 2876 More actions · Answer 3

If you had a CustomerNo as the primary key in your table and accept user input via the CustomerNo, you wouldn't have this problem.

What logic would you use to know if Sarah Andrews or Sharon Andrews is a duplicate of S Andrews? You need a manual intervention to do that. I don't envy your job!

Joseph

NPeeters SSChampion Points: 12535 More actions · Answer 4

This would give you the customers that are in this situation :



SELECT cust.* FROM Customers cust

WHERE cust.dup_code = 1

AND EXISTS

(SELECT 1 FROM Customers c2

 WHERE c2.surname = cust.surname

   AND LEFT(c2.forename,1) = left(c2.forename,1)

   GROUP BY c2.surname, left(c2.forename,1)

   HAVING COUNT(*) > 2)

rozza SSC Enthusiast Points: 167 More actions · Answer 5

Just for information, we receive the data from a shop till system whereby the primary key is a concatenation of the customers forename, surname and date of birth. This primary key is pretty useless as the DOB is rarely recorded properly, hence you can see why so many duplicates can arise.

With regards to Sarah Andrews, Sharon Andrews and S Andrews this exactly my point. You cannot identify if S Andrews is a duplicate with Sarah or Sharon, I agree this would require manual intervention which is not feasible therefore I want to flag them as non duplicates.

Cheers

Paul