August 27, 2003 at 1:16 pm
Hi,
I currently use some software which identifies duplicate records for me based on Initial, Surname, Property and Postcode so that I don't have to worry about doing it. The way that it identifies duplicate records is by entering a value in a field called dup_code. The problem is that it's not that clever when you have for example the following records all at the same address:
S Andrews
Sarah Andrews
Sharon Andrews
The software will mark all these records as duplicates because the first letter of the "Forename" field is "S". The software identifies the records as duplicates in the following way:
Forename Surname Dup_Code
S Andrews 1
Sarah Andrews 1
Sharon Andrews 1
Paul Rowling 2
P Rowling 2
Ozzie Osborne 0
In this example any record that does not have a duplicate record is always given a dup_code of 0. All other records are given a unique dup_code which will match with an associated duplicate record.
Because there is no way to identify if S Andrews is a duplicate with Sarah or Sharon, I want to update their dup_code to 0, but leave all others intact.
You may ask why I want to do this? Well, the DB that I have stores customer info in a customer table and transactions (such as what the customer bought, when they bought it and how much) in a transactions table. Therefore I want to be able to identify duplicate customer records, transfer the transactions of the duplicate person I want to delete to the person I want to keep and then delete the duplicate person. In the instance of the "Andrews" example above, I don't know which person to transfer the transactions to and so because I cannot do this accurately I will just take a hit and leave them as they are.
The actual field definitions are as follows:
Customers.forename
Customers.surname
Customers.dup_code
Any help would be greatly appreciated.
Cheers
Paul
August 27, 2003 at 2:23 pm
I think you'll have to use more than just the customers first and last names to cull duplications out.
I would think that it would be very easy to have two distinct customers with the same name, but living in two different places.
Can you give us the structure of the customer table, and if it links to an address table, that too?
I'm bet that with the customer name, zipcode, and email all combined, you can find the real duplicates.
Lowell
Lowell
August 27, 2003 at 3:35 pm
Hi Lowell,
The customer table actually has all the address fields in there. The software that I use finds duplicates based on the customers initial, surname, property and postcode. The reason I need to id duplicates via inital is because the data I receive from shops has numerous duplicates whereby a customer record has beeen entered twice such as P Rowling, and Paul Rowling. These can be assumed to be the same person if they are at the same address, which my software does for me. The problem I have is that the software identifies the following example as duplicates:
P Rowling
Paul Rowling
Peter Rowling
The above cannot be assumed because P Rowling could be the same person as Paul or Peter. In this case I want to say that none of the records are duplicates.
The fields of interest in the customer table are:
forename
surname
dup_code
property
street
locality
town
county
postcode
Thanks for your help
Paul
August 28, 2003 at 7:56 am
If you had a CustomerNo as the primary key in your table and accept user input via the CustomerNo, you wouldn't have this problem.
What logic would you use to know if Sarah Andrews or Sharon Andrews is a duplicate of S Andrews? You need a manual intervention to do that. I don't envy your job!
Joseph
August 28, 2003 at 9:10 am
This would give you the customers that are in this situation :
SELECT cust.* FROM Customers cust
WHERE cust.dup_code = 1
AND EXISTS
(SELECT 1 FROM Customers c2
WHERE c2.surname = cust.surname
AND LEFT(c2.forename,1) = left(c2.forename,1)
GROUP BY c2.surname, left(c2.forename,1)
HAVING COUNT(*) > 2)
August 28, 2003 at 3:29 pm
Just for information, we receive the data from a shop till system whereby the primary key is a concatenation of the customers forename, surname and date of birth. This primary key is pretty useless as the DOB is rarely recorded properly, hence you can see why so many duplicates can arise.
With regards to Sarah Andrews, Sharon Andrews and S Andrews this exactly my point. You cannot identify if S Andrews is a duplicate with Sarah or Sharon, I agree this would require manual intervention which is not feasible therefore I want to flag them as non duplicates.
Cheers
Paul
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply