weekly Duplicate checking

Question

Post reply

weekly Duplicate checking

Neuro

Ten Centuries

Points: 1063
More actions
February 14, 2006 at 7:30 am

#113393

Im looking at ways of keep duplicates on our system to a low. one way that I wanted to try was to match all the data entered in a week to the current data. and then score the data each time.
For example my query below is what ive started but realise its much harder and involved at teh same time.
SELECT CX.ROW_ID, CX.FST_NAME, CX.LAST_NAME, F.NAME ,'PARENT' =CASE WHEN CX.CREATED > '2006-02-01 00:00:00.000' THEN 'X' ELSE '' END
FROM S_CONTACT CX
JOIN FIRM F ON F.ID = CX.FIRM_ID
              WHERE lower(CX.LAST_NAME) in (
                      SELECT Lower(C.LAST_NAME)
                      FROM CONTACT C
                      WHERE C.CREATED>'2006-02-01 00:00:00.000'
                      AND C.STATUS <> 'Marked For Archive')
AND CX.STATUS <> 'Marked For Archive'
Order by CX.LAST_NAME ASC,CX.CREATED desc, CX.FST_NAME
This gives me a nice list of Contacts listing the newest one at the top and ordering the rest in a relatively easyt to use list
Id like this to be a little smarter , ie showing only where
A) the initial is the same
B) if the firm name is similar (i wanted to do some kind of score based on the characters in the firm names

any ideas on how I could do this

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

AMIT GUPTA-263376 SSCarpal Tunnel Points: 4244 More actions · Answer 1

To find out the duplicates names you should use this query:-

select CX.FST_NAME+CX.LAST_NAME,count(*) num

from table_name

group by CX.FST_NAME+CX.LAST_NAME

having count(*) > 1

JeffB Hall of Fame Points: 3541 More actions · Answer 2

Try looking at the SOUNDEX and DIFFERENCE functions.

--Jeff

Neuro Ten Centuries Points: 1063 More actions · Answer 3

Thanks Amit unfortunately it never is that easy, suppose you have two people entered at the Smith & Co Company

The First is Abbey Smith, the second is A B Smith

This would mean that your query wouln't see that

Further to the query I posted though I was looking at Just joining as a subquery but I can't do that either as if someone is entered on to the database and someone else with the surname in the week, the report will return 2x the number of people sharing that name

Jeff, thanks for your note, on a much early post I did include Soundex and difference as part of the query but that seemed to hit the CPU much harder as the Names will have non Alpha numeric characters and inorder for the Query to work positively I would first strip down all the Characters in the Last/First name

and then when it scans the whole contact table I will need to remove the characters from each of the names in turn inorder to ensure that the SOUNDEX / DIFFERENCE worked

JeffB Hall of Fame Points: 3541 More actions · Answer 4

One thing to do is to store the SOUNDEX of each row as a column. Also, store the stripped down names as colums. It's easier to amortize the cost of manipulating each name over each insert/update than doing it everytime a query is done for all records.

--Jeff

Neuro Ten Centuries Points: 1063 More actions · Answer 5

Intresting, how do you mean store the value of soundex as a column ?(or did you mean store the value in a column)

unfortunately our DBA doesn't like us writing to the database unless its through the front end so just adding a column would take at least 2 months before it actually gets done.

time to create everything in excel then

JeffB Hall of Fame Points: 3541 More actions · Answer 6

What I mean is that as part of the code that inserts a new customer, populate a column with the SOUNDEX of the stripped down customer name. An example script would be:

create table Customer

(

CustomerID int identity(1,1),

--REST OF CUSTOMER FIELDS

CustomerSoundex char(4)

)

On the insert:

insert customer (...column declaration...)

values (...rest of customer data..., SOUNDEX(dbo.fnStripCustomerName(...))

You would also have to write the function dbo.fnStripCustomerName() to take the customer name arguements and strip and concatentate them according to the rule you mentioned.