De-Dupe

Question

Post reply

De-Dupe

DBABC

SSCarpal Tunnel

Points: 4813
More actions
October 26, 2009 at 8:53 am

#136644

Whats the best way to find the duplicate data in a sql server table?
I am using SOUNDEX on Company Name, but it does not work best all the time. What's the alternates? Does anyone know about any third part tools?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

steveb. SSC-Forever Points: 46733 More actions · Answer 1

I like to use T-SQL with ROW_COUNT()

do you have a unique key that identifies the row?

Can you provide some more details about the table and data?

Steve Jones - SSC Editor SSC Guru Points: 736468 More actions · Answer 2

Likely you would need multiple passes to find typos. Duplicates would be exact matches. typically that's a

select dupfield

, count (*)

from mytable

group by dupfield

having count(*) > 1

If you are looking for close entries that are typo'd, soundex might work, but you'd likely need a loose filter and then manually clean things up.

DBABC SSCarpal Tunnel Points: 4813 More actions · Answer 3

I have a company table containing all the information about different companies (name, contact, address etc).

I want to identify the closely matching companies based on company name.

e.g i want to identify "White horse Company" & "White horse Co."

Currently i am using SOUNDEX and it works well but not a best option as at times it gives weird matching results.

smeet.sinha Mr or Mrs. 500 Points: 577 More actions · Answer 4

U can't do it without having a certain rule defined to the search criteria, if u want to do it through coding.

Like:-

White horse Company White horse Company match

White horse Company White whore Company not a match

White horse Company White horse Co match

White horse Company White horse Organization match

White horse Company Horse White Company match

White horse Company Horse White Organization match

Above mentioned rules are possible to implement in the coding.

Or you can try SSIS for fuzzy lookup

Ed Zann SSCrazy Eights Points: 8694 More actions · Answer 5

I would suggest the "Fuzzy Match" feature in SSIS. I have used it a couple of times to cross reference data between different systems and it has been very useful.

DBABC SSCarpal Tunnel Points: 4813 More actions · Answer 6

Ed-86789 (10/28/2009)
I would suggest the "Fuzzy Match" feature in SSIS. I have used it a couple of times to cross reference data between different systems and it has been very useful.

Thanks, Can you suggest a good article explaining how it works and how to use it?

Ed Zann SSCrazy Eights Points: 8694 More actions · Answer 7

Don't know of an article, but if you type "Fuzzy Lookup Transformation " in BOL it will describe it. Assuming you're familiar with SSIS, this is a component in a Data Flow. You supply an input result set, such as a table or query, and the lookup table that you want to match against ( in your case the same table - I think that should be possible - or a copy of it ). Then you designate which fields you want to match.

It will output another result set containing data from the original and lookup sets and a ranking of similarity. You could use that ranking as a basis for de-duping. You would need to throw out the 100% matches of the record to itself.

DBABC SSCarpal Tunnel Points: 4813 More actions · Answer 8

DBABC

SSCarpal Tunnel

Points: 4813

October 28, 2009 at 2:13 pm

#1072058

Thanks for your help, reading BOL now.