October 26, 2009 at 8:53 am
Whats the best way to find the duplicate data in a sql server table?
I am using SOUNDEX on Company Name, but it does not work best all the time. What's the alternates? Does anyone know about any third part tools?
October 26, 2009 at 8:55 am
I like to use T-SQL with ROW_COUNT()
do you have a unique key that identifies the row?
Can you provide some more details about the table and data?
October 26, 2009 at 8:59 am
Likely you would need multiple passes to find typos. Duplicates would be exact matches. typically that's a
select dupfield
, count (*)
from mytable
group by dupfield
having count(*) > 1
If you are looking for close entries that are typo'd, soundex might work, but you'd likely need a loose filter and then manually clean things up.
October 26, 2009 at 9:48 am
I have a company table containing all the information about different companies (name, contact, address etc).
I want to identify the closely matching companies based on company name.
e.g i want to identify "White horse Company" & "White horse Co."
Currently i am using SOUNDEX and it works well but not a best option as at times it gives weird matching results.
October 27, 2009 at 7:56 am
U can't do it without having a certain rule defined to the search criteria, if u want to do it through coding.
Like:-
White horse Company White horse Company match
White horse Company White whore Company not a match
White horse Company White horse Co match
White horse Company White horse Organization match
White horse Company Horse White Company match
White horse Company Horse White Organization match
Above mentioned rules are possible to implement in the coding.
Or you can try SSIS for fuzzy lookup
October 28, 2009 at 9:02 am
I would suggest the "Fuzzy Match" feature in SSIS. I have used it a couple of times to cross reference data between different systems and it has been very useful.
October 28, 2009 at 1:11 pm
Ed-86789 (10/28/2009)
I would suggest the "Fuzzy Match" feature in SSIS. I have used it a couple of times to cross reference data between different systems and it has been very useful.
Thanks, Can you suggest a good article explaining how it works and how to use it?
October 28, 2009 at 1:31 pm
Don't know of an article, but if you type "Fuzzy Lookup Transformation " in BOL it will describe it. Assuming you're familiar with SSIS, this is a component in a Data Flow. You supply an input result set, such as a table or query, and the lookup table that you want to match against ( in your case the same table - I think that should be possible - or a copy of it ). Then you designate which fields you want to match.
It will output another result set containing data from the original and lookup sets and a ranking of similarity. You could use that ranking as a basis for de-duping. You would need to throw out the 100% matches of the record to itself.
October 28, 2009 at 2:13 pm
Thanks for your help, reading BOL now.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply