pattern matching?

RightOnTarget

Hall of Fame

Points: 3041
December 12, 2012 at 4:41 pm

#271627

Hi all,
I have a task to clean up data in one of our tables. The column I need to clean up holds business names, but the same business can appear in many different forms.
For instance:
Costco
COSTCO
Costco Whls
Costco Wholesale
Costco Whls llc
What is available to me in SQL Server 2008 R2 that can help me accomplish this? How would you approach it?
Thanks,


roryp 96873 SSCertifiable Points: 6649 · Answer 1

If you're looking for every instance of a particular character string, you can use this:

select *
from mytable
where bus_name like '%costco%';

RightOnTarget Hall of Fame Points: 3041 · Answer 2

Thanks for the reply.

The table that holds business names has 20M rows.

I can't specify a business name because I don't know the names in advance, and otherwise I would have to do it for every business name in the table.

In the following example, how would you use "like"?

Costco

Costco LLC

Costco Whls

Home Interiors Malaga

Home Plumbing

Home Property Management

Home Realty

Home Svc

roryp 96873 SSCertifiable Points: 6649 · Answer 3

eugene.pipko (12/12/2012)
Thanks for reply.
The table that holds business names has 20M rows.
I can't specify a business name because I don't know it or will have to do it for every business name in the table.
In the following example, how would you use "like"?
Costco
Costco LLC
Costco Whls
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc

I guess I'm not quite sure what you are looking for in this example. Are all those business names considered to be the same in this case?

RightOnTarget Hall of Fame Points: 3041 · Answer 4

I am looking for a way to say:

Based on the list here:

---------------------

Costco

Costco LLC

Costco Whls

Home Interiors Malaga

Home Plumbing

Home Property Management

Home Realty

Home Svc

These are unique business names:

---------------------

Costco

Home Interiors Malaga

Home Plumbing

Home Property Management

Home Realty

Home Svc

David Webb-CDS SSCoach Points: 17398 · Answer 5

Do you have anything else in the row that might help to find duplicates? Address? Phone? DUNS number? Federal tax ID?

What makes a business unique? Costco has a lot of stores. Is each one unique, or should there be only one row for the parent company?

If you have only the name to go on, I'd recommend you hire one of the services that do this for a living to help clean up your data. The rules for this are extremely complex, and most folks who do this won't guarantee that they will ever get to a 100% cleanup. Dun and Bradstreet has a service for this (I don't work for them) and I'm sure there are others.

And then again, I might be wrong ...
David Webb

RightOnTarget Hall of Fame Points: 3041 · Answer 6

David,

I don't have any other supporting data. What I have is inconsistent.

In the Costco example, it should be one parent company, not multiple stores.

Thanks,

Orlando Colamatteo SSC Guru Points: 182293 · Answer 7

SELECT SOUNDEX('Costco'),
       SOUNDEX('COSTCO'),
       SOUNDEX('Costco Whls'),
       SOUNDEX('Costco Wholesale'),
       SOUNDEX('Costco Whls llc');
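
One way those codes might be put to work, as a rough sketch only: group the names by their SOUNDEX code and list the codes shared by more than one name. The sample rows come from the question; the derived-table alias v and the column aliases are made up for illustration, and the table value constructor assumes SQL Server 2008 or later. DIFFERENCE is the built-in companion function that compares two SOUNDEX codes and returns 0 (least similar) to 4 (most similar).

```sql
-- Assumption: sample names stand in for the real table; v(bus_name) is illustrative only.
SELECT SOUNDEX(v.bus_name) AS SoundX,
       COUNT(*)            AS NameCount,   -- how many names share this code
       MIN(v.bus_name)     AS FirstName,
       MAX(v.bus_name)     AS LastName
FROM (VALUES ('Costco'), ('COSTCO'), ('Costco Whls'),
             ('Costco Wholesale'), ('Costco Whls llc')) AS v (bus_name)
GROUP BY SOUNDEX(v.bus_name)
HAVING COUNT(*) > 1                         -- keep only codes with potential duplicates
ORDER BY NameCount DESC;

-- Pairwise check: 0 = least similar, 4 = most similar SOUNDEX codes.
SELECT DIFFERENCE('Costco', 'Costco Wholesale') AS Similarity;
```

Because SOUNDEX reduces each name to a four-character code, unrelated businesses can land in the same group, so the output is a list of candidates to review rather than confirmed duplicates.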


Lowell SSC Guru Points: 323755 · Answer 8

Wow, that's going to be tough.

The only thing I could think of was a combination of opc.three's example and a join against a list of common suffixes to find potential duplicates, but of course that is going to be an ongoing thing as you dig deeper into the data.

Something like this is what I thought might be a starting point:

WITH MySampleData (CompanyName)
AS
(
    SELECT 'Costco' UNION ALL
    SELECT 'Costco LLC' UNION ALL
    SELECT 'Costco Whls' UNION ALL
    SELECT 'Home Interiors Malaga' UNION ALL
    SELECT 'Home Plumbing' UNION ALL
    SELECT 'Home Property Management' UNION ALL
    SELECT 'Home Realty' UNION ALL
    SELECT 'Home Svc'
),
CommonSuffixes (val)
AS
(
    SELECT ' Inc' UNION ALL
    SELECT ' LLC' UNION ALL
    SELECT ' Company' UNION ALL
    SELECT ' Co'
)
SELECT
    ROW_NUMBER() OVER (PARTITION BY SOUNDEX(CompanyName) ORDER BY CompanyName) AS RW,
    SOUNDEX(CompanyName) AS SoundX,
    *
FROM MySampleData
LEFT OUTER JOIN CommonSuffixes
    ON CHARINDEX(CommonSuffixes.val, MySampleData.CompanyName) > 0
--WHERE CommonSuffixes.val IS NOT NULL --(turns the LEFT OUTER into an inner join, i know)
ORDER BY CompanyName, RW;

Lowell


Eugene Elutin SSC Guru Points: 59322 · Answer 9

Without artificial intelligence that is on par with human knowledge of which names refer to the same company and which ones, even very similar ones, do not, it is simply impossible to do what you want in plain code (regardless of programming language).

This is a data cleansing exercise, and it will always require some manual intervention.

I can only suggest a couple of ways:

Use SSIS; it has a Fuzzy Grouping transformation which is designed primarily for data cleansing tasks.

Create a database of company name variations.
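
As a minimal sketch of the second suggestion, assuming the variations are maintained by hand: a lookup table maps each known variation to one canonical name, and the cleanup is a join against it. Every object name here (@NameVariations, Variation, CanonicalName, @Company) is hypothetical, and table variables are used only to keep the example self-contained; a real cleanup would use permanent tables.

```sql
-- Hypothetical lookup of known spellings; in practice this would be a permanent, curated table.
DECLARE @NameVariations TABLE
(
    Variation     varchar(100) NOT NULL PRIMARY KEY,
    CanonicalName varchar(100) NOT NULL
);

INSERT INTO @NameVariations (Variation, CanonicalName)
VALUES ('COSTCO',           'Costco'),
       ('Costco LLC',       'Costco'),
       ('Costco Whls',      'Costco'),
       ('Costco Wholesale', 'Costco'),
       ('Costco Whls llc',  'Costco');

-- Stand-in for the real business table (sample rows from the thread).
DECLARE @Company TABLE (bus_name varchar(100) NOT NULL);

INSERT INTO @Company (bus_name)
VALUES ('Costco'), ('Costco LLC'), ('Costco Whls'),
       ('Home Plumbing'), ('Home Realty');

-- Names with a known variation get the canonical name; all others pass through unchanged.
SELECT c.bus_name,
       COALESCE(v.CanonicalName, c.bus_name) AS CleanName
FROM @Company AS c
LEFT JOIN @NameVariations AS v
       ON v.Variation = c.bus_name;
```

The join itself is cheap; the manual effort goes into building and maintaining the variation list, which is where the human review comes in.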


Abu Dina SSChampion Points: 14155 · Answer 10

You may want to check this link: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

There are various implementations of this algorithm in T-SQL and CLR, so it should be easy to Google for a ready-made function.

I've not tested this on many company names but it seems to return better matches as below:

If you need any further help then let me know.
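
For readers who want to see what such a function looks like, here is a rough T-SQL sketch of Jaro-Winkler similarity, written from the standard definition rather than taken from any particular published implementation. The function name dbo.JaroWinklerSketch is hypothetical, the comparison is made case-insensitive with UPPER, and the 0.1 prefix weight with a four-character prefix cap are the usual defaults, assumed here.

```sql
-- Hypothetical helper: returns a similarity between 0 (nothing in common) and 1 (identical).
CREATE FUNCTION dbo.JaroWinklerSketch (@s1 varchar(200), @s2 varchar(200))
RETURNS float
AS
BEGIN
    IF @s1 IS NULL OR @s2 IS NULL RETURN NULL;
    SET @s1 = UPPER(LTRIM(RTRIM(@s1)));
    SET @s2 = UPPER(LTRIM(RTRIM(@s2)));

    DECLARE @len1 int, @len2 int, @window int, @i int, @j int, @hi int,
            @m int, @t int, @k int, @prefix int, @jaro float,
            @flags1 varchar(200), @flags2 varchar(200);
    SET @len1 = LEN(@s1);
    SET @len2 = LEN(@s2);
    IF @len1 = 0 AND @len2 = 0 RETURN 1.0;
    IF @len1 = 0 OR @len2 = 0 RETURN 0.0;

    -- Two characters can match only if their positions differ by at most @window.
    SET @window = CASE WHEN @len1 > @len2 THEN @len1 ELSE @len2 END / 2 - 1;
    IF @window < 0 SET @window = 0;
    SET @flags1 = REPLICATE('0', @len1);   -- '1' marks a matched position in @s1
    SET @flags2 = REPLICATE('0', @len2);   -- '1' marks a matched position in @s2
    SET @m = 0;
    SET @i = 1;

    WHILE @i <= @len1
    BEGIN
        SET @j  = CASE WHEN @i - @window < 1 THEN 1 ELSE @i - @window END;
        SET @hi = CASE WHEN @i + @window > @len2 THEN @len2 ELSE @i + @window END;
        WHILE @j <= @hi
        BEGIN
            IF SUBSTRING(@flags2, @j, 1) = '0'
               AND SUBSTRING(@s1, @i, 1) = SUBSTRING(@s2, @j, 1)
            BEGIN
                SET @flags1 = STUFF(@flags1, @i, 1, '1');
                SET @flags2 = STUFF(@flags2, @j, 1, '1');
                SET @m = @m + 1;
                BREAK;
            END
            SET @j = @j + 1;
        END
        SET @i = @i + 1;
    END
    IF @m = 0 RETURN 0.0;

    -- Count matched characters that line up out of order (each counts as half a transposition).
    SET @t = 0;
    SET @k = 1;
    SET @i = 1;
    WHILE @i <= @len1
    BEGIN
        IF SUBSTRING(@flags1, @i, 1) = '1'
        BEGIN
            WHILE SUBSTRING(@flags2, @k, 1) = '0' SET @k = @k + 1;
            IF SUBSTRING(@s1, @i, 1) <> SUBSTRING(@s2, @k, 1) SET @t = @t + 1;
            SET @k = @k + 1;
        END
        SET @i = @i + 1;
    END

    SET @jaro = (1.0 * @m / @len1 + 1.0 * @m / @len2 + (@m - @t / 2.0) / @m) / 3.0;

    -- Winkler adjustment: reward a shared prefix of up to four characters.
    SET @prefix = 0;
    WHILE @prefix < 4 AND @prefix < @len1 AND @prefix < @len2
          AND SUBSTRING(@s1, @prefix + 1, 1) = SUBSTRING(@s2, @prefix + 1, 1)
        SET @prefix = @prefix + 1;

    RETURN @jaro + @prefix * 0.1 * (1.0 - @jaro);
END
```

Once created, it can be called in a separate batch, for example SELECT dbo.JaroWinklerSketch('Costco', 'Costco Whls'). A scalar function like this is slow when applied row by row, so on a table of this size it would probably only be practical on pre-grouped candidate pairs, not on every pair of names.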

Good luck!


RightOnTarget Hall of Fame Points: 3041 · Answer 11

Thank you all for the replies,

It's an interesting problem and I thought it would be fun to take a crack at it, but I agree with you, Eugene, that it will take manual work no matter what.

And since the company table has 17M rows, I'm not sure how long it would take or whether it's worth it.

Thanks again,