Query Question

  • I have worked out a process that incorporates data from our company sources into a Vendor Table and incorporates information from Regulatory agencies (OFAC, DPW, etc...) into the Regulatory table.  I have to do this each month.  Let’s say that I have the two tables—one containing Vendor information and the other table which is the Regulatory information.  Both tables have primary keys, but they are not tax id's or social security numbers (this would be much easier if this were the case).  I wish to join these two files on the Business_Name field in each table.  In cases where the names match perfectly, there is no problem, results are returned.  The only problem with this is that these are names—you could have the following

    Vendor Table                                      Regulatory Table

    Pharmacy Shoppe                                 The Pharmacy Shoppe

    Burger King                                           Berger King

    The Drug Store                                       Drug Store, Inc.

     

    How can I run a query that would return matches such as these?  I had a suggestion to create a view that would parse out each word in the Business Name in both tables to another table--one word per field.  Then create joins that join Field1 to Field 2, Field1 to Field3, Field1 to Field 4, Field2 to Field 3, and so on.  Is there an easier way?  I would not know the max number of fields in this view, nor would I know the max number of joins.  It could change from month to month as the new data is incorporated.

    Any suggestions would be greatly appreciated.

    Thanks,

    Paula


    Paula

  • This is an interesting one...  I do not think you are going to find an "ideal" solution.  The best answer is for you to find something like a DUNS number that is unique for each business entity you are dealing with. 

    Lacking that, you might check into using some sort of soundex routine. 

    Another approach, modeled after search engine lookup algorithms, is to create an additional column in each table and attempt to normalize the business names.  For example, remove "the", "and", "inc", punctuation marks and so on.  You can then attempt to match the values but I suspect you will have limited success due to misspellings.  The next step I would take is to create the equivalent of a checksum of the normalized values in each table.  Matching on the checksums will bring you a few more matches, but again will likely not be a complete solution.  The goal in this set of steps is to reduce the set of unmatched values to a small enough number that you can finish the job manually. 

    Because this appears to be a repeated job (monthly as you mention) the right solution really is to define a unique identifier for each entity and apply that identifier consistently within your system.

    If all else fails, create a mapping table that is maintained by a low paid secretary.  This may well be the most efficient, cost-effective solution.

    Hope this helps

    Wayne

  • Unfortunately, I get the Regulatory table from government sources.  So, I am at their mercy in terms of what information I am provided. They only provide the name and address--no DUNNS number, no Tax ID no. etc...

     

    What is a soundex routine?


    Paula

  • Following is a brief snippet from the soundex topic in BOL.  in your situation, you could attempt to match the values by including something like SOUNDEX(columnA) = SOUNDEX(columnB) in the select statement.  Soundex is not a perfect solution, but may be "good enough" for your situation. 

    Wayne

    ===== snippet starts here ======

    This example shows the SOUNDEX function and the related DIFFERENCE function. In the first example, the standard SOUNDEX values are returned for all consonants. Returning the SOUNDEX for Smith and Smythe returns the same SOUNDEX result because all vowels, the letter y, doubled letters, and the letter h, are not included.

    -- Using SOUNDEXSELECT SOUNDEX ('Smith'), SOUNDEX ('Smythe')

    Here is the result set:

    ----- ----- S530  S530  (1 row(s) affected)

    The DIFFERENCE function compares the difference of the SOUNDEX pattern results. The first example shows two strings that differ only in vowels. The difference returned is 4 (lowest possible difference).

    -- Using DIFFERENCESELECT DIFFERENCE('Smithers', 'Smythers')GO

    Here is the result set:

    ----------- 4   
     
  • soundex and difference are functions to help you rank the proximity in pronunciation but I believe that you are following the wrong path!!!

    You need to establish a set of rules at both ends or the comunication WON'T work

    It will make extremely difficult task  something that could have been a simple matter

    just my $0.02

     


    * Noel

  • I completely agree with you Noel.  The right answer really is to find or define identifiers that are unique to each business entity.  All of the other possibilities that were mentioned are alternative approaches and none of them are recommended. 

    Having said that, lets assume that there are no unique IDs.  Your job depends on figuring out some approach.  How would you do it?

    Wayne

  • There are no infinite situations some are more cumbersome than others but here is what "I" would do.

    I will give a list of apporved (locally) codes/Numbers to the senders and create a lookup table with sender , code and Actual business Name.

    Then I'll make MANDATORY the code in the sender file - no code in no data in

    at import time use the lookup

     


    * Noel

  • I don't understand Noel's suggestion.  My situation is that I get files (6) from different Gov agencies--OFAC, GSA, DPW, HAP, HAP Nursing, etc... I have absolutely no control over the content of these files.  As it stands I have devised an elaborate series of DTS packages and transformations that parse this data into appropriate columns--business names vs. people's names.

    The gov. agencies do not give a unique identifier; therefore, I must do the join on the BusinessName field.  Basically, I want to ensure that I am doing due diligence with the searches as we are checking our company vendor and HR lists against these government sanctioned lists.  For example, if we get a match--this means that our employee or our vendor is on a gov sanctioned list and we may have to terminate our relationship with the vendor or employee.  We are a large healthcare provider and can not technically bill certain gov agencies for services if you are in a relationship with a sanctioned ee or vendor.


    Paula

  • >>I have absolutely no control over the content of these files<<

    If you can't establish rules around what you recieve from those agencies it will be extremelly difficult to build a successful solution!

    My point is that you should try to get them to send you a code an identifier or something else that is not just a typed thing.

    - If you can't  then get a dataentry person because otherwise you will be called every time an import performed the wrong or not match at all. (and I guarratee you, it will be very very frequently )

     


    * Noel

  • Paula,

    Thanks for the additional background.  This helps frame the problem a little better.

    Can you provide an idea of the volume you are dealing with on a monthly basis?  For example, are you looking at a total of 10 companies on the list?  1,000?  10,000?

    I understand that you will not get unique identifiers from the government.  Can you define a table that will be updated monthly by adding any new business names?  If so then you can assign your own unique internal identifier to each business when it arrives.  The first time this table is populated you will have many business to deal with.  However, after the initial load, I suspect you will less than 2 % increase in size per month (I used to live in your world).

    The table might look something like this

    create table TestTbl (

    InternalBusinessID        INT IDENTITY(1,1),

    OriginalBusinessName      NVARCHAR(1000),

    InternalBusinessName      NVARCHAR(1000)

    )

    The primary key woul dbe the combination of OriginalBusinessName and InternalBusinessName. InternalBusinessName should be unique.

    Define a process for creating an InternalBusinessName.  This process will likely be manual the first time, and may not ever be automated due to the issues discussed in other posts.  By applying the same rules to your vendor tables, you will be able to match the InternalBusinessName to your existing trading partners and vendors.

    All of the options listed above may be applied to creating the InternalBusinessName for future matching.  However, there are a number of commercial packages and businesses that make a fair amount of money by solving parts of this problem.  I am not aware of any solution that can solve this problem in a completely automated manner.

    The summary version of this post is to create a mapping table, probably maintained with some amount of manual effort, and use that mapping table in your other queries.

    Havea good day

    Wayne

  • Slacker,  It sounds like you problem relates to establishing business rules (which have been ok’d by your legal department) rather than a problem that can be solved by SQL.

    Let me restate your problem to see if I understand what you need. Various government agencies send you lists of companies. To conform to government regulations you cannot bill for some services if you have a relationship with any of these companies. You want to make sure that you satisfy the requirements under due diligence. (Due Diligence is the care that a reasonable person exercises under the circumstances to avoid breaking the rules or causing harm. )

    The type and form of the information sent by the agencies varies by agencies and you have no control over the way data is received. You also have no way of knowing if data that appears to be input correctly is actually an error. There could be a business named Berger King on their list. Yours is not to reason why, just work with what they give you. Unless you get a hit between your db and theirs there is no problem. I would establish business rules that state that no vendor will be dropped for a conflict of interest until a through check has been made with all parties. ( i.e. When you get a hit have someone check it out, other wise forget it)  

    A business maxim is that you cannot attach responsibility to actions over which the person has no control. I would check with the legal department and ask them if “due diligence” is met when you have searched on the information provided without trying to scrub the government’s data. (they probably would not make the corrections even if you sent them the information) GIGO, garbage in garbage out applies to the government just like everyone else.

    If the legal department wants you to go further use a soundex utility. You will get more hits which will have to be manually checked.

    Mike

    edited to add the last paragraph sloppy cut and paste on my part

  • Slackr,

    You may create a word matching function as below and use the count of words matching to join the tables.

    alter function Matching_Words_Count(@s1 varchar(1000), @s2 varchar(1000))

    Returns int

    Begin

    Declare @word varchar(50)

    Declare @ctr int, @match int

    Select @s1 = Ltrim(Rtrim(@s1)), @s2 = Ltrim(Rtrim(@s2)), @word = '', @match=0, @ctr=1

     While @ctr<=len(@s1)

     Begin

      --Check for space

         If(Substring(@s1,@ctr,1)=char(32))

         Begin

              While(Substring(@s1,@ctr,1)=char(32)) Set @ctr = @ctr + 1

              If @word NOT IN ('the','inc.,','ltd','co.,','of')

             AND charindex(@word,@s2)> 0

              Set @match=@match+1

              Set @word = ''

         End

         Set @word = @word + Substring(@s1,@ctr,1)

         Set @ctr = @ctr + 1

     End

    If @word NOT IN ('the','inc.,','ltd','co.,','of')  If charindex(@word,@s2)> 0 Set @match=@match+1

    Return @match

    End

    --Test the function

    select dbo.Matching_Words_Count('the american drug store ','the drug store inc')

    returns 2

     

    You may fix the word count to 2 or 3 and the match the tables as :               

    Select [Vendor Name],[Regulatory Name] From [Vendors] V LEFT OUTER JOIN [Regulatory] R ON dbo.Matching_Words_Count(v.[vendor name],R.[regulatory Name]) >=2

    Hope it serves your purpose.

    Regards,

    Hari

     

     


    Kindest Regards,

    Hari

  • I agree with Mike's statements. You cannot assume Berger King and Burger King are the same company.

    I have seen on a few occasions that names can look similar for common compnay types but actually be different companies.

    I suggest as well you point these issues out to your legal department and allow them to determine what course of action has to be taken and suggest they try to give a push on the source provider.

    If they however say you have to work with as is then I would create a table for confirmed misspellings to correct spelling. As you find items without a match check throughly that it is a misspelling and add it to your table to increase matching results.

    You might even cross check the data itself as you stated there are addresses included, check unknown spellings addresses to known addresses and if a match consider confirmed and put into the bridge table.

    All and all thou you might have to ask for a contract person to assist until you get all items mapped in and have an acceptable level of lookups required. I would also suggest you create a table for spellings that have been confiormed outside scope to eliminate those that would not apply and prevent the need to recheck.

    Assume nothin until you physically check and make sure your legal dept is aware so they can make plans on how to deal with any discrepancies due to the poor quality of the source data.

  • FWIW, here are some links:

    http://www.microsoft.com/india/msdn/articles/116.aspx

    http://www.codeproject.com/database/dmetaphone4.asp

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply