July 14, 2010 at 8:18 am
I would like to create a function that performs several replaces.
Our collation is case insensitive.
I want to compare two different records to try to determine if they are duplicates. Names and addresses often have different punctuation. I'd like to strip that out so that when I compare two names or addresses, I will find more matches. So O'Malley becomes OMalley, Mary Ann becomes MaryAnn, etc. I'd also like to change letters with tildes, accents, etc. to letters without them. What would be an efficient way to code this? Here's what I've done so far.
CREATE FUNCTION [dbo].[WRT_simple_name]
(
    @LAST_NAME char(30)
)
RETURNS char(30)
AS
BEGIN
    -- Declare the return variable here
    DECLARE @L_NAME char(30)

    -- Strip punctuation and spaces, then map a few accented letters to plain letters
    SET @L_NAME =
        replace(
          replace(
            replace(
              replace(
                replace(
                  replace(
                    replace(
                      replace(
                        replace(
                          replace(
                            replace(rtrim(@LAST_NAME), '''', '')
                          , '.', '')
                        , ' ', '')
                      , '-', '')
                    , '_', '')
                  , '~', '')
                , 'ž', 'z')
              , 'ó', 'o')
            , 'š', 's')
          , 'í', 'i')
        , 'é', 'e')

    -- Return the result of the function
    RETURN @L_NAME
END
I do not intend to ACTUALLY replace the data, I just want this to have more "hits" on comparisons.
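For context, the kind of comparison I have in mind would look something like the query below; the table names (wrt_import_batch, name_master) and column names are only placeholders for illustration, not our real schema:
-- Hypothetical duplicate check: compare the "simplified" names instead of the raw values
SELECT i.last_name  AS incoming_last,
       i.first_name AS incoming_first,
       m.last_name  AS existing_last,
       m.first_name AS existing_first
FROM   wrt_import_batch i
JOIN   name_master m
       ON  dbo.WRT_simple_name(i.last_name)  = dbo.WRT_simple_name(m.last_name)
       AND dbo.WRT_simple_name(i.first_name) = dbo.WRT_simple_name(m.first_name)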
July 14, 2010 at 8:25 am
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string-manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
July 14, 2010 at 9:38 am
Have you tried doing comparisons using the SOUNDEX function?
July 14, 2010 at 10:23 am
rdouglas66 (7/14/2010)
Have you tried doing comparisons using the SOUNDEX function?
I have tried, have you? I don't think it will help much with what the OP is trying to do...
Soundex is just a pitiful attempt at "fuzzy" search:
Soundex('LLoyds') != Soundex('Loyds')
but
Soundex('Brighton') = Soundex('Bristol')
Not much use!
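If you want to see it for yourself, a quick ad-hoc check like this shows the codes the engine actually produces (the exact four-character values depend on the SQL Server version's SOUNDEX implementation):
-- Compare the SOUNDEX codes and the DIFFERENCE score (4 = strongest match, 0 = weakest)
SELECT SOUNDEX('LLoyds')   AS LLoyds_code,
       SOUNDEX('Loyds')    AS Loyds_code,
       SOUNDEX('Brighton') AS Brighton_code,
       SOUNDEX('Bristol')  AS Bristol_code,
       DIFFERENCE('Brighton', 'Bristol') AS Brighton_vs_Bristol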
July 14, 2010 at 10:24 am
Thanks for the suggestions.
While I don't dispute that CLR functions in C# are better, I am currently restricted to T-SQL.
I tried using the soundex function and there were too many false positives.
July 14, 2010 at 10:37 am
Julie, try this function against your data.
It strips out everything that is not a number or a letter, so spaces, control characters, brackets, or whatever get stripped out;
that seems to be what you are after for the comparison you want to try.
You might need to adapt it to substitute 'ó' with 'o', since it currently just strips the 'ó' out.
It's a scalar function, so it's not the fastest, but this sounds like a one-time cleanup and not something that needs to be converted to an iTVF:
usage :
select dbo.StripNonAlphaNumeric(FieldName) ,FieldName from SomeTable
CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
    DECLARE @CleanedText VARCHAR(8000)

    ;WITH tally (N) AS
    (
        SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
        FROM Master.dbo.SysColumns sc1
        CROSS JOIN Master.dbo.SysColumns sc2
    )
    SELECT @CleanedText = ISNULL(@CleanedText, '') +
        CASE
            --ascii numbers are 48 (for '0') thru 57 (for '9')
            WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 48 AND 57
                THEN SUBSTRING(@OriginalText, Tally.N, 1)
            --ascii upper case letters A-Z are 65 thru 90
            WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 65 AND 90
                THEN SUBSTRING(@OriginalText, Tally.N, 1)
            --ascii lower case letters a-z are 97 thru 122
            WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 97 AND 122
                THEN SUBSTRING(@OriginalText, Tally.N, 1)
            ELSE ''
        END
    FROM tally
    WHERE Tally.N <= LEN(@OriginalText)

    RETURN @CleanedText
END
Lowell
July 14, 2010 at 12:39 pm
That looks promising, Lowell. I'd need to add a piece for special letters like those with accents, umlauts, etc. I've used tally tables before, thanks to Jeff Moden's well-written article, so I understand how this works.
It's not something for "production", but we do occasionally get large files (50,000 records) to add to our records, so avoiding duplication is important.
Thanks for your suggestion.
July 14, 2010 at 12:56 pm
Joe Celko (7/14/2010)
SQL is not a text language. Get a copy of Melissa Data or another address data scrubbing tool. They will get you to CASS and 9-1-1 Standards much quicker than trying to learn and then code them in your own SQL.
Nah, if you are familiar with any tool, whether SQL or some programming language, there's no reason to buy/load/learn another application when you can use the existing tools to do the same work.
Regardless of whether SQL is a "text" language, the job can be done there quickly and easily, leaving the resulting data in a place where you can use it or update it;
external tools might scrub the data better, but the learning curve to use them, plus the additional cost, would not be worth it for the once-in-a-while check for misspelled duplicates.
Lowell
July 15, 2010 at 12:04 am
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string-manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2010 at 12:30 am
Julie Breutzmann (7/14/2010)
That looks promising, Lowell. I'd need to add a piece for special letters like those with accents, umlauts, etc. I've used tally tables before, thanks to Jeff Moden's well-written article, so I understand how this works. It's not something for "production", but we do occasionally get large files (50,000 records) to add to our records, so avoiding duplication is important.
Thanks for your suggestion.
Hi Julie,
First, thanks for the compliment. Glad to have been of service.
Second, I agree with Joe Celko on this one. CASS certification programs really don't cost that much (less than you'll spend trying to write your own code to some point of perfection), and they come with the side effect of fully validated addresses. In other words, you'll not only be able to search for dupes, but you'll also be able to find out where customers have 1) flat out lied, or 2) don't really know what their correct address is (and it happens more than you can believe), or 3) have phat phingered things so badly that there's no chance of determining whether there's a dupe or not.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2010 at 7:17 am
I would love to purchase such a program. I have one in mind that is partnered with our vendor and would provide numerous benefits to our institution, but it's not gonna happen this year. I work in higher ed and our budget has been cut significantly. To make matters worse, a significant portion of our budget is non-discretionary. 🙁
July 15, 2010 at 2:05 pm
Julie Breutzmann (7/15/2010)
I work in higher ed and our budget has been cut significantly. To make matters worse, a significant portion of our budget is non-discretionary. 🙁
Damn... I'll never understand politicians. :crazy:
Does Lowell's function do the trick for you? I'm a little concerned about performance there because of the scalar nature of the function, but I won't be near a machine with T-SQL on it until Monday, so I can't do much about it for demonstration purposes.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2010 at 2:25 pm
I work at a private college, but it's still politics to some degree. Even though my department head explained this to the cabinet, I'm not sure that they understood, or maybe protecting their own budgets mattered more to them.
I haven't had a chance to play with Lowell's suggestion yet. Things have been too hectic, and though this is important, it's currently not a high priority. I will certainly post to let people know how this works for me.
Julie
July 16, 2010 at 10:44 am
I've modified Lowell's function and performed a time-comparison test. I haven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used an input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 minutes using the function. However, the function DID find 69 matches that the simple comparison did not.
This should be usable for files this size or smaller. Not so sure about larger ones, even though we only deal with large ones a few times a year. I may consider adding a table to the database with the id and the converted first name, last name, and address line 1, with triggers to keep it updated. That way the function only needs to run on the (relatively) smaller input file and should be significantly faster (a rough sketch of that idea is below). I probably won't get to this for a bit.
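Something like the following is what I have in mind for that. It's only a rough, untested sketch; the table, trigger, and column names (wrt_name_search, name_master, first_name, last_name, address_line_1) are placeholders rather than our actual schema, and it calls the modified cleaning function shown next:
-- Rough, untested sketch: persist the cleaned values once, so the scalar function
-- only has to run against the small incoming file instead of the 800,000-row table.
CREATE TABLE dbo.wrt_name_search
(
    id           int          NOT NULL PRIMARY KEY,
    first_name_s varchar(100) NULL,
    last_name_s  varchar(100) NULL,
    addr1_s      varchar(100) NULL
)
GO
CREATE TRIGGER dbo.wrt_name_search_sync
ON dbo.name_master      -- placeholder for the main name/address table
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON
    -- re-derive the cleaned values for the rows that were inserted or changed
    DELETE s
    FROM dbo.wrt_name_search s
    JOIN inserted i ON s.id = i.id

    INSERT dbo.wrt_name_search (id, first_name_s, last_name_s, addr1_s)
    SELECT i.id,
           dbo.wrt_Simple_name_2(i.first_name),
           dbo.wrt_Simple_name_2(i.last_name),
           dbo.wrt_Simple_name_2(i.address_line_1)
    FROM inserted i
END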
Here's the function I ended up with.
CREATE FUNCTION wrt_Simple_name_2(@OriginalText VARCHAR(100))
RETURNS VARCHAR(100)
BEGIN
    DECLARE @CleanedText VARCHAR(100)

    SELECT @CleanedText = ISNULL(@CleanedText, '') +
        CASE
            --ascii numbers are 48 (for '0') thru 57 (for '9')
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 48 AND 57
                THEN SUBSTRING(@OriginalText, wrt_tally.N, 1)
            --ascii upper case letters A-Z are 65 thru 90
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 65 AND 90
                THEN SUBSTRING(@OriginalText, wrt_tally.N, 1)
            --ascii lower case letters a-z are 97 thru 122
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 97 AND 122
                THEN SUBSTRING(@OriginalText, wrt_tally.N, 1)
            --accented "foreign" letters (193 thru 254 here) are translated via the code table;
            --ISNULL guards against a character that has no translation row
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 193 AND 254
                THEN ISNULL(table_detail.TBL_VALUE_ALT1, '')
            ELSE ''
        END
    FROM wrt_tally
    LEFT OUTER JOIN table_detail
        ON ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) = table_detail.table_value
        AND table_detail.column_name = 'wrt_cm_letter_translation'
    WHERE wrt_tally.N <= LEN(@OriginalText)

    RETURN @CleanedText
END
A few comments:
Any custom tables, views, functions, etc. need to start with 'wrt' per the rules of our database vendor.
wrt_tally is our custom tally table.
table_detail is a code table that I used to "convert" letters such as "a umlaut" to "a" (a rough idea of what it holds is sketched below).
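For anyone without our vendor's schema, a minimal stand-in for that code table might look like this. The structure mirrors the column names used in the function above, but the rows are just examples, not our actual data:
-- Hypothetical stand-in for the vendor's code table used in the join above:
-- table_value holds the character code of the accented letter,
-- TBL_VALUE_ALT1 holds the plain letter to substitute for it.
CREATE TABLE dbo.table_detail
(
    column_name    varchar(50) NOT NULL,
    table_value    int         NOT NULL,
    TBL_VALUE_ALT1 varchar(5)  NOT NULL
)

INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 228, 'a')  -- ä
INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 233, 'e')  -- é
INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 241, 'n')  -- ñ
INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 246, 'o')  -- ö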
I very much appreciate all the assistance offered.
Julie
July 16, 2010 at 1:13 pm
Julie Breutzmann (7/16/2010)
I've modified Lowell's function and performed a time-comparison test. I haven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used an input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 minutes using the function. However, the function DID find 69 matches that the simple comparison did not.
That's what I was afraid of. There is a way to do this same thing using an iTVF (inline Table Valued Function) and a CROSS APPLY that should speed things up by (possibly) an order of magnitude. The problem is that I won't be at a machine with SQL Server on it until Monday, and any code that I would write would be untested until then.
--Jeff Moden
Change is inevitable... Change for the better is not.
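For readers following along, a rough, untested sketch of the iTVF + CROSS APPLY pattern being described (not Jeff's code) might look like the following. It assumes a permanent dbo.Tally numbers table exists, reuses the name_master placeholder from earlier, and only covers the alphanumeric stripping, not the accented-letter translation:
-- Inline TVF: returns the cleaned string as a one-row, one-column result,
-- built by walking the string with a tally table and concatenating via FOR XML PATH.
CREATE FUNCTION dbo.wrt_CleanString_itvf (@OriginalText varchar(100))
RETURNS TABLE
AS
RETURN
    SELECT CleanedText =
        ( SELECT CASE
                     WHEN ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
                       OR ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 65 AND 90
                       OR ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 97 AND 122
                     THEN SUBSTRING(@OriginalText, t.N, 1)
                     ELSE ''
                 END
          FROM dbo.Tally t
          WHERE t.N <= LEN(@OriginalText)
          ORDER BY t.N
          FOR XML PATH('')
        )
GO
-- Usage: because the function is inline, the optimizer can expand it into the calling
-- query's plan instead of invoking it row by row like a scalar UDF.
SELECT s.last_name, c.CleanedText
FROM dbo.name_master s
CROSS APPLY dbo.wrt_CleanString_itvf(s.last_name) c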