July 14, 2010 at 8:18 am
I would like to create a function that performs several replaces.
Our collation is case insensitive.
I want to compare two different records to try to determine if they are duplicates. Names and addresses often have different punctuation. I'd like to strip that out so that when I compare two names or addresses, I will find more matches. So O'Malley becomes OMalley, Mary Ann becomes MaryAnn, etc. I'd also like to change letters with tildes, accents, etc. to letters without them. What would be an efficient way to code this? Here's what I've done so far.
CREATE FUNCTION [dbo].[WRT_simple_name]
(
    @LAST_NAME char(30)
)
RETURNS char(30)
AS
BEGIN
    -- Declare the return variable here
    DECLARE @L_NAME char(30)

    -- Strip punctuation and spaces, then map a few accented letters to plain letters
    SET @L_NAME =
        replace(
          replace(
            replace(
              replace(
                replace(
                  replace(
                    replace(
                      replace(
                        replace(
                          replace(
                            replace(rtrim(@LAST_NAME), '''', '')
                          , '.', '')
                        , ' ', '')
                      , '-', '')
                    , '_', '')
                  , '~', '')
                , 'ž', 'z')
              , 'ó', 'o')
            , 'š', 's')
          , 'í', 'i')
        , 'é', 'e')

    -- Return the result of the function
    RETURN @L_NAME
END
I do not intend to ACTUALLY replace the data, I just want this to have more "hits" on comparisons.
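For context, the kind of comparison I have in mind would look something like the query below; the table names (wrt_import_batch, name_master) and column names are only placeholders for illustration, not our real schema:
-- Hypothetical duplicate check: compare the "simplified" names instead of the raw values
SELECT i.last_name  AS incoming_last,
       i.first_name AS incoming_first,
       m.last_name  AS existing_last,
       m.first_name AS existing_first
FROM   wrt_import_batch i
JOIN   name_master m
       ON  dbo.WRT_simple_name(i.last_name)  = dbo.WRT_simple_name(m.last_name)
       AND dbo.WRT_simple_name(i.first_name) = dbo.WRT_simple_name(m.first_name)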
July 14, 2010 at 8:25 am
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string-manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
July 14, 2010 at 9:38 am
Have you tried doing comparisons using the SOUNDEX function?
July 14, 2010 at 10:23 am
rdouglas66 (7/14/2010)
Have you tried doing comparisons using the SOUNDEX function?
I have tried, have you? I don't think it will help much with what the OP is trying to do...
Soundex is just a pitiful attempt at "fuzzy" search:
Soundex('LLoyds') != Soundex('Loyds')
but
Soundex('Brighton') = Soundex('Bristol')
Not much use!
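If you want to see it for yourself, a quick ad-hoc check like this shows the codes the engine actually produces (the exact four-character values depend on the SQL Server version's SOUNDEX implementation):
-- Compare the SOUNDEX codes and the DIFFERENCE score (4 = strongest match, 0 = weakest)
SELECT SOUNDEX('LLoyds')   AS LLoyds_code,
       SOUNDEX('Loyds')    AS Loyds_code,
       SOUNDEX('Brighton') AS Brighton_code,
       SOUNDEX('Bristol')  AS Bristol_code,
       DIFFERENCE('Brighton', 'Bristol') AS Brighton_vs_Bristol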
July 14, 2010 at 10:24 am
Thanks for the suggestions.
While I don't dispute that CLR functions in C# are better, I am currently restricted to T-SQL.
I tried using the soundex function and there were too many false positives.
July 14, 2010 at 10:37 am
Julie, try this function against your data.
It strips out everything that is not a number or a letter, so spaces, control characters, brackets, or whatever get stripped out;
that seems to be what you are after for the comparison you want to try.
You might need to adapt it to substitute 'ó' with 'o', since it currently just strips the 'ó' out.
It's a scalar function, so it's not the fastest, but this sounds like a one-time cleanup and not something that needs to be converted to an iTVF:
usage :
select dbo.StripNonAlphaNumeric(FieldName) ,FieldName from SomeTable
CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
    DECLARE @CleanedText VARCHAR(8000)

    ;WITH tally (N) AS
    (
        SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
        FROM Master.dbo.SysColumns sc1
        CROSS JOIN Master.dbo.SysColumns sc2
    )
    SELECT @CleanedText = ISNULL(@CleanedText, '') +
        CASE
            --ascii numbers are 48 (for '0') thru 57 (for '9')
            WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 48 AND 57
                THEN SUBSTRING(@OriginalText, Tally.N, 1)
            --ascii upper case letters A-Z are 65 thru 90
            WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 65 AND 90
                THEN SUBSTRING(@OriginalText, Tally.N, 1)
            --ascii lower case letters a-z are 97 thru 122
            WHEN ASCII(SUBSTRING(@OriginalText, Tally.N, 1)) BETWEEN 97 AND 122
                THEN SUBSTRING(@OriginalText, Tally.N, 1)
            ELSE ''
        END
    FROM tally
    WHERE Tally.N <= LEN(@OriginalText)

    RETURN @CleanedText
END
Lowell
July 14, 2010 at 12:39 pm
That looks promising, Lowell. I'd need to add a piece for special letters like those with accents, umlauts, etc. I've used tally tables before, thanks to Jeff Moden's well-written article, so I understand how this works.
It's not something for "production", but we do occasionally get large files (50,000 records) to add to our records, so avoiding duplication is important.
Thanks for your suggestion.
July 14, 2010 at 12:56 pm
Joe Celko (7/14/2010)
SQL is not a text language. Get a copy of Melissa Data or another address data scrubbing tool. They will get you to CASS and 9-1-1 Standards much quicker than trying to learn and then code them in your own SQL.
Nah, if you are familiar with any tool, whether SQL or some programming language, there's no reason to buy/load/learn another application when you can use the existing tools to do the same work.
Regardless of whether SQL is a "text" language, the job can be done there quickly and easily, leaving the resulting data in a place where you can use it or update it;
external tools might scrub the data better, but the learning curve to use them, plus the additional cost, would not be worth it for the once-in-a-while check for misspelled duplicates.
Lowell
July 15, 2010 at 12:04 am
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string-manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2010 at 12:30 am
Julie Breutzmann (7/14/2010)
That looks promising, Lowell. I'd need to add a piece for special letters like those with accents, umlauts, etc. I've used tally tables before, thanks to Jeff Moden's well-written article, so I understand how this works. It's not something for "production", but we do occasionally get large files (50,000 records) to add to our records, so avoiding duplication is important.
Thanks for your suggestion.
Hi Julie,
First, thanks for the compliment. Glad to have been of service.
Second, I agree with Joe Celko on this one. CASS certification programs really don't cost that much (less than you'll spend trying to write your own code to some point of perfection), and they come with the side effect of fully validated addresses. In other words, you'll not only be able to search for dupes, but you'll also be able to find out where customers have 1) flat out lied, or 2) don't really know what their correct address is (and it happens more than you can believe), or 3) have phat phingered things so badly that there's no chance of determining whether there's a dupe or not.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2010 at 7:17 am
I would love to purchase such a program. I have one in mind that is partnered with our vendor and would provide numerous benefits to our institution, but it's not gonna happen this year. I work in higher ed and our budget has been cut significantly. To make matters worse, a significant portion of our budget is non-discretionary. 🙁
July 15, 2010 at 2:05 pm
Julie Breutzmann (7/15/2010)
I work in higher ed and our budget has been cut significantly. To make matters worse, a significant portion of our budget is non-discretionary. 🙁
Damn... I'll never understand politicians. :crazy:
Does Lowell's function do the trick for you? I'm a little concerned about performance there because of the scalar nature of the function, but I won't be near a machine with T-SQL on it until Monday, so I can't do much about it for demonstration purposes.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2010 at 2:25 pm
I work at a private college, but it's still politics to some degree. Even though my department head explained this to the cabinet, I'm not sure that they understood, or maybe protecting their own budgets mattered more to them.
I haven't had a chance to play with Lowell's suggestion yet. Things have been too hectic, and though this is important, it's currently not a high priority. I will certainly post to let people know how this works for me.
Julie
July 16, 2010 at 10:44 am
I've modified Lowell's function and performed a time-comparison test. I haven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used an input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 minutes using the function. However, the function DID find 69 matches that the simple comparison did not.
This should be usable for files this size or smaller. Not so sure about larger ones, even though we only deal with large ones a few times a year. I may consider adding a table to the database with the id and the converted first name, last name, and address line 1, with triggers to keep it updated. That way the function only needs to run on the (relatively) smaller input file and should be significantly faster (a rough sketch of that idea is below). I probably won't get to this for a bit.
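Something like the following is what I have in mind for that. It's only a rough, untested sketch; the table, trigger, and column names (wrt_name_search, name_master, first_name, last_name, address_line_1) are placeholders rather than our actual schema, and it calls the modified cleaning function shown next:
-- Rough, untested sketch: persist the cleaned values once, so the scalar function
-- only has to run against the small incoming file instead of the 800,000-row table.
CREATE TABLE dbo.wrt_name_search
(
    id           int          NOT NULL PRIMARY KEY,
    first_name_s varchar(100) NULL,
    last_name_s  varchar(100) NULL,
    addr1_s      varchar(100) NULL
)
GO
CREATE TRIGGER dbo.wrt_name_search_sync
ON dbo.name_master      -- placeholder for the main name/address table
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON
    -- re-derive the cleaned values for the rows that were inserted or changed
    DELETE s
    FROM dbo.wrt_name_search s
    JOIN inserted i ON s.id = i.id

    INSERT dbo.wrt_name_search (id, first_name_s, last_name_s, addr1_s)
    SELECT i.id,
           dbo.wrt_Simple_name_2(i.first_name),
           dbo.wrt_Simple_name_2(i.last_name),
           dbo.wrt_Simple_name_2(i.address_line_1)
    FROM inserted i
END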
Here's the function I ended up with.
CREATE FUNCTION wrt_Simple_name_2(@OriginalText VARCHAR(100))
RETURNS VARCHAR(100)
BEGIN
    DECLARE @CleanedText VARCHAR(100)

    SELECT @CleanedText = ISNULL(@CleanedText, '') +
        CASE
            --ascii numbers are 48 (for '0') thru 57 (for '9')
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 48 AND 57
                THEN SUBSTRING(@OriginalText, wrt_tally.N, 1)
            --ascii upper case letters A-Z are 65 thru 90
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 65 AND 90
                THEN SUBSTRING(@OriginalText, wrt_tally.N, 1)
            --ascii lower case letters a-z are 97 thru 122
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 97 AND 122
                THEN SUBSTRING(@OriginalText, wrt_tally.N, 1)
            --accented "foreign" letters (193 thru 254 here) are translated via the code table;
            --ISNULL guards against a character that has no translation row
            WHEN ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) BETWEEN 193 AND 254
                THEN ISNULL(table_detail.TBL_VALUE_ALT1, '')
            ELSE ''
        END
    FROM wrt_tally
    LEFT OUTER JOIN table_detail
        ON ASCII(SUBSTRING(@OriginalText, wrt_tally.N, 1)) = table_detail.table_value
        AND table_detail.column_name = 'wrt_cm_letter_translation'
    WHERE wrt_tally.N <= LEN(@OriginalText)

    RETURN @CleanedText
END
A few comments:
Any custom tables, views, functions, etc. need to start with 'wrt' per the rules of our database vendor.
wrt_tally is our custom tally table.
table_detail is a code table that I used to "convert" letters such as "a umlaut" to "a" (a rough idea of what it holds is sketched below).
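For anyone without our vendor's schema, a minimal stand-in for that code table might look like this. The structure mirrors the column names used in the function above, but the rows are just examples, not our actual data:
-- Hypothetical stand-in for the vendor's code table used in the join above:
-- table_value holds the character code of the accented letter,
-- TBL_VALUE_ALT1 holds the plain letter to substitute for it.
CREATE TABLE dbo.table_detail
(
    column_name    varchar(50) NOT NULL,
    table_value    int         NOT NULL,
    TBL_VALUE_ALT1 varchar(5)  NOT NULL
)

INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 228, 'a')  -- ä
INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 233, 'e')  -- é
INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 241, 'n')  -- ñ
INSERT dbo.table_detail (column_name, table_value, TBL_VALUE_ALT1) VALUES ('wrt_cm_letter_translation', 246, 'o')  -- ö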
I very much appreciate all the assistance offered.
Julie
July 16, 2010 at 1:13 pm
Julie Breutzmann (7/16/2010)
I've modified Lowell's function and performed a time-comparison test. I haven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used an input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 minutes using the function. However, the function DID find 69 matches that the simple comparison did not.
That's what I was afraid of. There is a way to do this same thing using an iTVF (inline Table Valued Function) and a CROSS APPLY that should speed things up by (possibly) an order of magnitude. The problem is that I won't be at a machine with SQL Server on it until Monday, and any code that I would write would be untested until then.
--Jeff Moden
Change is inevitable... Change for the better is not.
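For readers following along, a rough, untested sketch of the iTVF + CROSS APPLY pattern being described (not Jeff's code) might look like the following. It assumes a permanent dbo.Tally numbers table exists, reuses the name_master placeholder from earlier, and only covers the alphanumeric stripping, not the accented-letter translation:
-- Inline TVF: returns the cleaned string as a one-row, one-column result,
-- built by walking the string with a tally table and concatenating via FOR XML PATH.
CREATE FUNCTION dbo.wrt_CleanString_itvf (@OriginalText varchar(100))
RETURNS TABLE
AS
RETURN
    SELECT CleanedText =
        ( SELECT CASE
                     WHEN ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
                       OR ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 65 AND 90
                       OR ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 97 AND 122
                     THEN SUBSTRING(@OriginalText, t.N, 1)
                     ELSE ''
                 END
          FROM dbo.Tally t
          WHERE t.N <= LEN(@OriginalText)
          ORDER BY t.N
          FOR XML PATH('')
        )
GO
-- Usage: because the function is inline, the optimizer can expand it into the calling
-- query's plan instead of invoking it row by row like a scalar UDF.
SELECT s.last_name, c.CleanedText
FROM dbo.name_master s
CROSS APPLY dbo.wrt_CleanString_itvf(s.last_name) c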