String Replacement using Table Values

Question

String Replacement using Table Values

oleson

SSC-Addicted

Points: 424
More actions
March 27, 2008 at 4:12 pm

#188550

Ok, I'm able to get a good translation of data that is all upper case to title case ("THE WORLD" to "The World"). Problem is, some of the upper case values have acronyms in them, and I need to set them back to all upper case after the title case conversion happens. The code I started to use to do this is:
create function titleCase(@Text as varchar(8000))
returns varchar(8000)
as
begin
declare @Reset bit;
declare @Ret varchar(8000);
declare @i int;
declare @C char(1);
select @Reset = 1, @i=1, @Ret = '';
while (@i <= len(@Text))
select @C= substring(@Text,@i,1),
@Ret = @Ret + case when @Reset=1 then UPPER(@c) else LOWER(@c) end,
@Ret = replace(@Ret, '_', ' '),
@Ret = replace(@Ret, 'Rsms', 'RSMS'),
@Ret = replace(@Ret, 'Aip', 'AIP'),
@Ret = replace(@Ret, 'Arp', 'ARP'),
@Ret = replace(@Ret, 'Av ', 'AV '),
@Ret = replace(@Ret, 'Cnrp', 'CNRP'),
@Ret = replace(@Ret, 'Dnl', 'DNL'),
@Ret = replace(@Ret, 'Gps', 'GPS'),
@Ret = replace(@Ret, 'Mc', 'MC'),
@Ret = replace(@Ret, 'Vars', 'VARS'),
@Ret = replace(@Ret, 'Nvlap', 'NVLAP'),
@Ret = replace(@Ret, 'Nro', 'NRO'),
@Ret = replace(@Ret, 'Oo', 'OO'),
@Ret = replace(@Ret, 'Wds', 'WDS'),
@Ret = replace(@Ret, 'Qc', 'QC'),
@Ret = replace(@Ret, 'Rfi', 'RFI'),
@Ret = replace(@Ret, ' Ca ', ' CA '),
@Reset = case when @C like '[a-zA-Z]' then 0 else 1 end,
@i = @i +1
return @Ret
end
go
However, this seems a little too static in case an acronym is added, etc. So what I did is create an acronym table ("METADATA_ACRONYM") and I want to read two values from it, the bad capitalization and the correct acronym capitalization (Gps to GPS, etc.) I'm trying to figure out how to replace the static code above with an iterative process that happens each time. Any suggestions? Was this a detailed enough request, or is more info needed. Thanks in advance for any suggestions.
Brad

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

Since you're using the replace command (and I can't think of another way to do what you need), I think your best bet is a cursor to step through the accronyms table and run a replace on each one.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 2

This is one of those cases where CLR would likely be substantially more efficient than the loop. First - you could leverage the "Proper Case" syntax, so that you don't have to do this "by hand". And then - apply the replacements in order. I'd recommend keeping the replacements in a table, and passing them to the CLR either as XML (not bad), or a delimited list (probably better).

the CLR function would then use the delimited list to apply the replacements in sequence.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

oleson SSC-Addicted Points: 424 More actions · Answer 3

Hmmm...ok, I understand what you are saying in concept, but my experience here is probably more than beginner but less than intermediate, so I need more guidance on how you would implement what you suggest. I understand the above cursor suggestion a little better, and have actually played with implementing it that way, but not completely successful yet. Either way I might see what other guidance you can provide. 😐

Brad

GSquared SSC Guru Points: 260824 More actions · Answer 4

I can't speak for the CLR implementation Matt is suggesting, but the cursor would look something like:

declare C1 cursor local fast_forward for

select Wrong, Right

from Metadata_Acronyms

open c1

declare @sql nvarchar(1000), @Wrong nvarchar(100), @Right nvarchar(100)

fetch next from c1

into @wrong, @right

while @@fetch_status > 0

begin

select @sql = n'update dbo.Table set Field = replace(field, @wrong, @right)'

exec sp_executesql @sql, n'@wrong nvarchar(100), @right nvarchar(100)', @wrong, @right

fetch next from c1

into @wrong, @right

end

(I haven't written a lot of dynamic SQL, but I think this will work.)

To write a CLR, you'll need to write it in on of the .NET languages (VB.NET, C#.NET, etc.). I'm in the process of teaching myself how to do that, but can't help you on that solution yet.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

oleson SSC-Addicted Points: 424 More actions · Answer 5

Thanks, I will give that a shot and let you know the results.

On a similar theme...is it possible to run spell check of some sort on the data in a column? I can look manually through all 3992+ records to determine where acronyms may have been mis-capitalized, but if I could spell check things it would immediately help me focus on the acronyms.

Thoughts?

Brad

GSquared SSC Guru Points: 260824 More actions · Answer 6

I'm sure it's possible, but I've never seen anything that would do it.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 7

This would be the CLR user-defined function you'd need:

Imports System

Imports System.Data

Imports System.Data.SqlClient

Imports System.Data.SqlTypes

Imports Microsoft.SqlServer.Server

Imports System.Text.RegularExpressions

Partial Public Class UserDefinedFunctions

<Microsoft.SqlServer.Server.SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _

Public Shared Function RegexReplace(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal rep As SqlString) As SqlString

' Add your code here

Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern.Value)

If input.IsNull Then

Return New SqlString(SqlString.Null.Value)

Else

Return New SqlString(rex.Replace(New String(input.Value), rep.Value))

End If

End Function

<Microsoft.SqlServer.Server.SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _

Public Shared Function MyCustomCase(ByVal input As SqlChars, ByVal Keeplist As SqlString, ByVal separator As SqlString) As SqlString

Dim tmp As String

Dim i As Integer

Dim rex_inner As System.Text.RegularExpressions.Regex

Dim keeps As String()

Dim keepone As String

Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(separator.Value)

keeps = rex.Split(Keeplist.Value)

tmp = StrConv(New String(input.Value), VbStrConv.ProperCase) & " "

For i = 0 To keeps.GetUpperBound(0)

If Len(keeps(i)) > 0 Then

keepone = "\s" & keeps(i) & "\s"

rex_inner = New System.Text.RegularExpressions.Regex(keepone, RegexOptions.IgnoreCase)

tmp = rex_inner.Replace(tmp, " " & StrConv(keeps(i), VbStrConv.Uppercase) & " ")

End If

Next

Return Left(tmp, tmp.Length - 1)

End Function

End Class

At that point, once you deploy that to your server, you could use it like this:

declare @teststring varchar(2000)

set @teststring=' this is the test of the emergency IBM broadcast system. gm gmbastor lotus';

declare @keeplist varchar(2000)

set @keeplist='IBM||GM||LOTUS||';

declare @separator varchar(5)

set @separator='\|\|';

select dbo.mycustomCase(@teststring,@keeplist,@separator)

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 8

Spell-checking is not available i think. It involves something interactive on a .NET control (the CLR we're talking here is not interactive and not user-control-based.)

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 9

As far as "building" the list you'd want for the "Acronyms to keep" list - use any one of the solutions identified here:

http://www.sqlservercentral.com/articles/Test+Data/61572/[/url]

Since I assume you want to run this against a table, the "Keep list" should be built once, and then used after that. Cycling on each character and replacement pattern for each row is going to be REALLY slow.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Jeff Moden SSC Guru Points: 1004405 More actions · Answer 10

... and, no matter how you shape it, a cursor or While Loop is going to kill performance.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 11

Jeff Moden (3/31/2008)
... and, no matter how you shape it, a cursor or While Loop is going to kill performance.

And just to prove that:

Here's the test data:

drop table matt1

create table matt1(

id int identity(1,1) primary key clustered,

strlong varchar(1000),

strlongproper varchar(1000))

go

insert matt1(strlong)

select top 10000

replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20) +' '+

replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+' apple '+

replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'ibm '+

replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'lotus '+

replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'lotus'

from

sys.all_columns sc1,

sys.all_columns sc2

drop table matt2

select * --make a copy for the second test

into matt2

from matt1

Now - the test script:

set nocount on

--====Some Declarations

declare @g datetime;

declare @keeplist varchar(100);

declare @term varchar(6);

set @term='\|\|';

set @keeplist='IBM||Lotus||apple||gm||GE';

--==Set up the test header

print replicate('=',60)

print 'Proper+regex'

print replicate('=',60)

set @g=getdate();

update matt1

set strlongproper=dbo.MyCustomCase(strlong,@keeplist,@term);

--print the time

print datediff(ms,@g,getdate());

go

--====Test #2

--====Some Declarations

declare @Reset bit;

declare @text varchar(1000);

declare @Ret varchar(1000);

declare @id int;

declare @i int;

declare @C char(1);

declare @g datetime;

print replicate('=',60)

print 'Cursor loop'

print replicate('=',60)

set @g=getdate();

--===The Cursor

declare curse cursor local fast_forward for

select id, strlong from matt2

open curse

fetch next from curse into @id,@text;

--===The While Loop

While @@fetch_status=0

BEGIN

select @Reset = 1, @i=1, @Ret = '';

while (@i <= len(@Text))

BEGIN

select @C= substring(@Text,@i,1),

@Ret = @Ret + case when @Reset=1 then UPPER(@c) else LOWER(@c) end,

@Reset = case when @C like '[a-zA-Z]' then 0 else 1 end,

@i = @i +1

--EDIT - moved the second set of replaces out of the loop - only need to run one time

END

--DONE EDIT - moved the second set of replaces out of the loop - only need to run one time

select

@Ret = replace(@Ret, '_', ' '),

@Ret = replace(@Ret, ' Lotus ', ' LOTUS '),

@Ret = replace(@Ret, ' Ibm ', ' IBM '),

@Ret = replace(@Ret, ' Apple ', 'APPLE '),

@Ret = replace(@Ret, ' Gm ', ' GM '),

@Ret = replace(@Ret, ' Ge ', ' GE ')

--stored the value we just created

update matt2

set strlongproper=@ret

where ID=@id

fetch next from curse into @id,@text;

END

close curse

deallocate curse

print datediff(ms,@g,getdate());

set nocount off

And the results:

============================================================

Proper+regex

============================================================

4486

============================================================

Cursor loop

============================================================

215793

By the way - that 48 times faster, and that's only at 10,000 rows. The bigger the set, the slower the cursor seems to get.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

oleson SSC-Addicted Points: 424 More actions · Answer 12

Thanks, guys. I can see that the cursor solution will be slower, so I will work on implementing it the CLR way. I'll give a shout if I have other questions regarding anything there. I really appreciate your taking time to respond!

Brad

Kurt W. Zimmerman SSCrazy Eights Points: 8676 More actions · Answer 13

Greetings all... I had an issue that came up some time ago in my SS 2000 box. I was able to find an extended function that allowed me to use a regular expression to find a text sequence in a string and remove it.

Too bad it can't be ported over to SS 2005. As someone else has pointed out it isn't the fastest phase of the job but it does work.

More specifically I have email address in the body of text that has to be removed. Process takes about 1 hr per 3000 rows of data.

Kurt W. Zimmerman
SR DBA
Lefrak Organization
New York, NY

http://www.linkedin.com/in/kurtwzimmerman

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 14

Kurt W. Zimmerman (4/1/2008)
Greetings all... I had an issue that came up some time ago in my SS 2000 box. I was able to find an extended function that allowed me to use a regular expression to find a text sequence in a string and remove it.
Too bad it can't be ported over to SS 2005. As someone else has pointed out it isn't the fastest phase of the job but it does work.
More specifically I have email address in the body of text that has to be removed. Process takes about 1 hr per 3000 rows of data.

Kurt - that piece of code above DOES use built-in Regex functionality (it's in the CLR function). Regex used that way is truly fast in 2005 (at least 3-5 times faster than the XP version from 2000 from what I've seen). And it blows the doors off of the cursor. We ran a test like that previously and 3000 e-mail addresses would be done sub-second. Depending on how big your "body of text" is, I truly doubt it would take 30 seconds, let alone a minute.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?