String Replacement using Table Values

  • Ok, I'm able to get a good translation of data that is all upper case to title case ("THE WORLD" to "The World"). Problem is, some of the upper case values have acronyms in them, and I need to set them back to all upper case after the title case conversion happens. The code I started to use to do this is:

    create function titleCase(@Text as varchar(8000))

    returns varchar(8000)

    as

    begin

    declare @Reset bit;

    declare @Ret varchar(8000);

    declare @i int;

    declare @C char(1);

    select @Reset = 1, @i=1, @Ret = '';

    while (@i <= len(@Text))

    select @C= substring(@Text,@i,1),

    @Ret = @Ret + case when @Reset=1 then UPPER(@c) else LOWER(@c) end,

    @Ret = replace(@Ret, '_', ' '),

    @Ret = replace(@Ret, 'Rsms', 'RSMS'),

    @Ret = replace(@Ret, 'Aip', 'AIP'),

    @Ret = replace(@Ret, 'Arp', 'ARP'),

    @Ret = replace(@Ret, 'Av ', 'AV '),

    @Ret = replace(@Ret, 'Cnrp', 'CNRP'),

    @Ret = replace(@Ret, 'Dnl', 'DNL'),

    @Ret = replace(@Ret, 'Gps', 'GPS'),

    @Ret = replace(@Ret, 'Mc', 'MC'),

    @Ret = replace(@Ret, 'Vars', 'VARS'),

    @Ret = replace(@Ret, 'Nvlap', 'NVLAP'),

    @Ret = replace(@Ret, 'Nro', 'NRO'),

    @Ret = replace(@Ret, 'Oo', 'OO'),

    @Ret = replace(@Ret, 'Wds', 'WDS'),

    @Ret = replace(@Ret, 'Qc', 'QC'),

    @Ret = replace(@Ret, 'Rfi', 'RFI'),

    @Ret = replace(@Ret, ' Ca ', ' CA '),

    @Reset = case when @C like '[a-zA-Z]' then 0 else 1 end,

    @i = @i +1

    return @Ret

    end

    go

    However, this seems a little too static in case an acronym is added, etc. So what I did is create an acronym table ("METADATA_ACRONYM") and I want to read two values from it, the bad capitalization and the correct acronym capitalization (Gps to GPS, etc.) I'm trying to figure out how to replace the static code above with an iterative process that happens each time. Any suggestions? Was this a detailed enough request, or is more info needed. Thanks in advance for any suggestions.

    Brad

  • Since you're using the replace command (and I can't think of another way to do what you need), I think your best bet is a cursor to step through the accronyms table and run a replace on each one.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • This is one of those cases where CLR would likely be substantially more efficient than the loop. First - you could leverage the "Proper Case" syntax, so that you don't have to do this "by hand". And then - apply the replacements in order. I'd recommend keeping the replacements in a table, and passing them to the CLR either as XML (not bad), or a delimited list (probably better).

    the CLR function would then use the delimited list to apply the replacements in sequence.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Hmmm...ok, I understand what you are saying in concept, but my experience here is probably more than beginner but less than intermediate, so I need more guidance on how you would implement what you suggest. I understand the above cursor suggestion a little better, and have actually played with implementing it that way, but not completely successful yet. Either way I might see what other guidance you can provide. 😐

    Brad

  • I can't speak for the CLR implementation Matt is suggesting, but the cursor would look something like:

    declare C1 cursor local fast_forward for

    select Wrong, Right

    from Metadata_Acronyms

    open c1

    declare @sql nvarchar(1000), @Wrong nvarchar(100), @Right nvarchar(100)

    fetch next from c1

    into @wrong, @right

    while @@fetch_status > 0

    begin

    select @sql = n'update dbo.Table set Field = replace(field, @wrong, @right)'

    exec sp_executesql @sql, n'@wrong nvarchar(100), @right nvarchar(100)', @wrong, @right

    fetch next from c1

    into @wrong, @right

    end

    (I haven't written a lot of dynamic SQL, but I think this will work.)

    To write a CLR, you'll need to write it in on of the .NET languages (VB.NET, C#.NET, etc.). I'm in the process of teaching myself how to do that, but can't help you on that solution yet.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks, I will give that a shot and let you know the results.

    On a similar theme...is it possible to run spell check of some sort on the data in a column? I can look manually through all 3992+ records to determine where acronyms may have been mis-capitalized, but if I could spell check things it would immediately help me focus on the acronyms.

    Thoughts?

    Brad

  • I'm sure it's possible, but I've never seen anything that would do it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • This would be the CLR user-defined function you'd need:

    Imports System

    Imports System.Data

    Imports System.Data.SqlClient

    Imports System.Data.SqlTypes

    Imports Microsoft.SqlServer.Server

    Imports System.Text.RegularExpressions

    Partial Public Class UserDefinedFunctions

    <Microsoft.SqlServer.Server.SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _

    Public Shared Function RegexReplace(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal rep As SqlString) As SqlString

    ' Add your code here

    Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern.Value)

    If input.IsNull Then

    Return New SqlString(SqlString.Null.Value)

    Else

    Return New SqlString(rex.Replace(New String(input.Value), rep.Value))

    End If

    End Function

    <Microsoft.SqlServer.Server.SqlFunction(IsDeterministic:=True, IsPrecise:=True)> _

    Public Shared Function MyCustomCase(ByVal input As SqlChars, ByVal Keeplist As SqlString, ByVal separator As SqlString) As SqlString

    Dim tmp As String

    Dim i As Integer

    Dim rex_inner As System.Text.RegularExpressions.Regex

    Dim keeps As String()

    Dim keepone As String

    Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(separator.Value)

    keeps = rex.Split(Keeplist.Value)

    tmp = StrConv(New String(input.Value), VbStrConv.ProperCase) & " "

    For i = 0 To keeps.GetUpperBound(0)

    If Len(keeps(i)) > 0 Then

    keepone = "\s" & keeps(i) & "\s"

    rex_inner = New System.Text.RegularExpressions.Regex(keepone, RegexOptions.IgnoreCase)

    tmp = rex_inner.Replace(tmp, " " & StrConv(keeps(i), VbStrConv.Uppercase) & " ")

    End If

    Next

    Return Left(tmp, tmp.Length - 1)

    End Function

    End Class

    At that point, once you deploy that to your server, you could use it like this:

    declare @teststring varchar(2000)

    set @teststring=' this is the test of the emergency IBM broadcast system. gm gmbastor lotus';

    declare @keeplist varchar(2000)

    set @keeplist='IBM||GM||LOTUS||';

    declare @separator varchar(5)

    set @separator='\|\|';

    select dbo.mycustomCase(@teststring,@keeplist,@separator)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Spell-checking is not available i think. It involves something interactive on a .NET control (the CLR we're talking here is not interactive and not user-control-based.)

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • As far as "building" the list you'd want for the "Acronyms to keep" list - use any one of the solutions identified here:

    http://www.sqlservercentral.com/articles/Test+Data/61572/[/url]

    Since I assume you want to run this against a table, the "Keep list" should be built once, and then used after that. Cycling on each character and replacement pattern for each row is going to be REALLY slow.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • ... and, no matter how you shape it, a cursor or While Loop is going to kill performance.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/31/2008)


    ... and, no matter how you shape it, a cursor or While Loop is going to kill performance.

    And just to prove that:

    Here's the test data:

    drop table matt1

    create table matt1(

    id int identity(1,1) primary key clustered,

    strlong varchar(1000),

    strlongproper varchar(1000))

    go

    insert matt1(strlong)

    select top 10000

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20) +' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+' apple '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'ibm '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'lotus '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+

    replicate(char(65+rand(checksum(newid()))*62),rand(checksum(newid()))*20)+' '+'lotus'

    from

    sys.all_columns sc1,

    sys.all_columns sc2

    drop table matt2

    select * --make a copy for the second test

    into matt2

    from matt1

    Now - the test script:

    set nocount on

    --====Some Declarations

    declare @g datetime;

    declare @keeplist varchar(100);

    declare @term varchar(6);

    set @term='\|\|';

    set @keeplist='IBM||Lotus||apple||gm||GE';

    --==Set up the test header

    print replicate('=',60)

    print 'Proper+regex'

    print replicate('=',60)

    set @g=getdate();

    update matt1

    set strlongproper=dbo.MyCustomCase(strlong,@keeplist,@term);

    --print the time

    print datediff(ms,@g,getdate());

    go

    --====Test #2

    --====Some Declarations

    declare @Reset bit;

    declare @text varchar(1000);

    declare @Ret varchar(1000);

    declare @id int;

    declare @i int;

    declare @C char(1);

    declare @g datetime;

    print replicate('=',60)

    print 'Cursor loop'

    print replicate('=',60)

    set @g=getdate();

    --===The Cursor

    declare curse cursor local fast_forward for

    select id, strlong from matt2

    open curse

    fetch next from curse into @id,@text;

    --===The While Loop

    While @@fetch_status=0

    BEGIN

    select @Reset = 1, @i=1, @Ret = '';

    while (@i <= len(@Text))

    BEGIN

    select @C= substring(@Text,@i,1),

    @Ret = @Ret + case when @Reset=1 then UPPER(@c) else LOWER(@c) end,

    @Reset = case when @C like '[a-zA-Z]' then 0 else 1 end,

    @i = @i +1

    --EDIT - moved the second set of replaces out of the loop - only need to run one time

    END

    --DONE EDIT - moved the second set of replaces out of the loop - only need to run one time

    select

    @Ret = replace(@Ret, '_', ' '),

    @Ret = replace(@Ret, ' Lotus ', ' LOTUS '),

    @Ret = replace(@Ret, ' Ibm ', ' IBM '),

    @Ret = replace(@Ret, ' Apple ', 'APPLE '),

    @Ret = replace(@Ret, ' Gm ', ' GM '),

    @Ret = replace(@Ret, ' Ge ', ' GE ')

    --stored the value we just created

    update matt2

    set strlongproper=@ret

    where ID=@id

    fetch next from curse into @id,@text;

    END

    close curse

    deallocate curse

    print datediff(ms,@g,getdate());

    set nocount off

    And the results:

    ============================================================

    Proper+regex

    ============================================================

    4486

    ============================================================

    Cursor loop

    ============================================================

    215793

    By the way - that 48 times faster, and that's only at 10,000 rows. The bigger the set, the slower the cursor seems to get.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thanks, guys. I can see that the cursor solution will be slower, so I will work on implementing it the CLR way. I'll give a shout if I have other questions regarding anything there. I really appreciate your taking time to respond!

    Brad

  • Greetings all... I had an issue that came up some time ago in my SS 2000 box. I was able to find an extended function that allowed me to use a regular expression to find a text sequence in a string and remove it.

    Too bad it can't be ported over to SS 2005. As someone else has pointed out it isn't the fastest phase of the job but it does work.

    More specifically I have email address in the body of text that has to be removed. Process takes about 1 hr per 3000 rows of data.

    Kurt W. Zimmerman
    SR DBA
    Lefrak Organization
    New York, NY

    http://www.linkedin.com/in/kurtwzimmerman

  • Kurt W. Zimmerman (4/1/2008)


    Greetings all... I had an issue that came up some time ago in my SS 2000 box. I was able to find an extended function that allowed me to use a regular expression to find a text sequence in a string and remove it.

    Too bad it can't be ported over to SS 2005. As someone else has pointed out it isn't the fastest phase of the job but it does work.

    More specifically I have email address in the body of text that has to be removed. Process takes about 1 hr per 3000 rows of data.

    Kurt - that piece of code above DOES use built-in Regex functionality (it's in the CLR function). Regex used that way is truly fast in 2005 (at least 3-5 times faster than the XP version from 2000 from what I've seen). And it blows the doors off of the cursor. We ran a test like that previously and 3000 e-mail addresses would be done sub-second. Depending on how big your "body of text" is, I truly doubt it would take 30 seconds, let alone a minute.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply