String Manipulation

  • Jeff Moden (3/16/2009)


    Goldie Graber (3/16/2009)


    Jeff Moden (3/16/2009)


    One of your requirements was to ....

    4. Replace all entity codes (“&xxx;”) with X

    If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).

    That's what I thought. Unfortunately, we can't assume that entity codes are exactly 5 characters.

    The ones which represent characters in other languages are longer.

    Here's the code I'm using now.

    -- Replace all other entity codes with an X

    SET @Start = CHARINDEX('&', @HTMLText)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    WHILE (@Start > 0 AND @End > 0 AND @Length > 0)

    BEGIN

    IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0

    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')

    SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)

    SET @End = CHARINDEX(';', @HTMLText, @Start)

    SET @Length = (@End - @Start) + 1

    END

    What's the maximum length they can be?

    Not sure. I was told I can not assume any specific size.

    The rule I need to use is that there are no spaces in between the numbers/letters of the entity codes.

    So something like "Joe & Mary went to the store; they bought fruit" would not be replaced but something like & #21855; or & #21205; would be replaced.

  • OK, I got my transducer written, ... and it's slow as heck. I'm pretty sure that it's either the "Managed Code" safety overhead or else some problem with VB.net's implementation of CASE. I may have to record it in C# (ugh). Anyway, I won't be able to get back to this until Friday, sorry. 🙁

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • OK, I lied. I took some time and worked on it this morning and got it working right.

    To test it a made a table consisting of a million rows of Goldie's sample HTML:

    [font="Courier New"]CREATE TABLE HtmlTest(id INT IDENTITY(1,1), HTML VARCHAR(MAX))

    GO

    INSERT INTO HtmlTest(html)

    SELECT '<HTML><Body>

    <b><font face="Courier New" size="3">DOSAGE:</font> Adults--1 to 2

    teaspoonfuls. Children six to twelve--1/2 to 1 teaspoonful.

    Children two to six years--1/2 teaspoonful. These doses may be

    repeated in four hours if necessary but not more than four times in

    twenty-four hours.</b> WARNING: Persons with a high fever or

    persistent cough should not use this preparation unless directed by

    a physician. Do not exceed recommended dosage.<br />

    </Body></HTML>

    '

    FROM (SELECT TOP 1000000 c1.id

       FROM MASTER.sys.syscolumns c1, MASTER.sys.syscolumns c2

    ) a

    GO

    [/font]

    then executed the following against it:

    select convert(varchar(25), getdate(), 121)

    --null test, to use as a baseline:

    select * from HtmlTest

    Where datalength(HTML) < 10

    select convert(varchar(25), getdate(), 121)

    -- My CLR function:

    select * from HtmlTest

    Where datalength(Cast(dbo.HTMLCleaner(CAST(HTML as varbinary(MAX))) as varchar(max))) < 10

    select convert(varchar(25), getdate(), 121)

    -- Goldie's latest function

    Select * from HTMLTest

    Where datalength(dbo.StripHTML(HTML)) <10

    select convert(varchar(25), getdate(), 121)

    --Jeff's function

    Select * from HTMLTest

    Where datalength(dbo.HTML_jbm(HTML)) <10

    select convert(varchar(25), getdate(), 121)

    Jeff's & Goldie's functions were pretty close (32 minutes & 31 minutes, apprx.), but the CLR function blew the doors off of both at 6.5 minutes. Thats almost 5 times as fast! I'll post the VB code in a minute...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hmm, funky new code highlighting script is making my firefox browser jumpy... Not sure how it'll like VB code...

    Ugh. Not good. I'll have to use the simple-talk prettifier...

    Great, now the prettifier isn't working either. 🙁 I'll just have to attach it

    OK, my SQLCLR Transducer function is attached.

    Bet there is something seriously wrong with the site right now...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Very cool Barry!

    Wish I could make use of it.

  • Thanks, Goldie.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • [font="Verdana"]I learned these things as "Finite State Machines", or in general, Finite State Automata. Very cool. 😀

    I've always recommended to people that if they are doing a lot of string processing, the CLR is probably the best way to go. Nice to see a good example in practice.

    Well done!

    Just out of interest, it's not too hard to build a generalised state machine from data held in tables... Not going to recommend that in this case, as the time to instantiate the state machine would be too high.

    [/font]

  • Bruce W Cassidy (3/18/2009)


    I learned these things as "Finite State Machines", or in general, Finite State Automata. Very cool. 😀

    Thanks Bruce. Finite-State Machines & Automata (FSM/FSA) are the general class of procedure that DFSA Transducer's are a part of. A Deterministic FSA (DFSA) if just one that drives entierly off of the current state and does not have to look-ahead or behind at all (I broke that rule in a couple of places for sanity/speed).

    A Transducer is just a type of FSA that does string transforms through character-by-character state processing and replacements (or token by token in other cases). Generally it's single character/token in and zero to many out.

    Just out of interest, it's not too hard to build a generalised state machine from data held in tables... Not going to recommend that in this case, as the time to instantiate the state machine would be too high.

    Oh, building the state and transition tables is easy in SQL, it's doing the state processing efficiently that's hard.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • [font="Verdana"]Yeah, I did that stuff 20 years ago when I built a compiler. Gotten a bit rusty on the details since then.

    I wouldn't build one in SQL! However, I have (in the past) built one that stored the details in SQL. Er, that was over SQL Server 6.5 using VB4 I think. Erk. A while back now.

    [/font]

  • Yeah, "do 1 thing, change state, check state, do another thing..." isn't really SQL's forte'. Now "do everything at once", that's something that SQL is good at!

    (Edit: Spelling, typos)

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (3/17/2009)


    Hmm, funky new code highlighting script is making my firefox browser jumpy... Not sure how it'll like VB code...

    Ugh. Not good. I'll have to use the simple-talk prettifier...

    Great, now the prettifier isn't working either. 🙁 I'll just have to attach it

    OK, my SQLCLR Transducer function is attached.

    Bet there is something seriously wrong with the site right now...

    Well now that Steve & company have fixed the new format AND given us language labeling as well, I can finally list my VB code in-line:

    Imports System

    Imports System.Data

    Imports System.Data.SqlClient

    Imports System.Data.SqlTypes

    Imports Microsoft.SqlServer.Server

    Partial Public Class UserDefinedFunctions

    Public Enum States

    Space1

    Entity

    HTMLTag

    Norm

    Word1

    Script

    Style

    End Enum

    Enum SubStates

    None

    EndBegin

    EndSlash

    End Enum

    Const CharSpace As Integer = 32

    Const CharAmp As Integer = 38

    Const CharSlash As Integer = 47

    Const CharLT As Integer = 60

    Const CharGT As Integer = 62

    Const CharA As Integer = 65

    Const CharX As Integer = 88

    Const CharZ As Integer = 90

    Const Char_a As Integer = 97

    Const Char_b As Integer = 98

    Const Char_n As Integer = 110

    Const Char_p As Integer = 112

    Const Char_s As Integer = 115

    Const Char_z As Integer = 122

    Const CharDash As Integer = 45

    Const CharSemiC As Integer = 59

    <Microsoft.SqlServer.Server.SqlFunction( _

    DataAccess:=DataAccessKind.None _

    , IsDeterministic:=True _

    , IsPrecise:=True)> _

    Public Shared Function HTMLCleaner(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes

    Dim b As Byte

    Dim i As Integer, j As Integer

    Dim Out As Byte()

    Dim State As States = States.Space1

    Dim Substate As SubStates = SubStates.None

    Dim strAccum As String = ""

    ReDim Out(0 To chars.Length - 1)

    For i = 0 To chars.Length - 1

    b = chars(i)

    Select Case State

    Case States.Norm

    Select Case b

    Case CharA To CharZ, Char_a To Char_z

    Out(j) = b

    j = j + 1

    State = States.Norm

    Case CharSpace

    Out(j) = b

    j = j + 1

    State = States.Space1

    Case CharAmp

    State = States.Entity 'skip output

    Case CharLT

    Out(j) = CharSpace

    j = j + 1

    State = States.HTMLTag

    Case CharDash

    Out(j) = b

    j = j + 1

    State = States.Norm

    Case Else

    State = States.Norm 'skip output

    End Select

    Case States.Space1

    Select Case b

    Case CharSpace

    'discard leading & multiple spaces

    Case CharAmp

    State = States.Entity 'skip output

    Case CharLT

    Out(j) = CharSpace

    j = j + 1

    State = States.HTMLTag

    Case CharDash

    Out(j) = b

    j = j + 1

    State = States.Norm

    Case CharA To CharZ, Char_a To Char_z

    Out(j) = b

    j = j + 1

    State = States.Word1

    Case Else

    State = States.Norm 'skip output

    End Select

    Case States.Word1

    Select Case b

    Case CharSpace

    'sinlge char word, retract from output:

    j = j - 1

    State = States.Space1

    Case CharAmp

    State = States.Entity 'skip output

    Case CharLT

    Out(j) = CharSpace

    j = j + 1

    State = States.HTMLTag

    Case CharDash

    Out(j) = b

    j = j + 1

    State = States.Norm

    Case CharA To CharZ, Char_a To Char_z

    Out(j) = b

    j = j + 1

    State = States.Norm

    Case Else

    State = States.Norm 'skip output

    End Select

    Case States.Entity

    Select Case b

    Case CharSemiC

    'End of entity, wrap it up:

    If strAccum = "nbsp" Then

    Out(j) = CharSpace

    j = j + 1

    strAccum = ""

    State = States.Space1

    Else

    'output "X"

    Out(j) = CharX

    j = j + 1

    State = States.Norm

    End If

    Case Else

    'else, keep scanning for semicolon...

    ' accumulate entity chars:

    strAccum = strAccum & b

    End Select

    Case States.HTMLTag

    If b = CharGT Then

    If strAccum = "SCRIPT" Then

    strAccum = ""

    State = States.Script

    Substate = SubStates.None

    ElseIf strAccum = "STYLE" Then

    strAccum = ""

    State = States.Style

    Substate = SubStates.None

    Else

    Out(j) = CharSpace

    j = j + 1

    State = States.Space1

    strAccum = ""

    End If

    Else

    'accumulate tag name

    strAccum = strAccum & b

    End If

    Case States.Script

    Select Case Substate

    Case SubStates.None

    If b = CharGT Then

    Substate = SubStates.EndBegin

    End If

    Case SubStates.EndBegin

    If b = CharSlash Then

    Substate = SubStates.EndSlash

    strAccum = ""

    Else

    Substate = SubStates.None

    End If

    Case SubStates.EndSlash

    If b = CharGT Then

    If strAccum = "SCRIPT" Then

    'end of script found; output nothing

    State = States.Norm

    Substate = SubStates.None

    Else

    'false alarm, back to script-scanning

    Substate = SubStates.None

    End If

    Else

    'accumulate the end-tag's label

    strAccum = strAccum & b

    End If

    End Select

    Case States.Style

    Select Case Substate

    Case SubStates.None

    If b = CharGT Then

    Substate = SubStates.EndBegin

    End If

    Case SubStates.EndBegin

    If b = CharSlash Then

    Substate = SubStates.EndSlash

    strAccum = ""

    Else

    Substate = SubStates.None

    End If

    Case SubStates.EndSlash

    If b = CharGT Then

    If strAccum = "STYLE" Then

    'end of script found; output nothing

    State = States.Norm

    Substate = SubStates.None

    Else

    'false alarm, back to script-scanning

    Substate = SubStates.None

    End If

    Else

    'accumulate the end-tag's label

    strAccum = strAccum & b

    End If

    End Select

    Case Else

    Debug.Assert(1 = 0)

    End Select

    'extra check for multiple spaces

    If j > 1 _

    AndAlso (Out(j - 1) = CharSpace _

    And Out(j - 2) = CharSpace) Then

    j = j - 1 'roll back the last character

    ElseIf j = 1 AndAlso Out(0) = CharSpace Then

    j = 0 'overwrite leading space

    End If

    Next

    'remove any trailing space:

    If j > 0 AndAlso Out(j - 1) = CharSpace Then j = j - 1

    'trim off the trailing excess

    ReDim Preserve Out(0 To j - 1)

    Return New SqlBytes(Out)

    End Function

    <Microsoft.SqlServer.Server.SqlFunction( _

    DataAccess:=DataAccessKind.None _

    , IsDeterministic:=True _

    , IsPrecise:=True)> _

    Public Shared Function HTMLCopy2(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes

    Dim out() As Byte

    ReDim out(0 To chars.Length - 1)

    For i As Integer = 0 To chars.Length - 1

    out(i) = chars.Buffer(i)

    Next

    Return New SqlBytes(out)

    End Function

    <Microsoft.SqlServer.Server.SqlFunction( _

    DataAccess:=DataAccessKind.None _

    , IsDeterministic:=True _

    , IsPrecise:=True)> _

    Public Shared Function HTMLCopy(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes

    ' Add your code here

    Return New SqlTypes.SqlBytes(chars.Buffer)

    End Function

    End Class

    Sweet!

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Man... I just went back and looked at the code I posted... these bloody new code windows added an extra charindex, changed double spaces to single spaces, and did a whole bunch of other nasty things. Not sure what it changed in it, but it also makes Goldie's orginal code not function as she said it would (it leaves out the spaces in the final result).

    Steve, if you see this, put it back the old way... I just can't imagine what it's doing to the rest of the code I've posted on this forum.

    Barry, here's the test code I'm playing with... the prettifier worked just fine for me...

    [font="Courier New"]DECLARE @MyHtml VARCHAR(8000)

        SET @MyHtml = '

    <HTML><Body>

    <b><font face="Courier New" size="3">DOSAGE:</font> Adults--1 to 2

    teaspoonfuls. Children six to twelve--1/2 to 1 teaspoonful.

    Children two to six years--1/2 teaspoonful. These doses may be

    repeated in four hours if necessary but not more than four times in

    twenty-four hours.</b> WARNING: Persons with a high fever or

    persistent cough should not use this preparation unless directed by

    a physician. Do not exceed recommended dosage.<br />

    </Body></HTML>

    '

     SELECT TOP 1000 @MyHtml AS MyHtml

       INTO #MyHead

       FROM dbo.Tally[/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ... and here's the code, all cleaned up so it can actually be copied correctly instead of the mess the code window made. Barry, if you test compare against the VB, it should would be nice to know.

    I was able to make it just a touch faster than Goldie's version... about 1 second faster every thousand rows... not much.

    [font="Courier New"]DROP FUNCTION dbo.JBMStripHTML

    GO

    CREATE FUNCTION dbo.JBMStripHTML

            (@HTMLText VARCHAR(MAX))

    RETURNS VARCHAR(MAX)

         AS

      BEGIN

    --===== Replace all HTML tags with nothing

      WHILE CHARINDEX('<',@HTMLText) > 0

     SELECT @HTMLText = STUFF(@HTMLText,

                               CHARINDEX('<',@HTMLText),

                               CHARINDEX('>',@HTMLText)-CHARINDEX('<',@HTMLText)+1,

                               '')

    --===== Replace all EntityCodes with space

      WHILE CHARINDEX('&',@HTMLText) > 0

     SELECT @HTMLText = STUFF(@HTMLText,

                               CHARINDEX('&',@HTMLText),

                               CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)+1,

                               '')

    --===== Replace all special characters (except dashes and spaces) and digits with a space

     SELECT @HTMLText = STUFF(@HTMLText,t.N,1,' ')

       FROM dbo.Tally t

      WHERE t.N <= LEN(@HTMLText)

        AND SUBSTRING(@HTMLText,t.N,1) LIKE '%[^- A-Z]%'

    --===== Replace single letter words with a space

      WHILE PATINDEX('% _ %',@HTMLText) > 0

     SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('% _ %',@HTMLText),3,' ')

    --===== Replace multiple spaces with a single space

      WHILE CHARINDEX('  ',@HTMLText) > 0

     SELECT @HTMLText = REPLACE(@HTMLText,'  ',' ')

    --===== Drop any leading or trailing spaces

     RETURN LTRIM(RTRIM(@HTMLText))

        END

    GO

    [/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/28/2009)


    Barry, here's the test code I'm playing with... the prettifier worked just fine for me...

    Yeah, I discovered that the prettifier works OK outside of the code windows. The problem is that it used to work inside of the code windows, so there are a lot of posts out there now that look like Goldie's final version (around post 30 in this thread?).

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hmm, I thought that this is what I already tested, but sure, I'll try it just in case it isn't...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 15 posts - 46 through 60 (of 60 total)

You must be logged in to reply to this topic. Login to reply