March 16, 2009 at 7:31 pm
Jeff Moden (3/16/2009)
Goldie Graber (3/16/2009)
Jeff Moden (3/16/2009)
One of your requirements was to ....
4. Replace all entity codes (“&xxx;”) with X
If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).
That's what I thought. Unfortunately, we can't assume that entity codes are exactly 5 characters.
The ones which represent characters in other languages are longer.
Here's the code I'm using now.
-- Replace all other entity codes with an X
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0)
BEGIN
IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')
SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
END
What's the maximum length they can be?
Not sure. I was told I can not assume any specific size.
The rule I need to use is that there are no spaces in between the numbers/letters of the entity codes.
So something like "Joe & Mary went to the store; they bought fruit" would not be replaced but something like & #21855; or & #21205; would be replaced.
March 16, 2009 at 8:40 pm
OK, I got my transducer written, ... and it's slow as heck. I'm pretty sure that it's either the "Managed Code" safety overhead or else some problem with VB.net's implementation of CASE. I may have to record it in C# (ugh). Anyway, I won't be able to get back to this until Friday, sorry. 🙁
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 17, 2009 at 8:34 am
OK, I lied. I took some time and worked on it this morning and got it working right.
To test it a made a table consisting of a million rows of Goldie's sample HTML:
[font="Courier New"]CREATE TABLE HtmlTest(id INT IDENTITY(1,1), HTML VARCHAR(MAX))
GO
INSERT INTO HtmlTest(html)
SELECT '<HTML><Body>
<b><font face="Courier New" size="3">DOSAGE:</font> Adults--1 to 2
teaspoonfuls. Children six to twelve--1/2 to 1 teaspoonful.
Children two to six years--1/2 teaspoonful. These doses may be
repeated in four hours if necessary but not more than four times in
twenty-four hours.</b> WARNING: Persons with a high fever or
persistent cough should not use this preparation unless directed by
a physician. Do not exceed recommended dosage.<br />
</Body></HTML>
'
FROM (SELECT TOP 1000000 c1.id
FROM MASTER.sys.syscolumns c1, MASTER.sys.syscolumns c2
) a
GO
[/font]
then executed the following against it:
select convert(varchar(25), getdate(), 121)
--null test, to use as a baseline:
select * from HtmlTest
Where datalength(HTML) < 10
select convert(varchar(25), getdate(), 121)
-- My CLR function:
select * from HtmlTest
Where datalength(Cast(dbo.HTMLCleaner(CAST(HTML as varbinary(MAX))) as varchar(max))) < 10
select convert(varchar(25), getdate(), 121)
-- Goldie's latest function
Select * from HTMLTest
Where datalength(dbo.StripHTML(HTML)) <10
select convert(varchar(25), getdate(), 121)
--Jeff's function
Select * from HTMLTest
Where datalength(dbo.HTML_jbm(HTML)) <10
select convert(varchar(25), getdate(), 121)
Jeff's & Goldie's functions were pretty close (32 minutes & 31 minutes, apprx.), but the CLR function blew the doors off of both at 6.5 minutes. Thats almost 5 times as fast! I'll post the VB code in a minute...
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 17, 2009 at 9:06 am
Hmm, funky new code highlighting script is making my firefox browser jumpy... Not sure how it'll like VB code...
Ugh. Not good. I'll have to use the simple-talk prettifier...
Great, now the prettifier isn't working either. 🙁 I'll just have to attach it
OK, my SQLCLR Transducer function is attached.
Bet there is something seriously wrong with the site right now...
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 18, 2009 at 8:01 am
Very cool Barry!
Wish I could make use of it.
March 18, 2009 at 6:37 pm
Thanks, Goldie.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 18, 2009 at 6:48 pm
[font="Verdana"]I learned these things as "Finite State Machines", or in general, Finite State Automata. Very cool. 😀
I've always recommended to people that if they are doing a lot of string processing, the CLR is probably the best way to go. Nice to see a good example in practice.
Well done!
Just out of interest, it's not too hard to build a generalised state machine from data held in tables... Not going to recommend that in this case, as the time to instantiate the state machine would be too high.
[/font]
March 18, 2009 at 7:51 pm
Bruce W Cassidy (3/18/2009)
I learned these things as "Finite State Machines", or in general, Finite State Automata. Very cool. 😀
Thanks Bruce. Finite-State Machines & Automata (FSM/FSA) are the general class of procedure that DFSA Transducer's are a part of. A Deterministic FSA (DFSA) if just one that drives entierly off of the current state and does not have to look-ahead or behind at all (I broke that rule in a couple of places for sanity/speed).
A Transducer is just a type of FSA that does string transforms through character-by-character state processing and replacements (or token by token in other cases). Generally it's single character/token in and zero to many out.
Just out of interest, it's not too hard to build a generalised state machine from data held in tables... Not going to recommend that in this case, as the time to instantiate the state machine would be too high.
Oh, building the state and transition tables is easy in SQL, it's doing the state processing efficiently that's hard.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 18, 2009 at 8:44 pm
[font="Verdana"]Yeah, I did that stuff 20 years ago when I built a compiler. Gotten a bit rusty on the details since then.
I wouldn't build one in SQL! However, I have (in the past) built one that stored the details in SQL. Er, that was over SQL Server 6.5 using VB4 I think. Erk. A while back now.
[/font]
March 18, 2009 at 9:11 pm
Yeah, "do 1 thing, change state, check state, do another thing..." isn't really SQL's forte'. Now "do everything at once", that's something that SQL is good at!
(Edit: Spelling, typos)
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 28, 2009 at 11:41 am
RBarryYoung (3/17/2009)
Hmm, funky new code highlighting script is making my firefox browser jumpy... Not sure how it'll like VB code...Ugh. Not good. I'll have to use the simple-talk prettifier...
Great, now the prettifier isn't working either. 🙁 I'll just have to attach it
OK, my SQLCLR Transducer function is attached.
Bet there is something seriously wrong with the site right now...
Well now that Steve & company have fixed the new format AND given us language labeling as well, I can finally list my VB code in-line:
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Partial Public Class UserDefinedFunctions
Public Enum States
Space1
Entity
HTMLTag
Norm
Word1
Script
Style
End Enum
Enum SubStates
None
EndBegin
EndSlash
End Enum
Const CharSpace As Integer = 32
Const CharAmp As Integer = 38
Const CharSlash As Integer = 47
Const CharLT As Integer = 60
Const CharGT As Integer = 62
Const CharA As Integer = 65
Const CharX As Integer = 88
Const CharZ As Integer = 90
Const Char_a As Integer = 97
Const Char_b As Integer = 98
Const Char_n As Integer = 110
Const Char_p As Integer = 112
Const Char_s As Integer = 115
Const Char_z As Integer = 122
Const CharDash As Integer = 45
Const CharSemiC As Integer = 59
<Microsoft.SqlServer.Server.SqlFunction( _
DataAccess:=DataAccessKind.None _
, IsDeterministic:=True _
, IsPrecise:=True)> _
Public Shared Function HTMLCleaner(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
Dim b As Byte
Dim i As Integer, j As Integer
Dim Out As Byte()
Dim State As States = States.Space1
Dim Substate As SubStates = SubStates.None
Dim strAccum As String = ""
ReDim Out(0 To chars.Length - 1)
For i = 0 To chars.Length - 1
b = chars(i)
Select Case State
Case States.Norm
Select Case b
Case CharA To CharZ, Char_a To Char_z
Out(j) = b
j = j + 1
State = States.Norm
Case CharSpace
Out(j) = b
j = j + 1
State = States.Space1
Case CharAmp
State = States.Entity 'skip output
Case CharLT
Out(j) = CharSpace
j = j + 1
State = States.HTMLTag
Case CharDash
Out(j) = b
j = j + 1
State = States.Norm
Case Else
State = States.Norm 'skip output
End Select
Case States.Space1
Select Case b
Case CharSpace
'discard leading & multiple spaces
Case CharAmp
State = States.Entity 'skip output
Case CharLT
Out(j) = CharSpace
j = j + 1
State = States.HTMLTag
Case CharDash
Out(j) = b
j = j + 1
State = States.Norm
Case CharA To CharZ, Char_a To Char_z
Out(j) = b
j = j + 1
State = States.Word1
Case Else
State = States.Norm 'skip output
End Select
Case States.Word1
Select Case b
Case CharSpace
'sinlge char word, retract from output:
j = j - 1
State = States.Space1
Case CharAmp
State = States.Entity 'skip output
Case CharLT
Out(j) = CharSpace
j = j + 1
State = States.HTMLTag
Case CharDash
Out(j) = b
j = j + 1
State = States.Norm
Case CharA To CharZ, Char_a To Char_z
Out(j) = b
j = j + 1
State = States.Norm
Case Else
State = States.Norm 'skip output
End Select
Case States.Entity
Select Case b
Case CharSemiC
'End of entity, wrap it up:
If strAccum = "nbsp" Then
Out(j) = CharSpace
j = j + 1
strAccum = ""
State = States.Space1
Else
'output "X"
Out(j) = CharX
j = j + 1
State = States.Norm
End If
Case Else
'else, keep scanning for semicolon...
' accumulate entity chars:
strAccum = strAccum & b
End Select
Case States.HTMLTag
If b = CharGT Then
If strAccum = "SCRIPT" Then
strAccum = ""
State = States.Script
Substate = SubStates.None
ElseIf strAccum = "STYLE" Then
strAccum = ""
State = States.Style
Substate = SubStates.None
Else
Out(j) = CharSpace
j = j + 1
State = States.Space1
strAccum = ""
End If
Else
'accumulate tag name
strAccum = strAccum & b
End If
Case States.Script
Select Case Substate
Case SubStates.None
If b = CharGT Then
Substate = SubStates.EndBegin
End If
Case SubStates.EndBegin
If b = CharSlash Then
Substate = SubStates.EndSlash
strAccum = ""
Else
Substate = SubStates.None
End If
Case SubStates.EndSlash
If b = CharGT Then
If strAccum = "SCRIPT" Then
'end of script found; output nothing
State = States.Norm
Substate = SubStates.None
Else
'false alarm, back to script-scanning
Substate = SubStates.None
End If
Else
'accumulate the end-tag's label
strAccum = strAccum & b
End If
End Select
Case States.Style
Select Case Substate
Case SubStates.None
If b = CharGT Then
Substate = SubStates.EndBegin
End If
Case SubStates.EndBegin
If b = CharSlash Then
Substate = SubStates.EndSlash
strAccum = ""
Else
Substate = SubStates.None
End If
Case SubStates.EndSlash
If b = CharGT Then
If strAccum = "STYLE" Then
'end of script found; output nothing
State = States.Norm
Substate = SubStates.None
Else
'false alarm, back to script-scanning
Substate = SubStates.None
End If
Else
'accumulate the end-tag's label
strAccum = strAccum & b
End If
End Select
Case Else
Debug.Assert(1 = 0)
End Select
'extra check for multiple spaces
If j > 1 _
AndAlso (Out(j - 1) = CharSpace _
And Out(j - 2) = CharSpace) Then
j = j - 1 'roll back the last character
ElseIf j = 1 AndAlso Out(0) = CharSpace Then
j = 0 'overwrite leading space
End If
Next
'remove any trailing space:
If j > 0 AndAlso Out(j - 1) = CharSpace Then j = j - 1
'trim off the trailing excess
ReDim Preserve Out(0 To j - 1)
Return New SqlBytes(Out)
End Function
<Microsoft.SqlServer.Server.SqlFunction( _
DataAccess:=DataAccessKind.None _
, IsDeterministic:=True _
, IsPrecise:=True)> _
Public Shared Function HTMLCopy2(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
Dim out() As Byte
ReDim out(0 To chars.Length - 1)
For i As Integer = 0 To chars.Length - 1
out(i) = chars.Buffer(i)
Next
Return New SqlBytes(out)
End Function
<Microsoft.SqlServer.Server.SqlFunction( _
DataAccess:=DataAccessKind.None _
, IsDeterministic:=True _
, IsPrecise:=True)> _
Public Shared Function HTMLCopy(ByVal chars As SqlTypes.SqlBytes) As SqlTypes.SqlBytes
' Add your code here
Return New SqlTypes.SqlBytes(chars.Buffer)
End Function
End Class
Sweet!
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 28, 2009 at 4:38 pm
Man... I just went back and looked at the code I posted... these bloody new code windows added an extra charindex, changed double spaces to single spaces, and did a whole bunch of other nasty things. Not sure what it changed in it, but it also makes Goldie's orginal code not function as she said it would (it leaves out the spaces in the final result).
Steve, if you see this, put it back the old way... I just can't imagine what it's doing to the rest of the code I've posted on this forum.
Barry, here's the test code I'm playing with... the prettifier worked just fine for me...
[font="Courier New"]DECLARE @MyHtml VARCHAR(8000)
SET @MyHtml = '
<HTML><Body>
<b><font face="Courier New" size="3">DOSAGE:</font> Adults--1 to 2
teaspoonfuls. Children six to twelve--1/2 to 1 teaspoonful.
Children two to six years--1/2 teaspoonful. These doses may be
repeated in four hours if necessary but not more than four times in
twenty-four hours.</b> WARNING: Persons with a high fever or
persistent cough should not use this preparation unless directed by
a physician. Do not exceed recommended dosage.<br />
</Body></HTML>
'
SELECT TOP 1000 @MyHtml AS MyHtml
INTO #MyHead
FROM dbo.Tally[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
March 28, 2009 at 5:39 pm
... and here's the code, all cleaned up so it can actually be copied correctly instead of the mess the code window made. Barry, if you test compare against the VB, it should would be nice to know.
I was able to make it just a touch faster than Goldie's version... about 1 second faster every thousand rows... not much.
[font="Courier New"]DROP FUNCTION dbo.JBMStripHTML
GO
CREATE FUNCTION dbo.JBMStripHTML
(@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
--===== Replace all HTML tags with nothing
WHILE CHARINDEX('<',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,
CHARINDEX('<',@HTMLText),
CHARINDEX('>',@HTMLText)-CHARINDEX('<',@HTMLText)+1,
'')
--===== Replace all EntityCodes with space
WHILE CHARINDEX('&',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,
CHARINDEX('&',@HTMLText),
CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)+1,
'')
--===== Replace all special characters (except dashes and spaces) and digits with a space
SELECT @HTMLText = STUFF(@HTMLText,t.N,1,' ')
FROM dbo.Tally t
WHERE t.N <= LEN(@HTMLText)
AND SUBSTRING(@HTMLText,t.N,1) LIKE '%[^- A-Z]%'
--===== Replace single letter words with a space
WHILE PATINDEX('% _ %',@HTMLText) > 0
SELECT @HTMLText = STUFF(@HTMLText,PATINDEX('% _ %',@HTMLText),3,' ')
--===== Replace multiple spaces with a single space
WHILE CHARINDEX(' ',@HTMLText) > 0
SELECT @HTMLText = REPLACE(@HTMLText,' ',' ')
--===== Drop any leading or trailing spaces
RETURN LTRIM(RTRIM(@HTMLText))
END
GO
[/font]
--Jeff Moden
Change is inevitable... Change for the better is not.
March 28, 2009 at 6:00 pm
Jeff Moden (3/28/2009)
Barry, here's the test code I'm playing with... the prettifier worked just fine for me...
Yeah, I discovered that the prettifier works OK outside of the code windows. The problem is that it used to work inside of the code windows, so there are a lot of posts out there now that look like Goldie's final version (around post 30 in this thread?).
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 28, 2009 at 6:27 pm
Hmm, I thought that this is what I already tested, but sure, I'll try it just in case it isn't...
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 15 posts - 46 through 60 (of 60 total)
You must be logged in to reply to this topic. Login to reply