March 15, 2009 at 9:25 pm
[font="Verdana"]You might be better off writing something in C# or VB, and linking it in as an assembly. That way you can use the .Net regular expressions.[/font]
March 15, 2009 at 9:32 pm
Hey Bruce,
If they're going to go that far, why not just help write some front-end editing and leave the db out of it? Or is this a batch load of some sort?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 15, 2009 at 9:45 pm
Thanks, got it.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 15, 2009 at 9:55 pm
[font="Verdana"]Bob,
That would be ideal. No idea if they have that option though.
Another option would be to cast it to some form of XML. No idea if that would work, just "thinking outside the square".
T-SQL is a bit limited for in-depth string manipulation, particularly around regular expressions. So many time I have wished for a function that's the equivalent of a REPLACE(), but based around a regular expression.
[/font]
March 16, 2009 at 8:58 am
Goldie Graber (3/15/2009)
Jeff, that looks awesome.I'm going to test it out at work tomorrow.
One part I don't quite understand is this:
--===== Replace all EntityCodes with the actual ascii character (Just to be safe instead of dumping them)
WHILE CHARINDEX('&',@HTMLText) > 0 AND CHARINDEX(';',@HTMLText) > CHARINDEX('&',@HTMLText)
AND CHARINDEX(';',@HTMLText)-CHARINDEX('&',@HTMLText)=4
One of your requirements was to ....
4. Replace all entity codes (“&xxx;”) with X
If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2009 at 9:08 am
Jeff Moden (3/16/2009)
One of your requirements was to ....
4. Replace all entity codes (“&xxx;”) with X
If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).
That's what I thought. Unfortunately, we can't assume that entity codes are exactly 5 characters.
The ones which represent characters in other languages are longer.
Here's the code I'm using now.
-- Replace all other entity codes with an X
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0)
BEGIN
IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')
SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
END
March 16, 2009 at 9:08 am
By the way, Goldie... no special test is required for the non-breaking space because of the way I detect characters that need to be replaced with a space.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2009 at 2:11 pm
Jeff, do you have a test set for this?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
March 16, 2009 at 2:58 pm
I can't seem to get better performance than my original function.
I'm pasting it here for future reference.
[font="Courier New"]SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[GetWordCount]
(@HTMLText VARCHAR(MAX))
RETURNS INT
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
DECLARE @NewLength INT
DECLARE @LengthWithoutSpaces INT
DECLARE @WordCount INT
-- Remove <head>
SET @Start = CHARINDEX('<head>', @HTMLText)
SET @End = CHARINDEX('</head>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 7
WHILE (@Start > 0 AND @End > 0)
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<head>', @HTMLText)
SET @End = CHARINDEX('</head>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 7
END
-- Remove <script> tags and contents
SET @Start = CHARINDEX('<script', @HTMLText)
SET @End = CHARINDEX('</script>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 8
WHILE (@Start > 0 AND @End > 0)
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<script', @HTMLText)
SET @End = CHARINDEX('</script>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 8
END
-- Remove <style> tags and contents
SET @Start = CHARINDEX('<style', @HTMLText)
SET @End = CHARINDEX('</style>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 7
WHILE (@Start > 0 AND @End > 0)
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<style', @HTMLText)
SET @End = CHARINDEX('</style>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 7
END
-- Replace all HTML tags with a space
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0)
BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
END
-- Replace with an actual space
SELECT @HtmlText = REPLACE(@HtmlText, ' ', ' ')
-- Replace all other entity codes with an X
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 )
BEGIN
IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')
SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
END
-- Replace all special characters with a space
DECLARE @CleanString VARCHAR(MAX)
;WITH CTE_TALLY AS
(
-- Create a Tally CTE from 1 to whatever the length of the parameter is
SELECT TOP (LEN(@HTMLText)) ROW_NUMBER() OVER (ORDER BY t1.OBJECT_ID) AS N
FROM MASTER.sys.All_Columns t1
CROSS JOIN MASTER.sys.All_Columns t2
)
SELECT @CleanString = ISNULL(@CleanString,'') + CASE WHEN SUBSTRING(@HTMLText,N,1) LIKE '[~`!@#$%^&*:;()¢<>+_={}|\/?".,''==°÷
0123456789]' OR SUBSTRING(@HTMLText,N,1) LIKE ']' OR SUBSTRING(@HTMLText,N,1) LIKE '[' THEN ' ' ELSE SUBSTRING(@HTMLText,N,1) END
FROM CTE_TALLY WITH (NOLOCK)
WHERE N <= LEN(@HTMLText)
-- Remove single character words
SELECT @CleanString = ' ' + @CleanString + ' '
WHILE PATINDEX('% _ %' , @CleanString) > 0
SET @CleanString = STUFF(@CleanString, PATINDEX('% _ %',@CleanString), 3, ' ')
-- Remove all extra spaces
WHILE CHARINDEX(' ',@CleanString) > 0
SELECT @CleanString = REPLACE(@CleanString, ' ', ' ')
SELECT @CleanString = LTRIM(RTRIM(@CleanString))
-- Calculate word count
SELECT @NewLength = LEN(@CleanString)
SELECT @LengthWithoutSpaces = LEN(REPLACE(@CleanString, ' ', ''))
IF @LengthWithoutSpaces > 0
SET @WordCount = (@NewLength - @LengthWithoutSpaces + 1)
ELSE
SET @WordCount = 0
RETURN @WordCount
END
[/font]
March 16, 2009 at 3:10 pm
RBarryYoung (3/16/2009)
Jeff, do you have a test set for this?
Only the "single line" of HTML Goldie previously attached a couple of posts back...
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2009 at 3:12 pm
Goldie Graber (3/16/2009)
I can't seem to get better performance than my original function.I'm pasting it here for future reference.
Do you have the code for the new function... I know it will likely be nearly identical to the code I posted, but it would be good for me to check.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2009 at 3:16 pm
Jeff Moden (3/16/2009)
Goldie Graber (3/16/2009)
I can't seem to get better performance than my original function.I'm pasting it here for future reference.
Do you have the code for the new function... I know it will likely be nearly identical to the code I posted, but it would be good for me to check.
I posted it in my last post.
Very similar to yours, but not identical.
March 16, 2009 at 6:05 pm
Goldie Graber (3/16/2009)
Jeff Moden (3/16/2009)
Goldie Graber (3/16/2009)
I can't seem to get better performance than my original function.I'm pasting it here for future reference.
Do you have the code for the new function... I know it will likely be nearly identical to the code I posted, but it would be good for me to check.
I posted it in my last post.
Very similar to yours, but not identical.
Heh... well try mine. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2009 at 6:15 pm
Goldie Graber (3/16/2009)
Jeff Moden (3/16/2009)
One of your requirements was to ....
4. Replace all entity codes (“&xxx;”) with X
If you "smear" that requirement across all the other requirements including that of replacing certain symbols and single letter words with spaces, all I did was to first detect the entity code and then change the 3 digits to an ASCII code. For example, & 0 4 1 ; (without the spaces) would be replaced by a right parenthesis. It also allows for an & to precede a ; in the actual text without being converted because it checks for the proper length of the entity code (which is actually 5 but that's the way this type of math works out).
That's what I thought. Unfortunately, we can't assume that entity codes are exactly 5 characters.
The ones which represent characters in other languages are longer.
Here's the code I'm using now.
-- Replace all other entity codes with an X
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0)
BEGIN
IF CHARINDEX(' ',SUBSTRING(@HTMLText, @Start, @Length)) = 0
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, 'X')
SET @Start = CHARINDEX('&', @HTMLText, @Start + 1)
SET @End = CHARINDEX(';', @HTMLText, @Start)
SET @Length = (@End - @Start) + 1
END
What's the maximum length they can be?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 16, 2009 at 7:27 pm
Jeff Moden (3/16/2009)
Heh... well try mine. 😛
I did. Mine was faster 😉
Viewing 15 posts - 31 through 45 (of 60 total)
You must be logged in to reply to this topic. Login to reply