December 15, 2006 at 11:16 am
I have a text field that is written to by a propritary database. In addition to the data I want, it writes a lot of HTML tags as well, since it uses those to display this field in a pretty format in the application. I want to remove all HTML tags.
I've been playing with this all morning, but my methods aren't working very well. Any ideas? Every single piece of text that I want to strip out is in <>s.
Thanks!
December 15, 2006 at 11:27 am
This is one good way. However there's a faster way if you have only 1 text information per row.
CREATE FUNCTION dbo.fnStripHtmlTags (@Text AS VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @Return AS VARCHAR(8000)
DECLARE @StartPos AS SMALLINT
DECLARE @Loops AS SMALLINT
SET @Loops = LEN(@Text) - LEN (REPLACE(@Text, '<', ''))
WHILE @Loops > 0
BEGIN
SET @StartPos = CHARINDEX ('<', @Text, 1)
SET @Text = STUFF(@Text, @StartPos, CHARINDEX ('>', @Text, @StartPos + 1) - @StartPos + 1, '')
SET @Loops = @Loops - 1
END
SET @Return = @Text
RETURN @Return
END
December 15, 2006 at 12:23 pm
Wow - that did just what I needed and worked amazingly well.
I am in awe!
And even better, I think I understand it. (I'm a total n00b).
Thanks much!
December 16, 2006 at 8:45 am
Heh... my turn Remi... how come no Tally table here? "I'm surprised you didn't think of it"
--Jeff Moden
Change is inevitable... Change for the better is not.
December 16, 2006 at 11:48 am
Old code... didn't have time to rethink and rewrite. But I'm sure you already have the "good" answer ready so what not just post it already !
December 18, 2006 at 12:25 pm
I now remember why I don't have a set based version for this one... I just can't beat the loop into submission . This is the closest I'm come to matching the speeds, but this version takes about 30% longer to run than the loop version!
IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET4' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')
DROP FUNCTION dbo.fnStripHtmlTags_SET4
GO
--this code assumes that more than one section of text can be found for each line.
CREATE FUNCTION dbo.fnStripHtmlTags_SET4 (@Text AS VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @Loops AS SMALLINT
SET @Loops = LEN(@Text) - LEN (REPLACE(@Text, '<', ''))
--SELECT @Text = CASE WHEN CHARINDEX ('<', @Text, 1) * CHARINDEX ('>', @Text, 1) > 0
SELECT @Text = STUFF(@Text, CHARINDEX ('<', @Text, 1), CHARINDEX ('>', @Text, 1) - CHARINDEX ('<', @Text, 1) + 1, '')
FROM dbo.Numbers
WHERE PkNumber <= @Loops
RETURN @Text
END
GO
Those just don't even come close to it (was just having fun at that point ) :
IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')
DROP FUNCTION dbo.fnStripHtmlTags_SET
GO
--this code assumes that more than one section of text can be found for each line.
CREATE FUNCTION dbo.fnStripHtmlTags_SET (@Text AS VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @Length AS SMALLINT
SET @Length = LEN(@Text)
SELECT
@Text = STUFF(@Text, dtMap.StartPos, dtMap.EndPos - dtMap.StartPos, '')
FROM
(
SELECT TOP 100 PERCENT
MAX(dtStart.PkNumber) AS StartPos
, dtEnd.PkNumber AS EndPos
FROM (
SELECT PkNumber
FROM dbo.Numbers
WHERE CHARINDEX ('<', @Text, PkNumber) = PkNumber
AND PkNumber <= @Length
  dtStart
INNER JOIN
(
SELECT PkNumber + 1 AS PkNumber
FROM dbo.Numbers
WHERE CHARINDEX ('>', @Text, PkNumber) = PkNumber
AND PkNumber <= @Length
  dtEnd
ON dtStart.PkNumber < dtEnd.PkNumber
GROUP BY dtEnd.PkNumber
ORDER BY EndPos DESC
  dtMap
RETURN @Text
END
IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET2' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')
DROP FUNCTION dbo.fnStripHtmlTags_SET2
GO
--this code assumes that more than one section of text can be found for each line.
CREATE FUNCTION dbo.fnStripHtmlTags_SET2 (@Text AS VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @Length AS SMALLINT
SET @Length = LEN(@Text)
SELECT
@Text = STUFF(@Text, dtMap.StartPos, dtMap.EndPos - dtMap.StartPos, '')
FROM
(
SELECT TOP 100 PERCENT
PkNumber AS StartPos
, CHARINDEX ('>', @Text, PkNumber) AS EndPos
FROM dbo.Numbers
WHERE CHARINDEX ('<', @Text, PkNumber) = PkNumber
AND PkNumber <= @Length
ORDER BY EndPos DESC
  dtMap
RETURN @Text
END
GO
IF EXISTS (SELECT * FROM dbo.SysObjects WHERE name = 'fnStripHtmlTags_SET3' AND XType = 'fn' AND USER_NAME(uid) = 'dbo')
DROP FUNCTION dbo.fnStripHtmlTags_SET3
GO
--this code assumes that more than one section of text can be found for each line.
CREATE FUNCTION dbo.fnStripHtmlTags_SET3 (@Text AS VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @Length AS SMALLINT
SET @Length = LEN(@Text)
SELECT
@Text = STUFF(@Text, dtMap.StartPos, dtMap.EndPos - dtMap.StartPos, '')
FROM
(
SELECT TOP 100 PERCENT
PkNumber AS StartPos
, CASE WHEN CHARINDEX ('<', @Text, PkNumber) = PkNumber THEN CHARINDEX ('>', @Text, PkNumber) ELSE NULL END AS EndPos
FROM dbo.Numbers
WHERE CHARINDEX ('<', @Text, PkNumber) = PkNumber
AND PkNumber <= @Length
ORDER BY EndPos DESC
  dtMap
RETURN @Text
END
GO
PS I don't have a create testtable script because I loaded a big html page into a table. I assume you'll be able to find one of those .
December 18, 2006 at 12:33 pm
should u choose regular expressions route for this task, then match <..> substrings with
<[^<>]*>
and then replace the matches with
[empty string]
thus eliminating them from the original string
December 18, 2006 at 1:32 pm
I have no idea of what you are trying to say... can you show a working example with this data?
DECLARE @x AS VARCHAR(8000)
SET @x = ' <table>test3<font>sometext</font>tex2</table>'
SELECT PATINDEX ('<[^<>]*>', @x)
December 18, 2006 at 2:17 pm
I think that's just meant to be some strange gremlin emoticon.
December 18, 2006 at 2:24 pm
Maybe but I still don't get his idea.
I can manually replace the occurance of "<" to ">" to something else but I need to know how to do it faster. Is there any function that acts as a combinaison of patindex and stuff where the whole pattern could be replaced to something else??
If noone can offer something more, then I guess this is one of those exceptions where set based can't even be as fast as procedural.
December 18, 2006 at 3:14 pm
Try this:
DECLARE @Text varchar(8000)
SET @Text = '"X<5"'
SELECT dbo.fnStripHtmlTags (@Text)
Remove space in "< Condition".
_____________
Code for TallyGenerator
December 18, 2006 at 3:37 pm
David I:
It is not a gremlin emoticon. It is a regular expression pattern saying:
- match *<* then a series of characters [0 to N in count] which are NOT *<* or *>*, then match *>*
i.e. exactly what original poster had in mind.
December 18, 2006 at 3:41 pm
Remi,
With <[^<>]*>, I just gave a generic regular expression for matching a generic <…> tag.
Actual implementation of the Regex will be different in SQL Server 2005 and SQL Server 2000.
In SQL Server 2000 that doesn’t have native Regex support, one can resort to Server 2000 Extended Stored Procedure
http://www.codeproject.com/managedcpp/xpregex.asp
to be able do match/replace operation using xp_regex_replace procedure (cf. the link)
In SQL Server 2005 u can create (to be cont.).....
December 18, 2006 at 3:45 pm
I'm pretty sure I was kidding when I posted that but it was so long ago that I forget
December 18, 2006 at 3:46 pm
For normal simple mind people (like me, for example) everyone who understands regular expressions so well is kind of gremlin.
And you are outrageous in it.
_____________
Code for TallyGenerator
Viewing 15 posts - 1 through 15 (of 22 total)
You must be logged in to reply to this topic. Login to reply