October 28, 2011 at 2:21 pm
Guys,
Guys I have a table with a column that has html text. The column with html text is pretty big datatye varchar(max)... I wanted to check if any of you have any function that I can use to Strip out the HTML tags... I saw couple of version online, but it was running too slow..
This is the one I used:
http://cosier.wordpress.com/2008/10/22/tsql-strip-html-function/
Any suggestion is helpful.
Thanks,
Laura
Strip Out 🙂
October 30, 2011 at 12:30 pm
Laura_SqlNovice (10/28/2011)
Guys,Guys I have a table with a column that has html text. The column with html text is pretty big datatye varchar(max)... I wanted to check if any of you have any function that I can use to Strip out the HTML tags... I saw couple of version online, but it was running too slow..
This is the one I used:
http://cosier.wordpress.com/2008/10/22/tsql-strip-html-function/
Any suggestion is helpful.
Thanks,
Laura
Strip Out 🙂
What do you want to do with the characters that remain? Just leave them in the string?
Also, what about individual lines? If there's a new line in the HTML, do you want a new row in the result set?
--Jeff Moden
Change is inevitable... Change for the better is not.
October 30, 2011 at 6:01 pm
is it HTML or XHTML?
also, +1 on Jeff's questions...
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
October 30, 2011 at 7:18 pm
Thanks jeff. Yes I just need the string with no HTML tags. IF there is new line I do not need another line, but need one string for each html string. Hope I am making sense. Thanks again.
October 30, 2011 at 8:16 pm
Understood. Let's nail the requirements right to the floor now. If you have the following string, what precisely do you want to get back?
<data1>somedata1a</data1><data2>somedata2a</data2>
<data1>somedata1b</data1><data2>somedata2b</data2>
<data1>somedata1c</data1><data2>somedata2c</data2>
--Jeff Moden
Change is inevitable... Change for the better is not.
October 30, 2011 at 8:49 pm
Jeff Moden (10/30/2011)
Understood. Let's nail the requirements right to the floor now. If you have the following string, what precisely do you want to get back?
<data1>somedata1a</data1><data2>somedata2a</data2>
<data1>somedata1b</data1><data2>somedata2b</data2>
<data1>somedata1c</data1><data2>somedata2c</data2>
If these strings are loaded in three rows then I need three rows with no html tags... I need to replace the html tags with '' on a column which has html text. The data type is ntext for that sql column.
Thanks a lot.
October 30, 2011 at 9:45 pm
Got it. Thanks.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 30, 2011 at 11:34 pm
I refactored the function that I found at the following URL... look for the answer posted by "dudeNumber4"...
CREATE FUNCTION dbo.StripHtmlTags
(@HtmlText XML )
RETURNS NVARCHAR(MAX)
AS
BEGIN
RETURN (
SELECT contents.value('.', 'nvarchar(max)')
FROM (
SELECT contents = chunks.chunk.query('.') FROM @HtmlText.nodes('/') AS chunks(chunk)
) doc
)
END
;
I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 31, 2011 at 8:51 am
Thanks a lot Jeff. This is a good solution for well formatted html. Thanks a lot again Jeff.
October 31, 2011 at 9:34 am
Jeff Moden (10/30/2011)
I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.
I also tried and it took half as long....
CREATE FUNCTION dbo.StripHtmlTags2
(@HtmlText XML )
RETURNS TABLE
AS
RETURN
(
SELECT
@HtmlText.value('(.)[1]', 'nvarchar(max)') AS result
)
Is this how yours ended up?
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
October 31, 2011 at 9:37 am
Jeff Moden (10/30/2011)
I refactored the function that I found at the following URL... look for the answer posted by "dudeNumber4"...
CREATE FUNCTION dbo.StripHtmlTags
(@HtmlText XML )
RETURNS NVARCHAR(MAX)
AS
BEGIN
RETURN (
SELECT contents.value('.', 'nvarchar(max)')
FROM (
SELECT contents = chunks.chunk.query('.') FROM @HtmlText.nodes('/') AS chunks(chunk)
) doc
)
END
;
I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.
This is pretty slick. Of course for must of the stuff our marketing department puts together it will fail because they seem incapable of putting together actual valid html. They routinely leave off closing tags which will cause this to fail parsing the html string.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
October 31, 2011 at 3:00 pm
mister.magoo (10/31/2011)
Jeff Moden (10/30/2011)
I tried turning it into an iTVF instead of a scalar function and it took twice as long to run.I also tried and it took half as long....
CREATE FUNCTION dbo.StripHtmlTags2
(@HtmlText XML )
RETURNS TABLE
AS
RETURN
(
SELECT
@HtmlText.value('(.)[1]', 'nvarchar(max)') AS result
)
Is this how yours ended up?
No... that's even more clever than the one I found and posted. I'll have to give that I try. Thanks, Magoo.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 6, 2012 at 3:29 pm
Found this code for removing HTML tags from my data, but I am ashamed to say, I have no clue how it works. Could someone point me in the right direction so I can understand what it is doing?
create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
begin
declare @textXML xml
declare @result varchar(max)
set @textXML = @text;
with doc(contents) as
(
select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
)
select @result = contents.value('.', 'varchar(max)') from doc
return @result
end
go
What the heck is chunks.chunk.query?? I consider myself to be a Sr. DBA but this has me stumped. I figure it is a xml function of some type but can't seem to find any information on it. Any help greatly appreicated!!!
Jim
October 7, 2015 at 1:53 pm
I know it has been quite some time since this was posted, but I have a question. The code is looking for an EXACT match to the start tag. However, if you have additional info (e.g. <span style=...> or <div class=...> and </span> or </div> you get a parsing error.
Is there any way to avoid that? I've found some code that parses character by character but we know that is, like 🙁
October 7, 2015 at 8:06 pm
Jim Youmans-439383 (4/6/2012)
Found this code for removing HTML tags from my data, but I am ashamed to say, I have no clue how it works. Could someone point me in the right direction so I can understand what it is doing?create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
begin
declare @textXML xml
declare @result varchar(max)
set @textXML = @text;
with doc(contents) as
(
select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
)
select @result = contents.value('.', 'varchar(max)') from doc
return @result
end
go
What the heck is chunks.chunk.query?? I consider myself to be a Sr. DBA but this has me stumped. I figure it is a xml function of some type but can't seem to find any information on it. Any help greatly appreicated!!!
Jim
chunks.chunk is the name of the derived table and column name created in the FROM clause (as chunks(chunk)), that's all.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply