Removing tags stored in a ntext field

  • Hello everyone

    I have a table with an ntext field that contains many tags of different types. I want to preserve the text and remove the tags when I retrieve the data with a select. I don’t want to update the table, just select the text without the tags.

    For example:

    '<div> We would like the comments box to be available </div> '

    Would become:

    ‘We would like the comments box to be available’

    The only tool I have found searching this site and others is a UDF and of course I want to avoid being chastised for going RBAR so I thought I’d come to you folks

    I have attached some sample data. If you can point me in a direction, it would be much appreciated.

    create table #damnTags

    (

    NtextField ntext

    )

    insert #damnTags

    select '<div> We would like the comments box to be available on the Bed Check at all times,

    even if a client is not marked Present. Thanks, Tim</div> '

    union

    select '<div> <b>Edited by David Orez on Monday, April 30, 2012 at 9:28 AM</b></div> <div> Payroll Utility</div>'

    select * from #damnTags

  • Do you want to remove the tags in the data, or remove them once you've selected things?

  • Only in the query. I don't want to update the table.

  • In this case, as far as I can see, a UDF is your most straightforward solution. I quite like this one:

    create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as

    begin

    declare @textXML xml

    declare @result varchar(max)

    set @textXML = @text;

    with doc(contents) as

    (

    select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)

    )

    select @result = contents.value('.', 'varchar(max)') from doc

    return @result

    end

    as it at least uses some fancy XML (original here.)

    If your HTML is not well formed (in which case the above function would fail), your best performing solution is likely to be a CLR/Regex-type thing.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Thanks much, Phil for both the code and the link.

    I was leaning toward CLR but this is a third party database and makes me a bit nervous. I can't see what harm it would do but making this kind of an addition/change is the type of thing that can cause finger pointing should strange issues arise (No wonder you're having trouble with our app! Who knows what else you done to the database!!!).

    I believe I will request the blessings of the powers that be and head down that road if allowed. Maybe I can place the CLR function in one database and call it from another.

  • There should be no issue at all if you go down the path you suggest in your last sentence. You are not updating data, merely reading it.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Hi Phil

    I tested the UDF and it works great. But I am having a tough time wrapping my brain around or uncovering what is going on here. You have a CTE whose output is being stripped of the tags and returned by the function but I'm not able to see deeper than that.

    Regarding " select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)", it seems to be turning the string into XML but I can't see where chunk.chunk.QUERY come from (looks like a function). And is "nodes" an attribute of an xml string that can be used as a function?

    Regarding " contents.value('.', 'varchar(max)') from doc", it looks like contents.value refers to the value returned by the CTE but how do the supplied parameters strip the xml string of its tags.

    I'm hoping you or someone else has the time to explain this to me.

  • Yeah, you'll be needing a SQL/XML ninja for that explanation:w00t: I only found that function today while hunting for a solution to your problem and I haven't yet got to the bottom of how it works either. I need to do some dissection.

    I do have a slight head start on you though. There are a whole load of XML-specific methods available to you in SQL Server 2005+ (nodes and query both fall into this category). See here for some info and welcome to a new world: methods in T-SQL, who'd have thought?

    Have fun - I'll post back if I do that analysis.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • Oky Doky.

    Just got back from vacation and wanted to get to this ASAP. I am not optimizing this query, just getting the desired results for now.

    First, no UDF needed. Second, there is a specific best practice I will ignore here; I will call it out in the end.

    If your NtextField was stored in XML format you could do this:

    IF OBJECT_ID('tempdb..#dangTags') IS NOT NULL

    DROP TABLE #dangTags

    CREATE TABLE #dangTags

    (

    NtextField XML

    )

    insert #dangTags

    select '<div> We would like the comments box to be available on the Bed Check at all times,

    even if a client is not marked Present. Thanks, Tim</div> '

    union

    select '<div> <b>Edited by David Orez on Monday, April 30, 2012 at 9:28 AM</b></div> <div> Payroll Utility</div>'

    select CAST((NtextField.query('//text()')) AS NVARCHAR(max)) AS NtextField

    from #dangTags

    Since it isn't you would do this

    IF OBJECT_ID('tempdb..#dangTags') IS NOT NULL

    DROP TABLE #dangTags

    CREATE TABLE #dangTags

    (

    NtextField ntext

    )

    insert #dangTags

    select '<div> We would like the comments box to be available on the Bed Check at all times,

    even if a client is not marked Present. Thanks, Tim</div> '

    union

    select '<div> <b>Edited by David Orez on Monday, April 30, 2012 at 9:28 AM</b></div> <div> Payroll Utility</div>'

    select CAST((CAST(NtextField AS XML).query('//text()')) AS NVARCHAR(max)) AS NtextField

    from #dangTags

    The only difference is...

    This:

    CAST((NtextField.query('//text()')) AS NVARCHAR(max)) AS NtextField

    is now this:

    CAST((CAST(NtextField AS XML).query('//text()')) AS NVARCHAR(max)) AS NtextField

    NTextField.query was changed to

    CAST(NtextField AS XML).query

    Bad habit:

    When doing XPath (within T-SQL or elsewhere) never use the "//" unless you absolutely must.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Aha: XMLSQLNinja - just what the doctor ordered! 🙂

    That's a cool solution - can you explain how it works?

    And if you've got time, perhaps you could also have a go at explaining the 'dog-food' (chunks.chunk) solution?

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • XMLSQLNinja, thank for your time and alternative approach. I'm afraid that the garbage in my data is not well formed and this method stumbles as a result as does the previous one. Regardless, my thanks for your time.

  • jshahan (7/25/2012)


    XMLSQLNinja, thank for your time and alternative approach. I'm afraid that the garbage in my data is not well formed and this method stumbles as a result as does the previous one. Regardless, my thanks for your time.

    That's a bummer.

    In the future consider using the XML datatype. If you can figure out what records are bad and get them fixed then you change that NText column to XML (e.g. export the contents of the target table somewhere, recreate the table using the XML datatype instead of NText, then import the data back in.)

    Once you have an XML column no XML data can be added unless it is well-formed. Note the code below:

    DECLARE @x table

    (

    ID int IDENTITY(1,1),

    xmldata XML

    )

    INSERT INTO @x VALUES ('<xxx>blah blah</xxx>'),('<xxx><yyy>blah blah</yyy></xxx>')

    SELECT * FROM @x

    -- All good up to this point. Now try bad XML:

    INSERT INTO @x VALUES ('<xyz>blah blah</abc>')

    -- This will produce an error

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • It is a third party package overwhich I had no control but even if I did, I didn't know that the xml datatype was capable of that type of constraint. Thanks for the education.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply