Oh man! I cant beleive he did this

  • I just took on porting some business logic into the DB for a way behind project. Im the first 'db' person on the project and they have all their business logic strewn out all over thier web pages. Anyway... me and my big mouth. Look what they did to store a list for processing later in ASP. They put an HTML table in a column! Pretty clever work around for building db table. Again I assume that this is to pull the value into an html table to process later. I need to parse this thing into a db table... Take a look. ( I had to replace '<' with '<>' to make it display like HTML )

    <>>table border=0 cellspacing=2><>tr><>td width=20> <>/td><>td><>b>1. Cincinnati Reds at Houston Astros - Houston Astros -320<>br>2. Montreal Expos at Chicago Cubs - Montreal Expos +270<>br>3. San Francisco Giants at Colorado Rockies - San Francisco Giants -190<>br>4. San Francisco Giants at Colorado Rockies - Under 8.5<>br><>/b><>/td><>/tr><>/table>

    There can be from 2 or more entries. This one shows four. Does anyone have any ides how to parse this with TSQL. I recently read that TSQL has the worst reg ex support of all the major vendors. I am comming off an Oracle project and cant find a way to do this. Let me know what you think. Reagards and thanks in advance for any help

  • I assume this conversion thing is a one-time only deal? (if not, you definetly do NOT want to parse this junk in Transact SQL every time)

    Anyways, true - Transact SQL doesn't deal well, or rather "easy" with this kind of string manipulations, it's not what the language is meant for. However, with some creativity one can get pretty far anyway - it mostly depends on how you attack your problem..

    Is this example a single row?

    What should the expected result(s) look like?

    Table DDL where it should "fit" when done?

    /Kenneth

     

  • If this is a one-off job, you can get quite a lot of the way there by extracting all the html code into a single file, open it with your favourite browser, select everything and paste it into notepad - that'll get rid of the markup.

  • If this is a one-time job, you can write an ASP page/VB app/whatever to grab all of these "crap" records into a recordset, parse through each one populating the HTML into an MSXML DOMDocument object, then treat the data as XML, extracting the values you need using XPath statements.  Probably the cleanest way I would think and pretty much 100% automated.  My 2 cents...  Steve

  • This code is a bit long, but should work as long as the HTML is each row is formatted as you previously specified. I think the key thing is that you don't need to fully parse the HTML, just get the data out of it.

    -- This code requires stored procedure uspGetToken be created first. See below.

    DROP TABLE tdata

    GO

    CREATE TABLE tdata

    (

      tid int IDENTITY(1,1),

      thtml varchar(8000)

    )

    SET NOCOUNT ON

    INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds at Houston Astros - Houston Astros -320<br>2. Montreal Expos at Chicago Cubs - Montreal Expos +270<br>3. San Francisco Giants at Colorado Rockies - San Francisco Giants -190<br>4. San Francisco Giants at Colorado Rockies - Under 8.5<br></b></td></tr></table>' )

    INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds2 at Houston Astros2 - Houston Astros2 -320<br>2. Montreal Expos2 at Chicago Cubs2 - Montreal Expos2 +270<br>3. San Francisco Giants2 at Colorado Rockies2 - San Francisco Giants2 -190<br>4. San Francisco Giants2 at Colorado Rockies2 - Under 2.5<br></b></td></tr></table>' )

    INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds3 at Houston Astros3 - Houston Astros3 -320<br>2. Montreal Expos3 at Chicago Cubs3 - Montreal Expos3 +270<br>3. San Francisco Giants3 at Colorado Rockies3 - San Francisco Giants3 -190<br>4. San Francisco Giants3 at Colorado Rockies3 - Under 3.5<br></b></td></tr></table>' )

    INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds4 at Houston Astros4 - Houston Astros4 -320<br>2. Montreal Expos4 at Chicago Cubs4 - Montreal Expos4 +270<br>3. San Francisco Giants4 at Colorado Rockies4 - San Francisco Giants4 -190<br>4. San Francisco Giants4 at Colorado Rockies4 - Under 4.5<br></b></td></tr></table>' )

    DECLARE @tid int, @thtml varchar(8000)

    DECLARE @s-2 varchar(8000), @d varchar(100), @token varchar(8000), @linetext varchar(8000)

    DECLARE @lineNum int

    CREATE TABLE #temp

    (

      tid int,

      lineNum int,

      txt varchar(8000),

      PRIMARY KEY (tid, lineNum)

    )

    DECLARE cSrc CURSOR

    FOR SELECT tid, thtml FROM tdata

    FOR READ ONLY

    OPEN cSrc

    /*

    EXPECTED FORMAT:

    <table border=0 cellspacing=2>

      <tr>

        <td width=20>

        </td>

        <td>

          <b>

          1. Cincinnati Reds at Houston Astros - Houston Astros -320

          <br>2. Montreal Expos at Chicago Cubs - Montreal Expos +270

          <br>3. San Francisco Giants at Colorado Rockies - San Francisco Giants -190

          <br>4. San Francisco Giants at Colorado Rockies - Under 8.5

          <br>

          </b>

        </td>

      </tr>

    </table>'

    */

    SET NOCOUNT ON

    FETCH cSrc INTO @tid, @thtml

    WHILE @@FETCH_STATUS = 0

    BEGIN

      -- string must begin with '<table'

      IF Left(@thtml, 6) <> '<table'

        BEGIN

          PRINT 'ERROR - text not in expected format.'

        END

      ELSE

        BEGIN

          -- if all these data begin with a BOLD tag <b>, first step is just to strip everything up to that point

          EXEC uspGetToken @thtml OUTPUT, '<b>', @token OUTPUT

          -- next, get everything up to the ending </b> tag

          EXEC uspGetToken @thtml OUTPUT, '</b>', @token OUTPUT

          -- now, break up @token into separate rows, each of which is followed by a '<br>' tag

          SET @lineNum = 0

          WHILE @token <> ''

          BEGIN

            EXEC uspGetToken @token OUTPUT, '<br>', @linetext OUTPUT

         

            IF RTrim(@linetext) > ''

            BEGIN

              SET @lineNum = @lineNum + 1

              INSERT #temp (tid, lineNum, txt) VALUES ( @tid, @lineNum, @linetext )

            END

          END -- WHILE

        END -- ELSE

      FETCH cSrc INTO @tid, @thtml

    END -- WHILE

    CLOSE cSrc

    DEALLOCATE cSrc

    SELECT * FROM #temp

    DROP TABLE #temp

    -- *****************************************************************************************

    -- *****************************************************************************************

    DROP PROCEDURE uspGetToken

    GO

    CREATE PROCEDURE uspGetToken

    (

      @parm varchar(1000) OUTPUT,

      @delim varchar(100),

      @token varchar(1000) OUTPUT

    )

    AS

    SET NOCOUNT ON

    DECLARE @spos int

    SET @spos = CHARINDEX( @delim , @parm, 1 )

    IF @spos = 0

      BEGIN

        SET @token = @parm

        SET @parm = ''   

      END

    ELSE

      BEGIN

        SET @token = SubString( @parm, 1, @spos - 1)

        SET @parm = Substring( @parm, @spos + Len(@delim), Len(@parm) - @spos )

      END

    GO

     

  • Might this help?

    http://www.sqlservercentral.com/scripts/contributions/1078.asp

    Haven't used it meself, just saw it one day and saved it. 

    Accesses a vbscript regexp object, which I have used from VB code with great success.

    Good luck.

    Vik

  • One other suggestion... if you can display all the data in MS Internet Explorer & paste into Excel, you can probably import from MS Excel to MS SQL after that.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply