September 7, 2004 at 10:33 pm
I just took on porting some business logic into the DB for a way behind project. Im the first 'db' person on the project and they have all their business logic strewn out all over thier web pages. Anyway... me and my big mouth. Look what they did to store a list for processing later in ASP. They put an HTML table in a column! Pretty clever work around for building db table. Again I assume that this is to pull the value into an html table to process later. I need to parse this thing into a db table... Take a look. ( I had to replace '<' with '<>' to make it display like HTML )
<>>table border=0 cellspacing=2><>tr><>td width=20> <>/td><>td><>b>1. Cincinnati Reds at Houston Astros - Houston Astros -320<>br>2. Montreal Expos at Chicago Cubs - Montreal Expos +270<>br>3. San Francisco Giants at Colorado Rockies - San Francisco Giants -190<>br>4. San Francisco Giants at Colorado Rockies - Under 8.5<>br><>/b><>/td><>/tr><>/table>
There can be from 2 or more entries. This one shows four. Does anyone have any ides how to parse this with TSQL. I recently read that TSQL has the worst reg ex support of all the major vendors. I am comming off an Oracle project and cant find a way to do this. Let me know what you think. Reagards and thanks in advance for any help
September 8, 2004 at 12:21 am
I assume this conversion thing is a one-time only deal? (if not, you definetly do NOT want to parse this junk in Transact SQL every time)
Anyways, true - Transact SQL doesn't deal well, or rather "easy" with this kind of string manipulations, it's not what the language is meant for. However, with some creativity one can get pretty far anyway - it mostly depends on how you attack your problem..
Is this example a single row?
What should the expected result(s) look like?
Table DDL where it should "fit" when done?
/Kenneth
September 8, 2004 at 3:39 am
If this is a one-off job, you can get quite a lot of the way there by extracting all the html code into a single file, open it with your favourite browser, select everything and paste it into notepad - that'll get rid of the markup.
September 9, 2004 at 6:52 am
If this is a one-time job, you can write an ASP page/VB app/whatever to grab all of these "crap" records into a recordset, parse through each one populating the HTML into an MSXML DOMDocument object, then treat the data as XML, extracting the values you need using XPath statements. Probably the cleanest way I would think and pretty much 100% automated. My 2 cents... Steve
September 9, 2004 at 7:45 am
This code is a bit long, but should work as long as the HTML is each row is formatted as you previously specified. I think the key thing is that you don't need to fully parse the HTML, just get the data out of it.
-- This code requires stored procedure uspGetToken be created first. See below.
DROP TABLE tdata
GO
CREATE TABLE tdata
(
tid int IDENTITY(1,1),
thtml varchar(8000)
)
SET NOCOUNT ON
INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds at Houston Astros - Houston Astros -320<br>2. Montreal Expos at Chicago Cubs - Montreal Expos +270<br>3. San Francisco Giants at Colorado Rockies - San Francisco Giants -190<br>4. San Francisco Giants at Colorado Rockies - Under 8.5<br></b></td></tr></table>' )
INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds2 at Houston Astros2 - Houston Astros2 -320<br>2. Montreal Expos2 at Chicago Cubs2 - Montreal Expos2 +270<br>3. San Francisco Giants2 at Colorado Rockies2 - San Francisco Giants2 -190<br>4. San Francisco Giants2 at Colorado Rockies2 - Under 2.5<br></b></td></tr></table>' )
INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds3 at Houston Astros3 - Houston Astros3 -320<br>2. Montreal Expos3 at Chicago Cubs3 - Montreal Expos3 +270<br>3. San Francisco Giants3 at Colorado Rockies3 - San Francisco Giants3 -190<br>4. San Francisco Giants3 at Colorado Rockies3 - Under 3.5<br></b></td></tr></table>' )
INSERT tdata (thtml) VALUES ( '<table border=0 cellspacing=2><tr><td width=20></td><td><b>1. Cincinnati Reds4 at Houston Astros4 - Houston Astros4 -320<br>2. Montreal Expos4 at Chicago Cubs4 - Montreal Expos4 +270<br>3. San Francisco Giants4 at Colorado Rockies4 - San Francisco Giants4 -190<br>4. San Francisco Giants4 at Colorado Rockies4 - Under 4.5<br></b></td></tr></table>' )
DECLARE @tid int, @thtml varchar(8000)
DECLARE @s-2 varchar(8000), @d varchar(100), @token varchar(8000), @linetext varchar(8000)
DECLARE @lineNum int
CREATE TABLE #temp
(
tid int,
lineNum int,
txt varchar(8000),
PRIMARY KEY (tid, lineNum)
)
DECLARE cSrc CURSOR
FOR SELECT tid, thtml FROM tdata
FOR READ ONLY
OPEN cSrc
/*
EXPECTED FORMAT:
<table border=0 cellspacing=2>
<tr>
<td width=20>
</td>
<td>
<b>
1. Cincinnati Reds at Houston Astros - Houston Astros -320
<br>2. Montreal Expos at Chicago Cubs - Montreal Expos +270
<br>3. San Francisco Giants at Colorado Rockies - San Francisco Giants -190
<br>4. San Francisco Giants at Colorado Rockies - Under 8.5
<br>
</b>
</td>
</tr>
</table>'
*/
SET NOCOUNT ON
FETCH cSrc INTO @tid, @thtml
WHILE @@FETCH_STATUS = 0
BEGIN
-- string must begin with '<table'
IF Left(@thtml, 6) <> '<table'
BEGIN
PRINT 'ERROR - text not in expected format.'
END
ELSE
BEGIN
-- if all these data begin with a BOLD tag <b>, first step is just to strip everything up to that point
EXEC uspGetToken @thtml OUTPUT, '<b>', @token OUTPUT
-- next, get everything up to the ending </b> tag
EXEC uspGetToken @thtml OUTPUT, '</b>', @token OUTPUT
-- now, break up @token into separate rows, each of which is followed by a '<br>' tag
SET @lineNum = 0
WHILE @token <> ''
BEGIN
EXEC uspGetToken @token OUTPUT, '<br>', @linetext OUTPUT
IF RTrim(@linetext) > ''
BEGIN
SET @lineNum = @lineNum + 1
INSERT #temp (tid, lineNum, txt) VALUES ( @tid, @lineNum, @linetext )
END
END -- WHILE
END -- ELSE
FETCH cSrc INTO @tid, @thtml
END -- WHILE
CLOSE cSrc
DEALLOCATE cSrc
SELECT * FROM #temp
DROP TABLE #temp
-- *****************************************************************************************
-- *****************************************************************************************
DROP PROCEDURE uspGetToken
GO
CREATE PROCEDURE uspGetToken
(
@parm varchar(1000) OUTPUT,
@delim varchar(100),
@token varchar(1000) OUTPUT
)
AS
SET NOCOUNT ON
DECLARE @spos int
SET @spos = CHARINDEX( @delim , @parm, 1 )
IF @spos = 0
BEGIN
SET @token = @parm
SET @parm = ''
END
ELSE
BEGIN
SET @token = SubString( @parm, 1, @spos - 1)
SET @parm = Substring( @parm, @spos + Len(@delim), Len(@parm) - @spos )
END
GO
September 9, 2004 at 8:59 am
Might this help?
http://www.sqlservercentral.com/scripts/contributions/1078.asp
Haven't used it meself, just saw it one day and saved it.
Accesses a vbscript regexp object, which I have used from VB code with great success.
Good luck.
Vik
September 9, 2004 at 12:58 pm
One other suggestion... if you can display all the data in MS Internet Explorer & paste into Excel, you can probably import from MS Excel to MS SQL after that.
Viewing 7 posts - 1 through 6 (of 6 total)
You must be logged in to reply to this topic. Login to reply