April 23, 2012 at 12:52 pm
Jon,
Sure, you could parse HTML that is already stored in a column.
I wrote a brief post[/url] on this sort of thing at http://sf.net/p/sqldom/discussion/general/thread/5bf1fdb6/
In a nutshell, the #spactDOMLoad procedure accepts an @HTML parameter that contains the HTML to be parsed: it can come from any source, including an existing row in a SQL table.
After calling #spactDOMLoad you could select the data you want to keep either with the #spgetDOM procedure or by directly selecting rows from the temporary tables #tblDOM, #tblDomAttribs, etc.
April 24, 2012 at 3:31 pm
Hello again,
I was curious if it would be possible to to change what it is your code is looking to capture after a tag that looks like this <OPTION selected>Text I want is here</OPTION>. I have ran your code against some test HTML for the form I was speaking about and I do see the option element in your tag column but the problem is it doesn't pick up the selected portion and place it into the tag column as well, which I use to determine which option the user has selected so that I may parse it. Also, when I use HTML: <Option Selected Id=55555 Name=Burr Oak> It fills in the tag column with a blank string into the name column but a NULL in the ID column. Also if I leave the Selected tag information out it not only puts a blank string in for the name column but also does the same for the ID field as well (see screenshot). If I could get it to somehow notice the Selected word inside the <Option Selected> tag and save the actual ID and Name in the result list I could use it to pull the couple pieces of information I need.
Also on a side note, how do you recommend actually running a where against the result set because I would need to look through hundreds of <option> tags to find the couple that say <option Selected> in it?
Thank you in advance.
April 24, 2012 at 10:33 pm
Normally attributes that are parsed by SQLDOM are stored in the table #tblDOMAttributes (which can be joined to table #tblDOM via the DEID column).
In HTML 4, "selected" is really supposed to be attributes like this:
<option selected="selected">My Option</option>
SQLDOM has not support parsing out attributes that are not in the form of "name=value". Consequently, SQLDOM has not parsed out the selected attribute in someting like:
<option selected>My Option</option>
Back in the old days (HTML 2, and possibly HTML 3.2), "selected" with no value was allowed.
Based on your request I modified SQLDOM and posted an updated version (.924) that parses out these old-fashioned quasi-attributes with no values. Please download version .924 from: https://sourceforge.net/projects/sqldom/files/
Once you parse your HTML with version .924 you will be able to retrieve the selected options with something like this:
EXEC #spactDOMLoad @HTML = '<html><body>Hello World.<option selected>Try this</option></body></html>'
SELECT
dom.Tag,
dom.ID,
txt.TextData,
attr.Name,
attr.Value
FROM
#tblDOM dom
LEFT JOIN #tblDOM txt ON
dom.DEID = txt.ParentDEID AND
txt.TextData IS NOT NULL
LEFT JOIN #tblDOMAttribs attr ON
dom.DEID = attr.DEID
April 26, 2012 at 3:40 pm
David,
Thank you very, very much. I should be able to pull off exactly what I am trying to do. I really do appreciate it. This could be very useful for future endeavors for me as well.
Thanks again,
Jon
December 28, 2014 at 12:10 am
Great article. We'll done sir!
I do, however, disagree a little with this line (emph. mine)...
But HTML is often not well-formed XML. Some tags are "singleton" tags--that do not require end tags...
Html 4.01+ (May 2000) is in fact well-formed XML. In other words, if it was developed or generated this century it is almost certainly well-formed. I don't know how much html 4.0 and ealier is still out there but it's certainly the exception and not the rule.
HTML 4.01 DOES require end tags or it can be empty. E.g. this is well formed: and so is this <p/>. This: <p> is neither legit html if its 4.01 or later, nor is it well-formed xml.
I am not trying to take anything away from your excellent article but felt that is worth calling out.
-- Itzik Ben-Gan 2001
December 29, 2014 at 9:50 am
In your article:
This will return a resultset containing the nodes that were parsed:
DEIDDocIDTagIDNameClassTextData
OpenTagStartPos
CloseTagEndPos
ParentDEID
HUIDSortHUIDDOMLevel
11htmlNULLNULLNULLNULL1119NULL111
21headNULLNULLNULLNULL105311.11.0000012
31titleNULLNULLNULLNULL184321.1.1000001.000001.0000013
41NULLNULLNULLNULLSample HTMLNULLNULL31.1.1.1000001.000001.000001.0000014
51bodyNULLNULLNULLNULL5711011.21.0000022
61divNULLNULLNULLNULL6710051.2.1000001.000002.0000013
71NULLNULLNULLNULLHelloNULLNULL61.2.1.1000001.000002.000001.0000014
81bNULLNULLNULLNULL825361.2.1.2000001.000002.000001.0000024
91NULLNULLNULLNULLWorldNULLNULL81.2.1.2.1000001.000002.000001.000002.0000015
for the row DEID
8
the OpenTagStartPos is 82 and the CloseTagEndPos is 53.
Is this a typo?
December 29, 2014 at 1:49 pm
for the row DEID 8
the OpenTagStartPos is 82 and the CloseTagEndPos is 53.
Is this a typo?
Yes..a typo. That row actually returns a br for the tag (not a b), and returns the correct start/end position.
(Copying results and formatting them by hand in markup for article publication is a manual process and subject to typos like this slipping through. Good catch!)
December 29, 2014 at 5:46 pm
This is a really cool exercise. However, it should remain simply that.
SQL is not HTML
HTML is not SQL
I can't believe nobody else has raised a warning flag on this.
Just because SQL Server enables this doesn't mean it is a good idea.
Do you know why this could be bad? What concerns or questions must a developer ask before falling back on something this niave?
It violates almost all of the SOLID principles for software development. SOLID is not only for object oriented coding.
You are mixing presentation, persistence and database layers all in a database engine. What are you going to do when your SQL Server, which costs a LOT of money, begins to have slower performance because it is Converting To/From HTML To/From SQL instead of managing database and database transactions?
Keep writing stuff like this. It keeps me in business rescuing applications designed this way!
December 29, 2014 at 7:02 pm
taylor_benjamin (12/29/2014)
This is a really cool exercise. However, it should remain simply that.SQL is not HTML
HTML is not SQL
I can't believe nobody else has raised a warning flag on this.
Just because SQL Server enables this doesn't mean it is a good idea.
Do you know why this could be bad? ...
Raised a warning flag on what exactly? I suggest that you re-read the first few paragraphs and re-examine your comment.
Nobody is suggesting that you store replace SQL with HTML or vise-versa. There are times when you need to extract data from HTML into your OLTP systems or data warehouse. This article provides an excellent way to do so when required.
-- Itzik Ben-Gan 2001
December 29, 2014 at 7:22 pm
Alan:
I think it is cool that you can do this using SQL.
There are so many better tools to use. Why would you use an engine tuned to manage set data to perform procedural code using MXLDOM? Yes, SQL Server expresses the contents as sets, with a lot of overhead.
If you were simply using a local instance of SQL Server to do this, it wouldn't be as bad. But this kind of design can kill a production SQL Server and keep it from doing what it needs to do efficiently; serve up data.
SSIS was designed to do this kind of thing. There are many other tools. Linq is also another wonderful tool for this kind of activity if you enjoy set like capabilities.
I don't see any warnings given in the editorial regarding why or where this should be used. Perhaps I need to go back and read it more thoroughly.
In my 30+ years of DBA experience, if this were done on an instance of SQL Server intended to store data, it is not a design that can scale. If that instance supports things that need to scale, don't implement this there. Put it on SQL Express, or an in memory instance. Don't put this on a production instance of SQL Server.
I've seen too many systems where databases have been sharded into multiple instances or massive servers were purchased because developers put things on the database server that could be done just as easily using tools that can scale. SQL Server doesn't scale out easily. So, a good rule of thumb is, if it isn't storing, manipulating, or retrieving data, there is probably a better place to do it than on your SQL Server.
I know that is a hard statement. But with a million war stories, many of my own (and my own making), demonstrating why this is a non-production solution.
Let me close by saying I really appreciate your style of writing. it is clear and accurate, easy to understand. Please don't let my criticism discourage you in any fashion. I look forward to seeing more.
December 30, 2014 at 11:38 am
I'll put my bug report here, and thanks for the work on this!
First, a not-a-bug query: This works fine on my Workstation, but fails to get the data from URLs if run on the server. There are no errors reported, so I assume the GET is not returning anything. I can browse to the page with IE on the server fine. Any thoughts on what Server/SQL/Firewall configuration needs to be done?
Now, SQLDOM bug and notes:
sdom.spactDOMClear is missing DELETE on the table #tblDOMHierarchy.
Not necessarily a bug, but tricky to work around what I found in a loop to get multiple URLs:
SELECT @UrlId = 1;
WHILE (@UrlId <= 5)
BEGIN
SELECT @URL = URL FROM #urls WHERE UrlId = @UrlId;
EXEC sdom.sputilGetHTTP
@URL = @URL,
@ResponseText = @HTML OUTPUT,
@SuppressResultset = 1;
-- ***** DOMLoad will error if there is a value in @DocID, even if @CreateNew = 1
SET @DocID = NULL;
EXEC sdom.spactDOMLoad @DocID = @DocID, @HTML = @HTML, @CreateNew = 1;
-- ***** DOMLoad apparently doesn't return @DocID value correctly
SET @DocID = @UrlId;
EXEC sdom.spgetDOM @DocID = @DocID, @SuppressResultSet = 1;
SELECT @UrlId = @UrlId + 1;
END;
Edit: Using SQLDOM_core_persist_927
So long, and thanks for all the fish,
Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3
December 31, 2014 at 9:42 am
Suggestion: Change the spgetInitSession to create the temp tables and then execute user code. This avoids pasting (potentially changeable) SQLDOM code into user's procedures and neatly resolves the temp table scope problem. Note: Set the owner of the sdom schema to an account with permissions to execute the OA_ procedures.
CREATE PROCEDURE sdom.spgetInitSession
@SQLToExecute nvarchar(MAX)
WITH EXECUTE AS OWNER
AS
BEGIN
SET NOCOUNT ON
IF OBJECT_ID('tempdb..#tblDOMDocs') IS NOT NULL BEGIN
DROP TABLE #tblDOMDocs
END
-- REMOVED THE REMAINING SQLDOM TEMP TABLE CREATION CODE
-- ...
-- Now execute the user code
EXECUTE sp_executesql @SQLToExecute;
END
GO
Usage:
DECLARE @SqlToExecute VARCHAR(MAX) =
'
DECLARE @HTML varchar(MAX);
DECLARE @DocID INT;
EXEC sdom.sputilGetHTTP
@URL = ''https://google.com'',
@ResponseText = @HTML OUTPUT,
@SuppressResultset = 0;
-- ***** DOMLoad will error if there is a value in @DocID, even if @CreateNew = 1
SET @DocID = NULL;
EXEC sdom.spactDOMLoad @DocID = @DocID, @HTML = @HTML, @CreateNew = 1;
-- ***** DOMLoad apparently does not return @DocID value
SET @DocID = 1;
EXEC sdom.spgetDOM @DocID = @DocID, @SuppressResultSet = 1;
SELECT * FROM #tblDOMHierarchy;
';
EXECUTE sdom.spgetInitSession @SqlToExecute;
So long, and thanks for all the fish,
Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3
December 31, 2014 at 12:44 pm
FYI, I actually have two variants of SQLDOM--one that creates permanent objects in the database (i.e the "persist" version), and one that creates only temporary objects.
The article really deals with the temporary-only flavor of SQLDOM.
However, the thinking behind the "persist" version is that in some user applications there is a SQL connection that stays alive and remains dedicated to the user session. In a scenario like this, at user session start-up you may want to instantiate the session-specific temporary tables, but have these persist for the life of the session. The only way to accomplish this is to have SQLDOM return the dynamic SQL to the caller so that the caller can execute it on the connection. (Otherwise, the scope of the temporary objects would be limited to the initialization call.)
In fact, I didn't really realize that this article was going to be re-published this month. I have several refinements and updates I have accumulated over the past year that I should post to the SourceForge project. I appreciate and am open to the suggestions, particularly if you and/or others have an interest in using or contributing to the project.
I might suggest that we move the discussion about enhancements to SourceForge, so as to leave this SQL Server Central thread to deal more specifically with the article. Does that make sense?
December 31, 2014 at 2:43 pm
Here's the answer: On the server, running in a job, results were returned but were getting trucated at 512 characters. I needed to add one statement before calling the sproc that gets the data to get the job to not truncate the data:
SET TEXTSIZE 2147483647
Russell Shilling (12/30/2014)
I'll put my bug report here, and thanks for the work on this!First, a not-a-bug query: This works fine on my Workstation, but fails to get the data from URLs if run on the server. There are no errors reported, so I assume the GET is not returning anything. I can browse to the page with IE on the server fine. Any thoughts on what Server/SQL/Firewall configuration needs to be done?
So long, and thanks for all the fish,
Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3
December 31, 2014 at 2:45 pm
Will do. I'll put me feedback there.
David Rueter (12/31/2014)
<snip>In fact, I didn't really realize that this article was going to be re-published this month. I have several refinements and updates I have accumulated over the past year that I should post to the SourceForge project. I appreciate and am open to the suggestions, particularly if you and/or others have an interest in using or contributing to the project.
I might suggest that we move the discussion about enhancements to SourceForge, so as to leave this SQL Server Central thread to deal more specifically with the article. Does that make sense?
So long, and thanks for all the fish,
Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3
Viewing 15 posts - 16 through 30 (of 31 total)
You must be logged in to reply to this topic. Login to reply