January 11, 2011 at 4:24 am
XML is not going to work with HTML as content unless you enclose it with CDATA tags, like so...
<Content>
<![CDATA[
<div id="contentcenter">
<p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>
</div>
]]>
</Content>
Steve
January 11, 2011 at 7:53 am
Your sample mainly contains display information. Get the data in XML format (without the display information) and you're all set with Waynes article.
Lutz,
Thank you for that link on the difference btw xml and html. Exactly what I needed.
There was actual data included in the xml file I was sent, but it was mixed in with all that other stuff. The person who sent me the file seemed pretty perplexed that I didn't know how to deal with it and load it into our database. With the comments I have received, I'm still not exactly sure.
January 11, 2011 at 9:28 am
Sorry for the delay guys:
DECLARE @x XML
SELECT @x= CONVERT(xml, BulkColumn, 2) FROM
OPENROWSET(BULK 'C:\data.xml', SINGLE_BLOB) AS x
January 11, 2011 at 9:34 am
SQL-DBA (1/11/2011)
XML is not going to work with HTML as content unless you enclose it with CDATA tags, like so...<Content>
<![CDATA[
<div id="contentcenter">
<p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>
</div>
]]>
</Content>
Steve
Steve,
Do you mean I need to go into the file and type in [CDATA[... where all HTML appears?
Amy
January 11, 2011 at 10:18 am
weharrelson (1/11/2011)
Sorry for the delay guys:DECLARE @x XML
SELECT @x= CONVERT(xml, BulkColumn, 2) FROM
OPENROWSET(BULK 'C:\data.xml', SINGLE_BLOB) AS x
Ahh, I see. You loading the file in as I showed in the article, except that you're taking it straight to a variable. Then you're using that variable in sp_xml_preparedocument.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
January 11, 2011 at 10:28 am
Amy.G (1/11/2011)
Steve,Do you mean I need to go into the file and type in [CDATA[... where all HTML appears?
Amy
Well, run this little test:
declare @xml1 XML, @xml2 XML;
set @xml1 = '<Content>
<![CDATA[
<div id="contentcenter">
<p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>
</div>
]]>
</Content>';
set @xml2 = '<Content>
<div id="contentcenter">
<p>Perform a <b>hardware</b> check with the utility to assure the quality of the system driver, etc., etc.</p>
</div>
</Content>';
select @xml1, @xml2;
@xml1 has the <![CDATA[ ... ]]>, and @xml2 doesn't.
Notice that in @xml2, that all of the HTML has become tokenized. From what I understand about HTML (very little), there are some tokens where the closing token is optional. If you have any of those in the XML file, SQL is probably going to throw an error.
As is stands, since the HTML is tokenized, it might not get in the way of extracting the data. You'd have to test it to see. Basically, if it can be put into an XML datatype (either variable or table column), then you will probably be able to work with it without having to add the CDATA stuff.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
January 11, 2011 at 2:17 pm
Here's a short example to query data like you've posted (please note that I added a few end tags to change it into a valid xml structure as per SQL requirements):
DECLARE @x XML
SET @x='<p>
<div id="content">
<a name="ada-content" />
<div id="browseBox">
<div class="roundedBox_top-left" />
<div class="roundedBox_top-right" />
<div class="licenseTitle">
<h1>
<span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Title[1]">Aboveground Storage Tank (AST) Permits (>1 Million Gallons Total)</span>
</h1>
<div class="licenseDescription">
<span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Description[1]">Minnesota Pollution Control Agency (MPCA)</span>
</div>
</div>
<div class="roundedBox_bottom-buff" />
</div>
<div class="roundedBox_bottom-left" />
<div class="roundedBox_bottom-right" />
<div id="licenseDetail">
<div id="licenseRightColumn">
<h3 style="margin:0;line-height:100%;">Agency contact information:</h3>
</div>
</div>
</div>
</p>'
;WITH cte AS
(
SELECT @x.query('//span') col
)
SELECT c.value('.','varchar(200)') span_result
FROM cte
CROSS APPLY col.nodes('span') T(c)
/* result set
Aboveground Storage Tank (AST) Permits (>1 Million Gallons Total)
Minnesota Pollution Control Agency (MPCA)
*/
January 11, 2011 at 2:37 pm
LutzM (1/11/2011)
Here's a short example to query data like you've posted
I'm really impressed that out of that whole mess you were able to identify the only actual piece of data needed. I have been practicing and got the license title like you did, but then just as I felt I had it licked, a snag:
DECLARE @y XML
SET @y = '
<h1>
<p xmlns:tcm="http://www.tridion.com/ContentManager/5.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/1999/xhtml">
Facilities that have more than one million gallons capacity must obtain an individual permit from the MPCA according to Minnesota Rules Chapter <a href="http://www.revisor.leg.state.mn.us/arule/7001/">7001</a>.4200-4250.
</br>
</br>
Owners of Aboveground Storage Tanks larger than 1,100 gallons must notify the Minnesota Pollution Control Agency (MPCA) of the existence of these tanks. In addition, tank owners are required to notify the MPCA of change of product or change of status of ASTs.
</h1>
'
SELECT
y.data.value('p[1]', 'varchar(500)') as Fees
from @y.nodes('h1') y(data);
I think the "xmlns"tcm..." is the problem, b/c when I delete that part, it works fine. But going through the whole file deleting would go against the whole point of this. Do you know a way to ignore it?
Looking back, I guess I was hoping someone would've said "whoever sent you that file is an idiot, it's impossible to read!". Now I'm getting obsessed with it.
Amy
January 11, 2011 at 3:22 pm
Ok, then we'd need to add the namespace declaration (result as above).
Regarding your comment about the person sending you the file, here's what I'd say:
Whoever sent you that file should remember that SQL Server is still a RDBMS and not a HTML garbage cleaner. But it can do that, too, if there's absolutely nothing else to do.
But, if all you're asking SQL Server to do is to deal with this kind of stuff, you should consider getting a different tool... 😉
The following code is like using a hammer to drive in a screw. :pinch: So, it's for demonstration purposes only, NOT a recommended solution! The recommended solution would be to get clean XML data.
DECLARE @x XML
SET @x=' <p xmlns:tcm="http://www.tridion.com/ContentManager/5.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/1999/xhtml">
<div id="content">
<a name="ada-content" />
<div id="browseBox">
<div class="roundedBox_top-left" />
<div class="roundedBox_top-right" />
<div class="licenseTitle">
<h1>
<span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Title[1]">Aboveground Storage Tank (AST) Permits (>1 Million Gallons Total)</span>
</h1>
<div class="licenseDescription">
<span id="_SE_FLD" _SE_FLD="tcm:Content/custom:Content/custom:Description[1]">Minnesota Pollution Control Agency (MPCA)</span>
</div>
</div>
<div class="roundedBox_bottom-buff" />
</div>
<div class="roundedBox_bottom-left" />
<div class="roundedBox_bottom-right" />
<div id="licenseDetail">
<div id="licenseRightColumn">
<h3 style="margin:0;line-height:100%;">Agency contact information:</h3>
</div>
</div>
</div>
</p>'
;WITH cte AS
(
SELECT @x.query('
declare default element namespace "http://www.w3.org/1999/xhtml";
//span') col
)
SELECT c.value('.','varchar(200)') span_result
FROM cte
CROSS APPLY col.nodes('declare default element namespace "http://www.w3.org/1999/xhtml"; span') T(c)
January 11, 2011 at 3:28 pm
The following code is like using a hammer to drive in a screw. :pinch: So, it's for demonstration purposes only, NOT a recommended solution! The recommended solution would be to get clean XML data.
I'm going to consider this permission to let it go and do other work.
And thanks for that article. It has proved very helpful.
Amy
January 11, 2011 at 4:12 pm
Amy.G (1/11/2011)
...I'm going to consider this permission to let it go and do other work.
And thanks for that article. It has proved very helpful.
Amy
I'm not sure if a forum post would qualify as permission to do other work. ;-):-D
But I'm glad I've been of some kind of help though. 🙂
January 11, 2011 at 10:58 pm
Hello all,
Well finally what we concluded then cannot we import the complex XML data file or what ?.
January 12, 2011 at 12:24 am
Anil Maharjan (1/11/2011)
Hello all,Well finally what we concluded then cannot we import the complex XML data file or what ?.
I think that the only conclusions we can draw are:
1. If the input is well-formed XML then we can import it using the given method.
2. HTML is NOT, in general, well-formed XML
3. (Contentious.) Well-formed XML elements can only contain HTML data if they are within a CDATA block.
Kelsey Thornton
MBCS CITP
February 7, 2011 at 2:34 pm
Supose you had to load 1 or 2 Gb of Xml data from one file into 2 tables. What you'd consider the best choice, OpenRowSet or BCP?
February 7, 2011 at 2:53 pm
brito.santos (2/7/2011)
Supose you had to load 1 or 2 Gb of Xml data from one file into 2 tables. What you'd consider the best choice, OpenRowSet or BCP?
It depends. But details should be discussed in a separate thread... 😉
Viewing 15 posts - 31 through 45 (of 60 total)
You must be logged in to reply to this topic. Login to reply