Automatic XML

Question

Automatic XML

Jeff Moden

SSC Guru

Points: 1004704
More actions
August 23, 2011 at 9:00 pm

#246862

I've been trying to teach myself XML and ran across a problem I just can't seem to find a way to solve easily.
First, I got the data (it's contained in a file in the example) and the XML from the following fine post by Mr. Plamen Ratchev...
http://pratchev.blogspot.com/2008/11/import-xml-file-to-sql-table.html
Here's the data example he gives...
<Products>
<Product>
<SKU>1</SKU>
<Desc>Book</Desc>
</Product>
<Product>
<SKU>2</SKU>
<Desc>DVD</Desc>
</Product>
<Product>
<SKU>3</SKU>
<Desc>Video</Desc>
</Product>
</Products>
Now, the next thing he does is a very clever shredding of the XML using the following code...
SELECT X.product.query('SKU').value('.', 'INT'),
X.product.query('Desc').value('.', 'VARCHAR(30)')
FROM (
SELECT CAST(x AS XML)
FROM OPENROWSET(
BULK 'C:\Products.xml',
SINGLE_BLOB) AS T(x)
) AS T(x)
CROSS APPLY x.nodes('Products/Product') AS X(product);
Although that works VERY well and does precisely what it was designed to do, it requires one to first figure out what the tags are in the XML file so that you can map the column names. It also assumes that each "row level" has an 'SKU' and a 'Desc' attribute. If the vendor providing the data suddenly add an attribute or, perhaps some "sub attributes" to (say) the SKU attribute, the code must be changed to match. For example, if the vendor added a 'Description' attribute as a sub to the 'SKU' attribute, the code would need to change to include the new 'Description' sub-attribute in the result set.
When generating XML from a table, you can tell it to automatically include everything in a given table by using FOR XML AUTO.
Is there a way to shred XML that would read the file, figure out what all the attributes are automatically, and display them in nice tidy result set (other than the OPENXML method that returns the "edge" table)? I sure could use an example and/or a link if there is because I'm just not finding such a thing in Google.
Thanks for your help, folks.
--Jeff Moden
RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
Change is inevitable... Change for the better is not.
Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply

R.P.Rozema SSChampion Points: 12317 More actions · Answer 1

I do know of a way to get the names of the available elements in the product element. But how to apply that to then retrieve those elements/attributes, that I don't know.

declare @xml xml;

select @xml = N'

<Products>

<Product>

<SKU>1</SKU>

<Desc>Book</Desc>

</Product>

<Product>

<SKU>2</SKU>

<Desc>DVD</Desc>

</Product>

<Product>

<SKU>3</SKU>

<Desc>Video</Desc>

</Product>

</Products>';

select distinct e.e.value('local-name(.)','varchar(128)')

from @xml.nodes('/Products/Product') p(p)

cross apply p.p.nodes('*') e(e);

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 2

Thanks for trying, R.P. That gives me a couple of ideas.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

R.P.Rozema SSChampion Points: 12317 More actions · Answer 3

B.t.w. in this post I gave some pointers on how to set up Kerberos delegation to make the openrowset function even more useful by making it read the documents from a share on f.e. your client without having to create a huge security hole on your server.

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

R.P.Rozema SSChampion Points: 12317 More actions · Answer 4

And a note on your example. I don't know if it was intentional but it seems you're getting the values from the elements a little to complicated: you don't need to first query() and then get the value, you can just get the value directly from the iterator. You just need to make sure to give it a single node to read from, not a (potential) set of nodes called 'SKU'. Hence the 'SKU[1]', meaning just the 1st element called 'SKU'.

In fact, the query('SKU') seems to return a single node, but actually what it does is concatenate all 'SKU' elements in the Product element and then return you those as a single node. i.e. if there were 2 SKU elements in a Product element, you would get the 2 values in one column concatenated. Probably not what you intended.

SELECT X.product.value('SKU[1]', 'INT'),

X.product.value('Desc[1]', 'VARCHAR(30)')

FROM (

SELECT CAST(x AS XML)

FROM OPENROWSET(

BULK 'C:\Products.xml',

SINGLE_BLOB) AS T(x)

) AS T(x)

CROSS APPLY x.nodes('Products/Product') AS X(product);

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

LutzM SSC Guru Points: 107049 More actions · Answer 5

Jeff,

here are some additional sources you might want to have a look at:

Peso@SSC -> a similar approach like the one R.P.Rozema suggested and

Jason Follas blog. Jasons approach is based on the FLWOR syntax.

Both solution won't detect attribute names/values though. Let's see, if there's a solution out there already that'll find its way to this thread. I might find some time to look at it later on today, too.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

R.P.Rozema SSChampion Points: 12317 More actions · Answer 6

Attributes can be found using @*. So the names of these attributes can be found like this:

declare @xml xml;

select @xml = N'

<root>

<node attr1="1" attr2="2"/>

<node attr1="1" attr3="3"/>

</root>

';

select a.a.value('local-name(.)','varchar(128)')

from @xml.nodes('/root/node') r(n)

cross apply r.n.nodes('@*') a(a)

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

LutzM SSC Guru Points: 107049 More actions · Answer 7

R.P.Rozema (8/24/2011)
Attributes can be found using @*. So the names of these attributes can be found like this:
declare @xml xml;
select @xml = N'
<root>
<node attr1="1" attr2="2"/>
<node attr1="1" attr3="3"/>
</root>
';
select a.a.value('local-name(.)','varchar(128)')
from @xml.nodes('/root/node') r(n)
cross apply r.n.nodes('@*') a(a)

That solves he puzzle only partially. What if we don't know the element where an attribute might have been added?

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

R.P.Rozema SSChampion Points: 12317 More actions · Answer 8

like this?

declare @xml xml;

select @xml = N'

<root>

<node attr1="1" attr2="2"/>

<parent attr3="3">

<node attr4="4" attr5="5"/>

</parent>

</root>

';

select a.a.value('local-name(.)','varchar(128)')

from @xml.nodes('/root/node') r(n)

cross apply r.n.nodes('//@*') a(a)

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 9

R.P.Rozema (8/23/2011)
B.t.w. in this post I gave some pointers on how to set up Kerberos delegation to make the openrowset function even more useful by making it read the documents from a share on f.e. your client without having to create a huge security hole on your server.

That will certainly turn out to be useful. Thanks for the extra leg-up, R.P.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 10

LutzM (8/24/2011)
Jeff,
here are some additional sources you might want to have a look at:
Peso@SSC -> a similar approach like the one R.P.Rozema suggested and
Jason Follas blog. Jasons approach is based on the FLWOR syntax.
Both solution won't detect attribute names/values though. Let's see, if there's a solution out there already that'll find its way to this thread. I might find some time to look at it later on today, too.

Thanks for the response, Lutz. I'll take a look at the links you provided.

Heh... since I've not been able to find anything that actually does what I want it to do, I may just have to write a method to do it and follow it up with an article of my own. 😛

I know it's a strange request but it's easier for me to handle a relatively "unknown" result set than it is to pre-inspect and write specific XML for each of the files that I know I'm going to have to deal with in an upcoming project.

As a side bar... I'm truly disappointed in XML in general. I thought the whole purpose of transmitting tags was so that you'd automatically "know" what's in the file. I realize I'm just getting into it but I currently see no way to identify columnar datatypes nor easily do what I asked. Worse yet, no real space is saved over a denormalized delimited file and the hierarchical nature of XML data is a real PITA compared to denormalized delimited files if you want to do something outside the proverbial box.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 11

LutzM (8/24/2011)
Jeff,
here are some additional sources you might want to have a look at:
Peso@SSC -> a similar approach like the one R.P.Rozema suggested and
Jason Follas blog. Jasons approach is based on the FLWOR syntax.
Both solution won't detect attribute names/values though. Let's see, if there's a solution out there already that'll find its way to this thread. I might find some time to look at it later on today, too.

The approaches taken by "Jason Follas" and "Ken Simmons" (on the other link) seem, at first glantz, to be real close to what I want to do. I'll take a look at these in some detail tonight. Thanks for the links, Lutz.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

LutzM SSC Guru Points: 107049 More actions · Answer 12

Any chance to post a sample or two and your expected output?

You could also drop me a PM / email if you like.

The data type issue is definitely something a standard XML file will not cover (attribute values will always be quoted, even if numeric and element values may not need to be quoted, even if varchar). Data size is also something XML doesn't really care about...

The best advice would be to ask your client to provide the XSD schema file together with the xml data. It'll include all you need to know to extract the data without actually looking into the xml file. The schema will include tag names, data type, general structure. And it's (usually) easy to get the data back into a relational model.

When creating XML data off SQL Server, one could use the XMLSCHEMA option to include the XSD schema in the XML file.

There are several sources to create an xsd schema for a given xml file (google: "generate xsd from xml").

This might be an easier approach than writing a new XML shredder. But it sounds like fun though... 😎

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 13

Thanks, Lutz. I'll put a couple of samples together tonight.

Understood on the XSD schema files but the clients are actually relatively ignorant of XML and SQL. They're using different 3rd party apps (the makers of which also appear to be relatively ignorant of XML :-P) to generate the XML. The "XML exports" these apps do seem to have been more of an after-thought rather than anything well planned. With that thought in mind, there's probably no chance of getting an XSD file out of them (although I WILL ask).

I also concerned a bit about the performance of "FOR $node" mentioned in some of the links you proivided... Microsoft hasn't done well in the area of Recursive CTE's nor HierarchyID code nore the enhancements to the OVER clause in CTP3 of Denalli for performance. The concern comes from knowing that many of the files I'll be receiving will have some very wide "MAX" elements and millions of "records" never mind actual rows.

I'll give it all a whirl tonight as well as providing you with some simplified examples of the output I'm trying to get.

Thanks for your help.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

R.P.Rozema SSChampion Points: 12317 More actions · Answer 14

Are you sure you need a totally dynamic parser for your data? In the cases I've seen so far, there were indeed -over time- changes in the format of files delivered and I would hate to have to rewrite the code to read those documents with every such change. For this reason what I do is I create some meta data structure that describes the format I need to be able to parse -at that time-. Then I generate from/using this structure the code to process the data. This results in fast, static processing code (i.e. doesn't need to be (re-)compiled every process pass) between changes in the meta data. Plus, I don't need to rewrite the entire code every time a change is implemented in the document. Plus, I can make improvements on the entire code set very fast by adapting the generator and re-generating the entire code set from the meta data.

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?