Equivalent of "Left Join" but in XML

  • I'm pretty good at most things in T-SQL but I'm all thumbs when it comes to XML. I've never written any XML code except to do the occasional CSV concatenation so I could definitely use some help here.

    First, here's a variable that contains some XML data. To summarize, there are 4 customers and only 2 of them have placed some orders...

    DECLARE @XmlData XML;

    SET @XmlData = '
    <ROOT>
      <Customer CustomerID="1" FullName="Someone 1">
        <Order OrderID="5" OrderDate="2011-01-01">
          <OrderDetail ProductID="11" Quantity="12"/>
          <OrderDetail ProductID="42" Quantity="10"/>
          <OrderDetail ProductID="53" Quantity="8"/>
        </Order>
      </Customer>
      <Customer CustomerID="2" FullName="Someone 2">
      </Customer>
      <Customer CustomerID="3" FullName="Someone 3">
      </Customer>
      <Customer CustomerID="4" FullName="Someone 4">
        <Order OrderID="6" OrderDate="2011-02-02">
          <OrderDetail ProductID="72" Quantity="3"/>
        </Order>
      </Customer>
    </ROOT>';

    Parsing the data doesn't seem to be a real problem. When I run the following, I get pretty close to what I want...

    SELECT t.FullRow.value('(../../@CustomerID)', 'INT')         AS CustomerID,
           t.FullRow.value('(../../@FullName)',   'VARCHAR(20)') AS FullName,
           t.FullRow.value('(../@OrderID)',       'INT')         AS OrderID,
           t.FullRow.value('(../@OrderDate)',     'DATETIME')    AS OrderDate,
           t.FullRow.value('(@ProductID)',        'INT')         AS ProductID,
           t.FullRow.value('(@Quantity)',         'INT')         AS Quantity
    FROM @XmlData.nodes('ROOT/Customer/Order/OrderDetail') AS t(FullRow)
    ;

    That very nicely gives me...

    CustomerID FullName  OrderID OrderDate               ProductID Quantity
    ---------- --------- ------- ----------------------- --------- --------
             1 Someone 1       5 2011-01-01 00:00:00.000        11       12
             1 Someone 1       5 2011-01-01 00:00:00.000        42       10
             1 Someone 1       5 2011-01-01 00:00:00.000        53        8
             4 Someone 4       6 2011-02-02 00:00:00.000        72        3

    The reason I say "pretty close" is because that returns only the customers that have orders. I'd like to return the customers that don't have any orders so that I get the following...

    CustomerID FullName  OrderID OrderDate               ProductID Quantity
    ---------- --------- ------- ----------------------- --------- --------
             1 Someone 1       5 2011-01-01 00:00:00.000        11       12
             1 Someone 1       5 2011-01-01 00:00:00.000        42       10
             1 Someone 1       5 2011-01-01 00:00:00.000        53        8
             4 Someone 4       6 2011-02-02 00:00:00.000        72        3
             3 Someone 3    NULL NULL                         NULL     NULL
             2 Someone 2    NULL NULL                         NULL     NULL

    It doesn't exactly take a rocket scientist to brute-force the correct answer, as I've done with the code below...

    WITH
    cteCustomer AS
    (
        SELECT t.Customer.value('@CustomerID', 'INT')         AS CustomerID,
               t.Customer.value('@FullName',   'VARCHAR(20)') AS FullName
        FROM @XmlData.nodes('ROOT/Customer') AS t(Customer)
    ),
    cteFullRow AS
    (
        SELECT t.FullRow.value('(../../@CustomerID)', 'INT')  AS CustomerID,
               --t.FullRow.value('(../../@FullName)', 'VARCHAR(20)') AS FullName,
               t.FullRow.value('(../@OrderID)',   'INT')      AS OrderID,
               t.FullRow.value('(../@OrderDate)', 'DATETIME') AS OrderDate,
               t.FullRow.value('(@ProductID)',    'INT')      AS ProductID,
               t.FullRow.value('(@Quantity)',     'INT')      AS Quantity
        FROM @XmlData.nodes('ROOT/Customer/Order/OrderDetail') AS t(FullRow)
    )
    SELECT c.CustomerID,
           c.FullName,
           f.OrderID,
           f.OrderDate,
           f.ProductID,
           f.Quantity
    FROM cteCustomer c
    LEFT OUTER JOIN cteFullRow f
         ON c.CustomerID = f.CustomerID
    ;

    That, however, seems like real overkill. I've had to query the XML structure twice to get data (the information for the customers that have no orders) that I believe we should be able to get in a single pass.

    The reason why I need the customers that have no orders is simple... I need to show the customers that have no orders as well as the ones that do. 😉

    The reason why I need this in a denormalized format might not be so obvious... I need to populate 3 normalized tables (Customer, OrderHeader, and OrderDetail), and the denormalized table I've created from the XML provides all of the necessary references to populate those 3 normalized tables. I don't need help with that but thought I'd explain just in case someone wanted to know "why".

    So, back to my question... considering the XML data that I've provided, is there any way to get the output (which I've also provided) without having to interrogate the XML twice using a LEFT OUTER JOIN? It's certainly OK if the answer is "No"... I'm just going to hate XML a little more if that's the answer. 😛

    Thanks for the help, folks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • OUTER APPLY is your friend here, too 😀

    Please note I decided to use some totally meaningless alias names so you can easily identify the code that's not your original 😎

    This code will also perform significantly better than the "climb up each node" approach. But I expect you'll perform the million row test anyway 😉

    MERRY Xmas to you and your family!!!!

    SELECT t.FullRow.value('(@CustomerID)', 'INT')         AS CustomerID,
           t.FullRow.value('(@FullName)',   'VARCHAR(20)') AS FullName,
           x.y.value('(@OrderID)',   'INT')      AS OrderID,
           x.y.value('(@OrderDate)', 'DATETIME') AS OrderDate,
           u.v.value('(@ProductID)', 'INT')      AS ProductID,
           u.v.value('(@Quantity)',  'INT')      AS Quantity
    FROM @XmlData.nodes('ROOT/Customer') AS t(FullRow)
    OUTER APPLY FullRow.nodes('Order') AS x(y)
    OUTER APPLY y.nodes('OrderDetail') AS u(v)

    Edit: I know it's not a single pass... but with this approach the data is collected "top down" instead of "bottom up", where the latter has to climb up the nodes for each and every element at the bottom. If there's a large number of OrderDetails per Order, all that climbing will take quite some time...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • One more question:

    How large is the XML structure?

    It might help to insert the data into a table with an indexed xml column.

    Under certain conditions the additional time to create the xml index can be justified.

    My experience is it's mainly driven by the xml structure (e.g. number of nested levels, number of nodes per level) and the complexity of the query itself.
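    For anyone curious what that setup involves, here's a minimal sketch (the table and index names are made up for illustration, not taken from any post in this thread). A PRIMARY XML index must exist before any secondary XML index can be created:

    ```sql
    -- Illustrative only: table and index names are hypothetical.
    CREATE TABLE dbo.XmlStore
    (
        XmlID  INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, -- an XML index requires a clustered primary key
        XmlCol XML
    );

    -- The primary XML index shreds the whole column into an internal node table.
    CREATE PRIMARY XML INDEX PXML_XmlStore_XmlCol
        ON dbo.XmlStore (XmlCol);

    -- Secondary indexes (PATH / VALUE / PROPERTY) target specific query patterns
    -- and must reference the primary XML index.
    CREATE XML INDEX SXML_XmlStore_XmlCol_Path
        ON dbo.XmlStore (XmlCol)
        USING XML INDEX PXML_XmlStore_XmlCol
        FOR PATH;
    ```

    Roughly speaking, PATH tends to help path-based filters such as exist(), VALUE helps when you filter on values without knowing the exact path, and PROPERTY helps when you pull several values per row.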



    Lutz

  • I suspected that an OUTER APPLY was going to be involved. I just didn't know enough to solve it myself. Nicely done, Lutz. I'm going to have to study this "Top Down" approach because it seems a whole lot easier to understand.

    I'll also look into the indexing you spoke of. Heh... yep... you're right... the million row test will certainly come into play. 🙂

    Thank you very much for your help ol' friend. Merry Christmas to you and yours. To coin a phrase, "Peace on Earth and Goodwill to DBAs."

    --Jeff Moden



  • Here's a short example of the table-based version using your original sample data.

    CREATE TABLE #tmpXML
    (
        TmpID  INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
        XmlCol XML
    );

    INSERT INTO #tmpXML (XmlCol)
    VALUES (@XmlData);

    CREATE PRIMARY XML INDEX PXML_#tmp_XmlCol
        ON #tmpXML (XmlCol);

    SELECT *
    FROM #tmpXML;

    SELECT t.FullRow.value('(@CustomerID)', 'INT')         AS CustomerID,
           t.FullRow.value('(@FullName)',   'VARCHAR(20)') AS FullName,
           x.y.value('(@OrderID)',   'INT')      AS OrderID,
           x.y.value('(@OrderDate)', 'DATETIME') AS OrderDate,
           u.v.value('(@ProductID)', 'INT')      AS ProductID,
           u.v.value('(@Quantity)',  'INT')      AS Quantity
    FROM #tmpXML
    CROSS APPLY XmlCol.nodes('ROOT/Customer') AS t(FullRow)
    OUTER APPLY FullRow.nodes('Order') AS x(y)
    OUTER APPLY y.nodes('OrderDetail') AS u(v);

    DROP TABLE #tmpXML;



    Lutz

  • I see you've added "mind reading" to your list of skills. That was going to be my next question if I couldn't suss it on my own. Thanks, Lutz!

    --Jeff Moden



  • Jeff Moden (12/25/2011)


    I see you've added "mind reading" to your list of skills. That was going to be my next question if I couldn't suss it on my own. Thanks, Lutz!

    It's not mind-reading at all... There's a fellow around who always posts as many details as possible to make it easy for the OP - that might be one reason for becoming the Exceptional DBA of the year... 😀

    The fact that I missed quite a bit of information in my first reply just demonstrates the huge distance between him and me....

    To make it even worse: I forgot to include the "old" OPENXML approach:

    DECLARE @idoc INT;

    EXEC sp_xml_preparedocument @idoc OUTPUT, @XmlData;

    SELECT *
    FROM OPENXML(@idoc, '/ROOT/Customer', 2)
         WITH (
             CustomerID INT         '@CustomerID',
             FullName   VARCHAR(30) '@FullName'
         ) x
    LEFT OUTER JOIN
         OPENXML(@idoc, '/ROOT/Customer/Order/OrderDetail', 2)
         WITH (
             CustomerID INT         '../../@CustomerID',
             FullName   VARCHAR(30) '../../@FullName',
             OrderID    INT         '../@OrderID',
             OrderDate  DATETIME    '../@OrderDate',
             ProdID     INT         '@ProductID',
             Qty        INT         '@Quantity'
         ) y
    ON x.CustomerID = y.CustomerID;

    EXEC sp_xml_removedocument @idoc;

    ... as well as some references to further links:

    http://www.sqlservercentral.com/Forums/Topic840158-2573-1.aspx

    http://www.sqlservercentral.com/Forums/Topic417131-338-1.aspx

    http://beyondrelational.com/quiz/sqlserver/general/2010/questions/sqlserver-quiz-general-2010-jacob-sebastian-openxml-or-xquery-when-and-why.aspx

    PS: I know the "Anonymous" user over at beyondrelational rather well 😀



    Lutz

    Gosh, Lutz... you're too kind. :blush: Considering that that "Anonymous" fellow over on BeyondRelational set the tone on that post for everyone else with format, quantity, accuracy, and reasonable explanations as to why he preferred XQuery to OPENXML, I'd have to say that I'm in pretty good company. Considering the time you've spent helping me on this post and the beautifully formatted, super-easy-to-read code you've posted, I'd say I'm in great company. Thank you for the leg-up.

    I had actually sussed the problem using OPENXML in the identical fashion that you did in your fine post. Then I remembered the memory/resource "problems" associated with OPENXML and decided to try the XQuery route, hence my question on the subject. Considering that, I understand why you asked how large the XML data was going to be, because it appears from the fine links you provided that OPENXML may actually be better for one of the things I'm trying to do... keeping huge XML "files" out of my databases. Why am I hell-bent on keeping XML out of my databases? Simple... very much like a CSV, XML is a delimited, multi-value column that violates normalization and, compared to "normal" data in a properly designed database, it requires a lot of resources to process, especially more than once. To wit, I consider it to be even worse than an OTLT (One True Lookup Table). Heh... I guess that qualifies as a bit of a rant, doesn't it? 😛

    I really appreciate the code you've provided and the links. They've got some good information in them that has been difficult to find with Google, because I'm probably not searching for the right thing (it's hard to figure out what to search for on a subject I don't know) and I get millions of hits for each search. As a bit of a side bar, considering the rather poor documentation in BOL on the subject, I'm not surprised that so many people ask questions and write articles on the subject of XML.

    My questions on the subject are actually to solve a larger problem. I'm actually ticked off that you have to know the names of the attributes and the elements, as well as the structure of the XML, to be able to shred XML into a working table for "normal" analysis. It all just seems so contrary to what I thought XML was all about. Folks at work have also run into the problem of building a "style" sheet for the XML and then getting data that may not include some of the nodes for the "lower levels" (for one reason or another), causing the SSIS jobs they're running to blow up with a "missing data" or "XML structure" error. I've seen some folks' attempts at building a dynamic shredder, and it's obvious that they haven't done any million node testing, nor does the code work with more than the 1 or 2 basic levels that a lot of folks seem to limit their XML to.

    With the good help you've provided me in quickly getting up to speed on "multi-level" XML and the things I've taught myself about the underlying "edge" table that you can easily get to with OPENXML, I just might be able to pull off a small T-SQL miracle in the world of XML.
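    For readers who haven't seen that "edge" table: it's what OPENXML returns when no WITH clause is supplied, so the structure comes back without needing to know element or attribute names in advance. A minimal sketch, using the sample @XmlData from the first post:

    ```sql
    DECLARE @idoc INT;
    EXEC sp_xml_preparedocument @idoc OUTPUT, @XmlData;

    -- With no WITH clause, OPENXML returns the internal "edge" table, which
    -- describes the document structure: id, parentid, nodetype (1 = element,
    -- 2 = attribute, 3 = text), localname, text, and a few other columns.
    SELECT id, parentid, nodetype, localname, [text]
    FROM OPENXML(@idoc, '/ROOT', 2);

    EXEC sp_xml_removedocument @idoc;
    ```

    Walking that id/parentid hierarchy is one possible starting point for a shredder that doesn't need the node names hard-coded.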

    --Jeff Moden



  • Darn, I wish I would have seen this earlier so I could say I helped Jeff with something SQL related...:-D

    I'm not sure if the method described in Jacob's blog would be a starting point for a dynamic shredding solution...

    There's also a reference to the related XQuery method

    Due to the natural structure of an XML document, I don't think there's a solution available that would meet your performance requirements 😎 (each node has to be shredded separately, and the nodes referencing the same "column" need to be filtered).

    But if this is part of an import process to store the data in a relational format for further processing this might be acceptable.

    We're using a shredding method against a 2 MB XML file with a few thousand nodes nested 5 levels deep. It takes about 7 seconds and is part of our nightly import process, so I can live with the time it takes...



    Lutz

  • Interesting... I didn't realize you could reference the "/@" level that way. I'll have to give that a try because, if it works, it'll simplify the code a bit.

    --Jeff Moden


