Extract the titles from XML

  • Eirikur Eiriksson (7/13/2015)


    Jeff Moden (7/13/2015)


    The correct answer isn't actually available. The correct answer should be to "hunt down the moroff that used XML to transmit purely flat data and introduce him to high velocity pork chops".

    As it is and without the leading spaces, this file contains 487 characters. If it were converted to a columnar flat file without headers, it would reduce to just 167 characters to be roughly only 1/3rd the size. Adding a single row of headers wouldn't do much to change that.

    People crank on about how slow I/O is. You won't find hardware that loads your data in a third of the time, so make sure that it's mostly data you're sending or receiving instead of the tag bloat in XML, especially for such flat data.

    Shall we set up some tests to see how these two actually compare?

    😎

    Sure!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Steve Jones - SSC Editor (7/13/2015)


    Storing data in XML ...{snip}... allows for flexibility in format, something that tends to be more problematic with flat files.

    People keep saying that, but I've not seen it to be true yet. Do you have an example?

    However it does belong in the db, and there are valid cases for it. I might have data like this list of books, perhaps it's old data and not really queried, but I want it around. Perhaps I shred the title out of the XML and store that relationally, but I have a large amount of other data that I keep in XML as a blob. I can query it if I need it, or shred it, but I can also just store it as archival data.

    I guess my question would be, have you actually used it for such a thing? Again, an example would add to the discussion quite a bit.

    I have seen people store XML and for the very reasons you mention. To wit, a lot of people have used it as their solution for column-level auditing of store-only-changes to data... and they play absolute hell both with method and performance when trying to reassemble point-in-time (PIT) values for a 140-column table. At best it only ties with, and is usually worse than, an EAV audit table in terms of storage, accessibility, and manipulation back to PIT row values (a sketch of such an EAV audit table follows this post).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
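
    For reference, a minimal sketch (all object and column names invented, not taken from any post here) of the kind of EAV audit table being contrasted with XML auditing, plus one way to reassemble point-in-time values for a single row:

    -- Hypothetical EAV audit table: one row per changed column value.
    CREATE TABLE dbo.AuditLog
    (
        AuditID    BIGINT IDENTITY(1, 1) PRIMARY KEY,
        TableName  SYSNAME        NOT NULL,
        RowKey     INT            NOT NULL,
        ColumnName SYSNAME        NOT NULL,
        NewValue   NVARCHAR(4000) NULL,
        ChangedAt  DATETIME2(3)   NOT NULL DEFAULT SYSDATETIME()
    );

    -- Point-in-time (PIT) reconstruction: the latest audited value of each
    -- column for one row, as of @AsOf.
    DECLARE @AsOf DATETIME2(3) = '2015-07-01';

    WITH LastChange AS
    (
        SELECT ColumnName,
               NewValue,
               ROW_NUMBER() OVER (PARTITION BY ColumnName
                                  ORDER BY ChangedAt DESC, AuditID DESC) AS rn
        FROM dbo.AuditLog
        WHERE TableName = N'SomeWideTable'
          AND RowKey    = 42
          AND ChangedAt <= @AsOf
    )
    SELECT ColumnName, NewValue
    FROM LastChange
    WHERE rn = 1;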

  • In my experience, the inefficiency of flat files comes from my time spent debugging ETL packages when boneheaded third parties don't know how to format them correctly. I just love tab-delimited files from systems where users like to paste data from Excel with trailing tabs. One recent case was a file with quote-delimited fields that used \" instead of "" for embedded quotes, plus another column with multi-line text using unescaped line breaks. In another case I had to resort to a regular expression parser to pull out the columns I needed.

    I will happily take XML over having to deal with regular expressions or write custom parsers any day. If they screw it up, at least you can point to official XML standards to make it clear the problem is on their end.

    Back in the stone ages you got fixed-field files written in COBOL and you knew the format could be relied on. (It would take an act of God to make the simplest change, plus at least six months development.) Later maybe they called it EDIF and you had an encyclopedia specifying every minute detail of the format. Nowadays any moron with Excel thinks they know how to write CSV files, and doesn't see why it's a big deal if they add a column or reword the headings.

    If the fact that a background job has to burn more bandwidth and disk space (while you get on with your life) bothers you more than idiots that can't follow their own format specs, then I envy you. I will still insist on XML first until I find out whether I'm dealing with someone intelligent.

  • Eirikur Eiriksson (7/13/2015)


    SQLRNNR (7/13/2015)


    Maybe so, but look at the XML plans for the larger data set versus the small data set - same plan. Using text() eliminates that function call, which is deadly in XML.

    The plans in these cases are an indication of how the work is done, not a quantification of the effort; all you'll see are Compute Scalar and XML Reader TVF operators, and there are no statistics or other cardinality information for the server to work on. More complex XML generally implies more complex tabular destinations; things get really complex really quickly when tracking relations, order of appearance, external (outside the XML) relations/references, etc.

    😎

    Not sure I follow on the quantification of effort.

    I see stats, memory grant info, cpu requirements, duration, number of pages, cost (granted it is based off some smoke and magic), number of rows, and even an implicit conversion warning that affects the cardinality estimate.

    This is true for the version with the UDX call as well as the version without the UDX. What am I missing that helps quantify the work?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
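
    For reference, a minimal sketch of the text() pattern under discussion, with invented element names (not code from either poster):

    DECLARE @x XML = N'
    <Books>
      <Book><Title>Personal</Title><Author>Lee Child</Author></Book>
      <Book><Title>The Martian</Title><Author>Andy Weir</Author></Book>
    </Books>';

    -- Without text(): value() is applied to the whole element node, which adds
    -- an extra scalar/UDX step per element in the plan.
    SELECT b.n.value('(Title)[1]', 'nvarchar(200)') AS Title
    FROM @x.nodes('/Books/Book') AS b(n);

    -- With text(): the path addresses the text node directly, which typically
    -- produces a leaner plan.
    SELECT b.n.value('(Title/text())[1]', 'nvarchar(200)') AS Title
    FROM @x.nodes('/Books/Book') AS b(n);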

  • Scott Coleman (7/13/2015)


    In my experience, the inefficiency of flat files comes from my time spent debugging ETL packages when boneheaded third parties don't know how to format them correctly.

    +100

    :hehe:

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • SQLRNNR (7/13/2015)


    I see stats, memory grant info, cpu requirements, duration, number of pages, cost (granted it is based off some smoke and magic), number of rows, and even an implicit conversion warning that affects the cardinality estimate.

    The memory grant is one part of the story. Was it sufficient, or did something spill during execution? Since SQL 2012, hash and sort warnings are visible in the plan; in older versions, that's not the case.

    CPU usage as reported by SET STATISTICS TIME is a lie. It does not show CPU usage during execution of user-defined functions. Both plans have Table Valued Function operators, so I would not trust this metric.

    Duration is actually represented accurately, but you do not know if during that time you had one CPU mostly waiting (and helping other connections in between), or 16 cores hammering at full pace.

    Number of pages as reported by SET STATISTICS IO - same problem as CPU time.

    Cost - this is always the estimated cost. And if you compare some of the estimated vs actual row counts in the plans, you will know just how reliable that estimate is.

    Number of rows - obviously the same for both plans, otherwise we would be comparing apples to sheep.

    Implicit conversion - the warning says that it MIGHT affect the cardinality estimate, not that it does. This warning always pops up (on SQL2012 and later) when there is any implicit conversion in a plan.

    SQL Server tends to throw a lot of numbers in our face, but unless you know exactly what they are and how to read them, they can confuse more than contribute.

    For performance tuning I *mostly* go by elapsed (wall-clock) time. Often not even using SET STATISTICS TIME, but simply storing CURRENT_TIMESTAMP in a variable before execution and then doing a DATEDIFF after (a minimal sketch follows this post). However, on a system with high concurrency and high overall pressure on memory and/or CPU, I would look at metrics better suited to *their* bottleneck for overall performance.

    It's also important to test with enough data, and with representative data. For instance, do you still see the same numbers if you have to extract not one but a dozen columns from the XML? (This is actually a situation I was recently faced with at a customer site, where millions of XML-formatted data records per day were imported into the database and had to be shredded and stored relationally. Their process was so inefficient that they had to use multiple parallel processes to keep up with the incoming data during peak hours; I managed to speed it up to where their business can grow by a factor of four or more before they will have new problems. But that took a lot of effort, trying all the different ways of shredding the data, until I had found the fastest method for their data and their XML-to-relational schema translation.)


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/
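
    For anyone wanting to try the wall-clock approach described above, a minimal sketch (the statement under test is just a placeholder):

    DECLARE @start DATETIME = CURRENT_TIMESTAMP;

    -- ... the XML shredding query (or whatever is being measured) goes here ...

    SELECT DATEDIFF(MILLISECOND, @start, CURRENT_TIMESTAMP) AS elapsed_ms;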

  • My biggest XML import job involves 1GB+ files. I never had the nerve to see what happens if you load that much data into an XML variable and try to shred it, at least not on a 32-bit server. Instead I used an XmlReader in a C# program to read the file element by element, adding the elements to a DataRow, the rows to a DataTable, and every 5000 rows write them to the database with a SqlBulkCopy object. The job takes 30 minutes to transfer 2.5 million rows, including the time for the source system to write the XML files.

    In this case the XML format was very simple (the word "flat" might apply) and tightly controlled, so this was a simple program to write.

    So I still prefer XML for ETL files, but I'd like to avoid having to tune queries for shredding XML with millions of leaf nodes.
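
    For contrast, the all-in-one server-side alternative described as avoided above would look roughly like this (file path and element names invented); pulling a 1GB+ file into a single XML variable this way is exactly what you might not want to do on a memory-constrained server:

    DECLARE @x XML;

    SELECT @x = CAST(BulkColumn AS XML)
    FROM OPENROWSET(BULK 'C:\feeds\big_export.xml', SINGLE_BLOB) AS src;

    SELECT r.n.value('(Title/text())[1]', 'nvarchar(200)') AS Title
    FROM @x.nodes('/Export/Row') AS r(n);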

  • Jeff Moden (7/13/2015)


    The correct answer isn't actually available. The correct answer should be to "hunt down the moroff that used XML to transmit purely flat data and introduce him to high velocity pork chops".

    As it is and without the leading spaces, this file contains 487 characters. If it were converted to a columnar flat file without headers, it would reduce to just 167 characters to be roughly only 1/3rd the size. Adding a single row of headers wouldn't do much to change that.

    People crank on about how slow I/O is. You won't find hardware that loads your data in a third of the time, so make sure that it's mostly data you're sending or receiving instead of the tag bloat in XML, especially for such flat data.

    I hate to disagree with you Jeff, but in this case, I feel I need to.

    The XML in the question was formatted in a user friendly way, as you would expect for a question, but could just as easily have been formatted with attributes instead of elements, for more efficient packing. This would make its actual data length (assuming Unicode is used as is normal these days) 478 bytes (239 characters), compared to a CSV format of the same data which runs to 392 bytes (196 characters).

    This makes the CSV format 86 bytes smaller for this sample data, which is a reduction of ~18%, not the claimed 66%.

    And, as an additional point for consideration, the same data in JSON is 528 bytes (264 characters)...

    I just wanted to point that out, as it stood out to me as unsupported, but I do agree that a flat file is more compact and efficient for storage/transmission.

    Now, none of this is even a major consideration for me, as most of the work I do is using CSV / flat files and I don't care if someone uses XML, except - as has been stated before - when the source of the data is not me.

    If you look at the number of questions on here where the OP cannot even provide sample data that is consistent and correct, and these are SQL professionals, then it becomes a "no brainer" for me - the person supplying me data can provide it as typed XML that withstands a validation check against a published schema or it will be rejected.

    JSON is OK, but it is no better than XML as far as I can see and I really doubt SQL is going to handle it any better than it does XML, so there's no point holding out for that.

    For me, the troubles I've had with third-party flat file data have been enough to push me towards more structured formats, such as XML, but that is because the size of the data has never been a big issue. If I had issues with the size of the data, then I would likely be compressing it anyway, not just sticking it in a flat file.

    declare @x xml ='<B T="Personal" A="Lee Child" D="20150428" P="Dell"/><B T="Never Go Back" A="Lee Child" D="20120312" P="Dell"/><B T="Ready Player One" A="Ernest Cline" D="20110416" P="Broadway"/><B T="The Martian" A="Andy Weir" D="20140211" P="Broadway"/>'

    declare @f nvarchar(max) = 'T,A,D,P
    "Personal","Lee Child",20150428,"Dell"
    "Never Go Back","Lee Child",20120312,"Dell"
    "Ready Player One","Ernest Cline",20110416,"Broadway"
    "The Martian","Andy Weir",20140211,"Broadway"
    '

    declare @j nvarchar(max) = '[{"T":"Personal","A":"Lee Child","D":"20150428","P":"Dell"},{"T":"Never Go Back","A":"Lee Child","D":"20120312","P":"Dell"},{"T":"Ready Player One","A":"Ernest Cline","D":"20110416","P":"Broadway"},{"T":"The Martian","A":"Andy Weir","D":"20140211","P":"Broadway"}]'

    select @x, datalength(convert(nvarchar(max),@x));

    select @f, datalength(@f);

    select @j, datalength(@j);

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw
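
    A minimal sketch of the typed-XML validation mentioned above, using the book sample (the schema content is invented; a real feed would publish its own XSD):

    CREATE XML SCHEMA COLLECTION dbo.BookFeedSchema AS N'
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="B">
        <xsd:complexType>
          <xsd:attribute name="T" type="xsd:string" use="required"/>
          <xsd:attribute name="A" type="xsd:string" use="required"/>
          <xsd:attribute name="D" type="xsd:string" use="required"/>
          <xsd:attribute name="P" type="xsd:string" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>';
    GO

    -- Assigning XML that does not conform to the schema collection raises an
    -- error, so a bad feed is rejected at the door instead of loaded.
    DECLARE @b XML(dbo.BookFeedSchema) =
        N'<B T="Personal" A="Lee Child" D="20150428" P="Dell"/>';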

  • Jeff Moden (7/13/2015)


    However it does belong in the db, and there are valid cases for it. I might have data like this list of books, perhaps it's old data and not really queried, but I want it around. Perhaps I shred the title out of the XML and store that relationally, but I have a large amount of other data that I keep in XML as a blob. I can query it if I need it, or shred it, but I can also just store it as archival data.

    I guess my question would be, have you actually used it for such a thing? Again, an example would add to the discussion quite a bit.

    I have seen people store XML and for the very reasons you mention. To wit, a lot of people have used it as their solution for column-level auditing of store-only-changes to data... and they play absolute hell both with method and performance when trying to reassemble point-in-time (PIT) values for a 140-column table. At best it only ties with, and is usually worse than, an EAV audit table in terms of storage, accessibility, and manipulation back to PIT row values.

    We have just such a usage.

    Third-party data comes into the business (keeping it vague) as record-based flat files (record prefixes on each line, fixed width per line). These files are controlled by that third party, so they modify the layout fairly regularly; there could be multiple changes per quarter, and the changes only affect new files produced from that date, so they are not backward compatible.

    The files are processed and stored as XML by one process (T-SQL) in a well-defined format that can be used internally.

    This one process is modified to cope with any format changes.

    Because the files contain multiple records of varying definition (one file may have hundreds of lines/records, and each record could be a different, but recognised, format), it would be a massive job to maintain individual tables for each type of record, tables which would also need to cope with changes to the definition, especially as we only need about 60-70% of the data from these files.

    So, they are transformed into XML and stored in the database. Other processes can then extract the valid XML and shred the data for storage in the main database tables. Usually, for 99% of the cases, it is extracted once and never needed again, but there are cases where we need to re-pull that data and re-process it into the database (usually for a new business case that needs something different from the data).

    We used to just keep the flat files, but once you have enough of them (millions per year) even the best folder structure in the world starts to become a bottleneck in the retrieval process (oh yes, the file names are not unique either, so we need to extract identifiers from the data).

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw
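
    A minimal sketch (all names invented; the real record formats are obviously far more involved) of the store-the-XML-and-shred-on-demand pattern described above:

    CREATE TABLE dbo.InboundFile
    (
        FileID     BIGINT IDENTITY(1, 1) PRIMARY KEY,
        SourceRef  NVARCHAR(50) NOT NULL,   -- identifier extracted from the data
        ReceivedAt DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
        Payload    XML          NOT NULL    -- the transformed, well-defined XML
    );

    -- A downstream process (or a later re-pull for a new business case) just
    -- shreds the stored XML again.
    SELECT f.FileID,
           r.n.value('(@Type)[1]',  'varchar(10)')   AS RecordType,
           r.n.value('(@Value)[1]', 'nvarchar(100)') AS RecordValue
    FROM dbo.InboundFile AS f
    CROSS APPLY f.Payload.nodes('/File/Record') AS r(n)
    WHERE f.SourceRef = N'ABC123';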

  • Jeff Moden (7/13/2015)


    Steve Jones - SSC Editor (7/13/2015)


    Storing data in XML ...{snip}... allows for flexibility in format, something that tends to be more problematic with flat files.

    People keep saying that, but I've not seen it to be true yet. Do you have an example?

    However it does belong in the db, and there are valid cases for it. I might have data like this list of books, perhaps it's old data and not really queried, but I want it around. Perhaps I shred the title out of the XML and store that relationally, but I have a large amount of other data that I keep in XML as a blob. I can query it if I need it, or shred it, but I can also just store it as archival data.

    I guess my question would be, have you actually used it for such a thing? Again, an example would add to the discussion quite a bit.

    I have seen people store XML and for the very reasons you mention. To wit, a lot of people have used it as their solution for column-level auditing of store-only-changes to data... and they play absolute hell both with method and performance when trying to reassemble point-in-time (PIT) values for a 140-column table. At best it only ties with, and is usually worse than, an EAV audit table in terms of storage, accessibility, and manipulation back to PIT row values.

    I don't STORE my raw XML for any length of time, but we use it as a major interchange format. I work in insurance, so the messages I deal with are snapshots of our policy transactions, with data elements describing all aspects of the coverage you purchased, what it applies to, the information we know about what you've asked us to cover, and a lot of other analytical data we've used to derive the coverage. The file also includes what info we need to send to you, how much your bill is, how much we need to pay your agent for helping you out, whether we need to pay the bank for your car loan in case of an accident, etc....

    Trying to capture this in a flat file format would require 90+ files to be created at the same time, with extensive links to data in the other files to incorporate the various relations between entities. XML dramatically reduces that complexity by creating the linkages, enforcing data types, etc... Yes - the file is a bit bigger than what I might have gotten if I squeezed out all of the extra stuff using CSV, but frankly I get it back by not having to navigate the maze to get a full picture of my billing data / claims coverage content, etc.... all of which require data in a mostly relational form, covering multiple entities and levels of data.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Jeff Moden (7/13/2015)


    Steve Jones - SSC Editor (7/13/2015)


    Storing data in XML ...{snip}... allows for flexibility in format, something that tends to be more problematic with flat files.

    People keep saying that, but I've not seen it to be true yet. Do you have an example?

    EDI. As the spec has evolved, or there are changes to products between businesses, you can flex the XML. You can't in a flat file, since when you start doing business you might not know about the ways in which a product, PO, or invoice should be described. It's much easier to change to accommodate new fields in the middle with XML than with a flat file (a small sketch at the end of this post illustrates this).

    There are other similar areas where your data schema evolves.

    However it does belong in the db, and there are valid cases for it. I might have data like this list of books, perhaps it's old data and not really queried, but I want it around. Perhaps I shred the title out of the XML and store that relationally, but I have a large amount of other data that I keep in XML as a blob. I can query it if I need it, or shred it, but I can also just store it as archival data.

    I guess my question would be, have you actually used it for such a thing? Again, an example would add to the discussion quite a bit.

    ...

    Session state is a place we've used it. We don't care about most of the data, and didn't want to store it all in RDBMS, but we wanted the data handy for debugging or picking up later. If we did need to use the data, mostly from singleton rows, we could shred the XML as needed.

    We've also used it in past companies where we wanted to keep around some data exchanged with a client that was sent in XML. In that case, the xfer would have been better in flat files, but we got XML. We could store this as a blob, though in those days we had to use C++ on a client to shred the XML and get the info out.

    I have seen it from a friend's client to keep config information that was passed out to client locations. Essentially an .ini or config file kept in the db. The information varied, and needed to be in XML for the client applications to work, but the db was the store for this. It kept in sync with clients better than keeping this in the file system. Since extracting the XML file was easier with an XML store, it fits in the db. Rarely was this assembled or shredded in the db, but it was a possibility.

    Again, all low volume places where you can manage a piece of business logic centrally in the db, but performance or scale aren't important.

    If you're lazy about keeping data that is transactional and often queried/changed, and store it in XML, you deserve every complaint you get.
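
    A small sketch of the flexibility point above (element names invented): a new optional field added to the feed does not break existing shredding code, because a missing node simply comes back as NULL:

    DECLARE @old XML = N'<PO><Item>Widget</Item><Qty>3</Qty></PO>';
    DECLARE @new XML = N'<PO><Item>Widget</Item><Qty>3</Qty><Currency>EUR</Currency></PO>';

    SELECT p.n.value('(Item/text())[1]',     'nvarchar(50)') AS Item,
           p.n.value('(Qty/text())[1]',      'int')          AS Qty,
           p.n.value('(Currency/text())[1]', 'char(3)')      AS Currency  -- NULL for the old layout
    FROM @old.nodes('/PO') AS p(n)
    UNION ALL
    SELECT p.n.value('(Item/text())[1]',     'nvarchar(50)'),
           p.n.value('(Qty/text())[1]',      'int'),
           p.n.value('(Currency/text())[1]', 'char(3)')
    FROM @new.nodes('/PO') AS p(n);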

  • Hugo Kornelis (7/13/2015)


    SQLRNNR (7/13/2015)


    I see stats, memory grant info, cpu requirements, duration, number of pages, cost (granted it is based off some smoke and magic), number of rows, and even an implicit conversion warning that affects the cardinality estimate.

    The memory grant is one part of the story. Was it sufficient, or did something spill during execution? Since SQL 2012, hash and sort warnings are visible in the plan; in older versions, that's not the case.

    CPU usage as reported by SET STATISTICS TIME is a lie.

    Good enough right there. I'm done playing devil's advocate because you said that.

    I always love it when somebody can point out that SQL Server lies. 😉

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • I don't like XML either.

    The JSON functionality in SQL2016 is more my thing.

  • What is all this XML nonsense? 😀

    Thanks for the question though. I thought I was at a SQL Server certification exam again 😀

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Ed Wagner (7/13/2015)


    And here I thought I'd be the first one to comment that I don't like XML. 😀

    Add me to that list.
