Extract the titles from XML

  • mister.magoo (7/13/2015)


    Jeff Moden (7/13/2015)


    However it does belong in the db, and there are valid cases for it. I might have data like this list of books, perhaps it's old data and not really queried, but I want it around. Perhaps I shred the title out of the XML and store that relationally, but I have a large amount of other data that I keep in XML as a blob. I can query it if I need it, or shred it, but I can also just store it as archival data.

    I guess my question would be, have you actually used it for such a thing? Again, an example would add to the discussion quite a bit.

    I have seen people store XML and for the very reasons you mention. To wit, a lot of people have used it as their solution for column level auditing of store-only-changes to data... and they play absolute hell both with method and performance when trying to reassemble point-in-time (PIT) values for a 140 column table. At best, it usually only ties with and is usually worse than an EAV audit table in areas of storage, accessibility, and manipulation back to PIT row values.

    We have just such an usage.

    Third party data comes into the business (keeping it vague) as record based flat files (record prefixes on each line, fixed width per line), and these files are controlled by that third party, so they modify the layout fairly regularly - could be multiple changes per quarter, and the changes only affect new files produced from that date, so they are not backward compatible.

    The files are processed and stored as xml by one process (T-sql) in a well defined format that can be used internally.

    This one process is modified to cope with any format changes.

    The nature of the files being multiple records of varying definition (one file may have 100s of lines/records, and each record could be a different (but recognised) format), there would be a massive job to maintain individual tables (which would need to cope with changes to the definition) for each type of record, especially as we only need about 60-70% of the data from these files.

    So, they are transformed into XML and stored in the database. Other processes can then extract the valid XML and shred the data for storage in the main database tables. Usually, for 99% of the cases, it is extracted once and never needed again, but there are cases where we need to re-pull that data and re-process it into the database (usually for a new business case that needs something different from the data).

    We used to just keep the flat files, but once you have enough of them (millions per year) even the best folder structure in the world starts to become a bottleneck in the retrieval process (oh yes, the file names are not unique either, so we need to extract identifiers from the data).

    That IS a good use of XML and, now that you've brought that bit of computational agony up, it's possible that I might do the same although I'd be on a bit of a mission to save space because that's a whole lot of tag bloated/datatype bloated XML.

    Yes, I know disk space is supposedly cheap... it's actually not if you consider all the aspects of storage. For example, even if it's set off in monthly, read only partitioned file groups, the data will be duplicated at least once on disk as an online backup on another disk. Then there's the cost of rack space, floor space, cooling costs, backup costs, etc, etc. No sense in intentionally wasting space unless that's truly the best and lowest cost way.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/14/2015)


    That IS a good use of XML and, now that you've brought that bit of computational agony up, it's possible that I might do the same although I'd be on a bit of a mission to save space because that's a whole lot of tag bloated/datatype bloated XML.

    Yes, I know disk space is supposedly cheap... it's actually not if you consider all the aspects of storage. For example, even if it's set off in monthly, read only partitioned file groups, the data will be duplicated at least once on disk as an online backup on another disk. Then there's the cost of rack space, floor space, cooling costs, backup costs, etc, etc. No sense in intentionally wasting space unless that's truly the best and lowest cost way.

    The SQL Server stores the XML in a very efficient binary storage format, in my experience only around 20% larger than zipped XML.

    😎

  • Eirikur Eiriksson (7/14/2015)


    Jeff Moden (7/14/2015)


    That IS a good use of XML and, now that you've brought that bit of computational agony up, it's possible that I might do the same although I'd be on a bit of a mission to save space because that's a whole lot of tag bloated/datatype bloated XML.

    Yes, I know disk space is supposedly cheap... it's actually not if you consider all the aspects of storage. For example, even if it's set off in monthly, read only partitioned file groups, the data will be duplicated at least once on disk as an online backup on another disk. Then there's the cost of rack space, floor space, cooling costs, backup costs, etc, etc. No sense in intentionally wasting space unless that's truly the best and lowest cost way.

    The SQL Server stores the XML in a very efficient binary storage format, in my experience only around 20% larger than zipped XML.

    😎

    Now that's some good info, Eirikur. I didn't know that. I check it out. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 3 posts - 46 through 47 (of 47 total)

You must be logged in to reply to this topic. Login to reply