August 17, 2011 at 4:54 am
I have a requirement to read a large file (it can be MB or GB in size), pick out the complete XML documents within it (each of which may span multiple lines), and store each one as an individual record in SQL Server. Erroneous or incomplete XML documents in the file are ignored.
Currently I'm using a script source component that reads the file line by line; when the end tag of an XML document is reached, it creates an output row and populates it, along with a few other fields, to be stored in the database. My file contains 700K+ XML documents and the load takes about 7-8 hours. I've observed that for the first few hours my OLE DB destination doesn't even become active, so I'm assuming that CreateNewOutputRows completes for all records in the file before the destination is triggered.
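Here's a simplified version of what my CreateNewOutputRows does today (the tag, column and variable names are changed, and I've left out the other fields and the error handling):

public override void CreateNewOutputRows()
{
    var doc = new System.Text.StringBuilder();

    using (var reader = new System.IO.StreamReader(Variables.SourceFilePath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            doc.AppendLine(line);

            // a complete document ends with this tag; anything that never reaches it is dropped
            if (line.Contains("</record>"))
            {
                Output0Buffer.AddRow();
                Output0Buffer.XmlDocument.AddBlobData(
                    System.Text.Encoding.Unicode.GetBytes(doc.ToString()));
                doc.Length = 0;
            }
        }
    }
}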
Is this how it works? Is there some way I can optimize this load, or maybe force a periodic flush of the output buffer to trigger the destination?
August 21, 2011 at 6:53 pm
Why not use the XML Source component for this instead of a script component?
August 22, 2011 at 8:36 am
The files coming in may conform to any of 4 different DTDs and can contain namespaces too. When I tried using the XML Source with DTDs, I ran into issues. That is why.
August 22, 2011 at 8:44 am
GattJ (8/22/2011)
The files coming in may conform to any of 4 different DTDs and can contain namespaces too. When I tried using the XML Source with DTDs, I ran into issues. That is why.
You can use XSLT to pre-process the XML files. Have you seen this post?
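Something along these lines works for running a stylesheet over the file before the data flow sees it (the paths and stylesheet name are just placeholders; if you're on the 3.5 framework, use ProhibitDtd = false instead of DtdProcessing):

var xslt = new System.Xml.Xsl.XslCompiledTransform();
xslt.Load(@"C:\ETL\preprocess.xslt");   // your stylesheet

var settings = new System.Xml.XmlReaderSettings();
settings.DtdProcessing = System.Xml.DtdProcessing.Ignore;  // skip the DOCTYPE instead of failing on it

using (var reader = System.Xml.XmlReader.Create(@"C:\ETL\input.xml", settings))
using (var writer = System.Xml.XmlWriter.Create(@"C:\ETL\input_clean.xml"))
{
    xslt.Transform(reader, writer);
}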
August 22, 2011 at 9:49 am
As per the comments on that post, there may be problems when working with files of 100MB or more. In our case, we frequently receive multi-GB files. Still, I will try this out.
August 22, 2011 at 10:58 am
It doesn't work quite like you're describing.
What happens is that the CreateNewOutputRows method adds each row, one at a time, to a set of rows of a specific size called a buffer. When the buffer is "full", it gets passed to the next component (which could be your destination, but could also be some other component - it depends on what your data flow looks like). The source script is not aware of, or in control of, this activity - it just adds rows to the output buffer. This video might help explain.
If the behaviour you're seeing is a long delay before inserts happen, that may not be a "bad thing". It's the end-to-end time that you want to minimize, right? But then, you can't tell without changing things to see whether that would be better, can you?

What can you change to alter this behaviour? There are two properties of the Data Flow: DefaultBufferMaxRows and DefaultBufferSize. They suggest (not dictate) to SSIS how many rows will be in each buffer. (If you use CLOBs like DT_TEXT, I don't think those columns are included in the size calculation - they're stored "off row".) Play with those properties to see if that changes your execution.
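To put some rough numbers on it: out of the box, DefaultBufferMaxRows is 10,000 and DefaultBufferSize is 10,485,760 bytes (about 10MB, with a ceiling of 100MB). So if SSIS estimates your row at, say, 2KB wide, only around 5,000 rows fit in a buffer before it gets handed downstream; widen the buffer (or slim the row) and each buffer carries more rows, which usually means fewer, larger hand-offs to the destination. (Those are the defaults as I remember them - double-check on your build.)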
Todd McDermid - SQL Server MVP, MCTS (SQL 08 BI), MCSD.Net
My Blog - Dimension Merge SCD Component for SSIS - SSIS Community Tasks and Components