August 10, 2009 at 3:09 pm
Hello,
I have a question regarding a current package I have running which performs well, but takes a long time to run under heavy load. The process is as such:
1: Messages arrive in a staging table in the form of XML documents
2: An Execute SQL Task reads these into an Object variable
3: A ForEach loop container uses the Object variable to read every XML message as a string
4: Inside the ForEach loop there is a data flow task that is basically an XML Source transform that takes the string XML variable and saves it into a table based on its element type (a message can be of several different types, each mapped to a different XSD element, the purpose being to save the data into the right table based on its type/element)
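For reference, the query behind step 2 is essentially something along these lines (table and column names are simplified here, not the real ones):

SELECT MessageId,
       CAST(MessageBody AS nvarchar(max)) AS MessageXml  -- one row per XML document, read as a string
FROM   dbo.MessageStaging;                               -- simplified name for our staging table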
My issue is that the package takes a long time to execute. Each pass through the loop is OK, but it all adds up and the whole package can take hours when there are just short of 15,000 messages in my staging table.
I imagine the ForEach loop was not originally designed for this kind of work, but since the XML Source does not take any input I don't really see how to do this any other way.
I really like the convenience of the XML Source in that I just use my XSD elements and map them into destination columns but I would like to improve performance...
Has anybody run into these kinds of issues? If so, what did you do to help? I was thinking of maybe trying to have a Script Task call my XML source task instead of having the ForEach loop but I'm not sure that would help much in terms of performance.
Any help would be greatly appreciated.
Thanks,
Greg
P.S: The connection manager is already set to keep connections 🙂
August 10, 2009 at 3:18 pm
Are you doing anything with the XML data or does it just go from your staging table straight into your destination tables?
So the sole purpose of your package is to insert XML data into a table?
If you are not reading the data in from a source, transforming it, and inserting it, SSIS may not be the best tool for the job here.
Based on what you've said here, a Service Broker solution would dramatically improve performance by allowing the process to multithread instead of processing one row at a time.
Or just a stored procedure may do the trick. I have a feeling that there is more to this than that. Can you clarify?
August 10, 2009 at 4:50 pm
Hello,
Thanks for the reply. Well, there is not much more to it than what I have described. Originally, we had a Java MDB perform this operation using a JAXB/EJB mapping that worked quite well. However, there were issues with the customer in that it required extra components to be installed, and so on... So back then we looked for an alternative (purely for the XML-to-table insertion part; I know there is no XSD validation like in JAXB, but we control both ends of the pipe here, so "trusting" the message contents isn't supposed to be an issue) and the XML Source task looked like a good way to achieve the same purpose.
Maybe giving a bit more context will help clarify things here... We develop an application, mostly in Delphi, that implements COM/COM+ components; not the latest .NET gizmos, but it does work quite nicely. We want to provide a data repository with facts and dimensions that are based on the data and the actions users perform in this application. When they perform an action, we save it as an XML message into a special table. This is the table I was talking about. Then, in an asynchronous fashion, these "actions" or "events" are moved into a staging area before our fact and dimension packages kick in and move them into the final data repository. The great thing about this is that everything is done using SSIS.
I am curious (and interested) about the alternatives you are offering, actually. Could you expand on them (or maybe post a link) if you have time?
As you can see, it's not necessarily ideal but it does have its advantages. The package in question here serves no other purpose (other than a couple of auditing/logging steps), but any alternative solution should ideally be able to fit into a package (for scheduling, interactive debugging, etc.). We can't really impose an app server or a complex setup of another Windows component on our customers unfortunately, as I am sure you can understand if you develop applications. I'm not implementing a one-time solution here but looking for something that we can ship with our product.
Thanks,
Greg
August 10, 2009 at 5:00 pm
So just to be clear, your application is tracking user events in the form of XML messages in a table. You want to move those messages into other tables based on the message type, where each message type represents a certain user action or event. Your SSIS package then runs and transforms the XML data in these event-specific tables into your dimensions and facts. Is this right? How is the SSIS package triggered: by SQL Agent at a certain time or interval?
Do all of these user actions/events end up in the same fact table?
August 10, 2009 at 5:33 pm
I appreciate the time you are taking, it really helps because it makes me look at things from a higher perspective too 🙂
In the application (the ODS, as Kimball would call it, I guess), a user performs an action A. The application builds an XML message with the information relevant to this action (for example "Asset A moved from point X to point Y") and saves it into a "message" table. All messages of all types end up in this same table (later on I want to split dimension and fact messages, for obvious reasons).
A package, scheduled via SQL Agent at relatively frequent intervals, takes the contents of this table, puts it into a temporary table and, using the dreaded ForEach loop, sends each XML message into a data flow task. This data flow task uses an XSD that contains all possible message types sent by the application (for information, the application formats the messages according to the same XSD, hence the "trusting" of the message contents). Based on the type of message (i.e. the XSD element type encountered), the contents of the message are saved into a staging table dedicated to that message type. On certain occasions, one message can be sent to two staging locations as well, so that our ODS does not have to send two messages with the same info for different purposes.
Then, at regular intervals (usually about 30 min. for facts or once or twice a day for dimensions), we have master packages that are scheduled via SQL Agent that get executed. They basically will take the data from the staging table they deal with and process it, this time performing real ETL tasks before sending the processed data into the final dimension or fact table.
So in essence, the process is in three stages, each independent from the other. This is by design to allow for better scalability and fault tolerance among other things (as well as the ability for a customer to later choose to process events from staging into facts more or less often for certain processes based on their needs):
1- Send an XML message into a central table
2- Read the XML messages and send them into appropriate staging tables
3- Read the staging tables and perform ETL to save fact/dimension data
If this is unclear let me know. Thanks,
Greg
August 11, 2009 at 9:40 am
Greg,
The ForEach Loop container is not the problem. The problem is caused by the data flow -> XML Source -> XSD validation process. I'm thinking of this approach:
1. Inside the ForEach container, implement a custom Script Task which determines the type of the XML document and then, based on that type, stores in a variable the name of the table where the document should go.
2. Inside the ForEach container, insert a standard Execute SQL Task which inserts the current iteration's XML document into the table determined in step 1.
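As a rough sketch, the Execute SQL Task in step 2 could just call a proc along these lines, passing the current XML string from the loop variable (all names here are invented, and I've folded the type check into T-SQL for the sake of a self-contained example; in the package it would live in the Script Task from step 1):

-- sketch only: table and element names are placeholders
CREATE PROCEDURE dbo.usp_RouteMessage
    @MessageXml nvarchar(max)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @msg xml;
    DECLARE @msgType nvarchar(128);

    SET @msg = CONVERT(xml, @MessageXml);
    -- the root element name tells us which staging table the message belongs to
    SET @msgType = @msg.value('local-name(/*[1])', 'nvarchar(128)');

    IF @msgType = 'AssetMoved'
        INSERT INTO dbo.Staging_AssetMoved (MessageBody) VALUES (@msg);
    ELSE IF @msgType = 'InventoryChanged'
        INSERT INTO dbo.Staging_InventoryChanged (MessageBody) VALUES (@msg);
END;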
August 11, 2009 at 9:51 am
Thanks for the info Greg. I've got a busy morning ahead of me, but I'll get something back to you today sometime. I'll share with you how I did this almost same thing using Service Broker. I think you can eliminate the need for the first SSIS package even without using Service Broker!
August 11, 2009 at 10:55 am
Hello,
No worries, I appreciate your time. Looking forward to your comments on this one. The process I described is pretty much "by the book" but the actual implementation is not, I realize that, so any good way to make it more efficient is a great help.
Regards,
Greg
August 11, 2009 at 10:58 am
Hello,
I actually spent a bit of time this morning replacing the ForEach with a custom For Loop that uses an array built from my recordset and, as you say, the loop is not the issue. The validation is not slow per se, but repeated over and over it just doesn't perform that well. I guess this is because of the parsing needed to send each message to the right OLE DB destination.
My other attempt, which was unsuccessful, was to try to read my data in a Script Task and trigger the data flow task programmatically from there. However, I was never able to get the second part working so I just gave up. I don't think it would have been a very pretty solution anyway, if it had worked at all...
Greg
August 12, 2009 at 10:01 am
Late last year, I was tasked with taking an OLTP database that had become a combination OLTP/OLAP reporting database and creating a way to run the OLTP application against it while also expanding the OLAP functionality out into a long-term archive database that would be used for reporting, archive, and data mining. The data retention and usage varied so drastically between the OLTP and OLAP perspectives (duh, right) that I decided to create a new model for the OLAP data using dimensional modeling and a star schema. I ended up creating a new schema inside the OLTP database to house this new model, and I initially used SSIS to ETL the data from the dbo (OLTP) schema into the new OLAP schema.
I say initially because a new requirement came in where they wanted the data in the star schema in real time. The challenge with this was that the SSIS package was run via SQL Agent on a schedule, so there would always be some latency. Granted, you could schedule the package to run every minute, but with the amount of data that could go through the package, we were afraid that the latency would be noticeable. So... I went through a phased approach to get rid of the SSIS packages entirely.
We had 2 packages, one for setup-related data and one for transactional data. I set up 2 Service Broker queues, one for each. Now, when a change was made on the OLTP side, I used a trigger to drop an XML message representing the change into the corresponding Service Broker queue. The SSIS packages were changed to, instead of being called via SQL Agent, monitor the Service Broker queue and act upon any message dropped into it. This worked fairly well, but I still had concerns with the throughput of the SSIS package for transactional data. Since the package was set up to act upon messages in the queue, it became a sequential process, only able to operate on one message at a time.
Enter the activation procedure. The XML messages that represented transactions were created to match a specific XSD that I designed, so that we could allow other source systems to create a standard XML message and send a transaction into our archive system. So most of the transformations would be done by the source systems to get the XML into the correct format. All that then needed to be done to get the data into the star schema was lookups for the key values in the dimension tables. At this point, I threw away the SSIS package for transaction data and transferred all of the ETL logic into a set of stored procedures that took the XML message, ripped it apart, performed the lookups (including creating new rows for SCD type 2 dimensions), and created the facts. This ‘import’ stored procedure was then made the activation stored procedure for the Service Broker queue. Service Broker allows you to set a limit on how many instances of an activation procedure you want to run simultaneously using the MAX_QUEUE_READERS setting. We set MAX_QUEUE_READERS to 10 and Service Broker handled throttling the processing threads for the queue based on the queue volumes. This means that Service Broker was able to run up to 10 simultaneous imports! The import process became fast enough that there was no visible latency between changes in the OLTP system and fact record creation on the OLAP side of the house. This system is up and running now at one of the top 10 largest banks in the US.
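For reference, the activation hook is just part of the queue definition. A minimal sketch (object names changed/invented) looks like this:

-- queue with an activation procedure and up to 10 concurrent readers
-- (the import proc must already exist when the queue is created)
CREATE QUEUE dbo.TransactionQueue
    WITH STATUS = ON,
         ACTIVATION (
             STATUS = ON,
             PROCEDURE_NAME = dbo.usp_ImportTransaction,
             MAX_QUEUE_READERS = 10,
             EXECUTE AS OWNER );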
I said all of that to say this: your process can be improved drastically by using Service Broker to manage your queues instead of staging tables. If you think about it, your staging table design is really just a queuing system that you have to maintain manually. Let Service Broker do it for you! So how do you get started? I’m sure your application is like most out there in that you can’t just redesign it. You can, however, make some changes that will help you now.
OK, so you’ve got SSIS doing 2 things for you now. First, you’re using it to poll through your main message table and redirect messages to their destination tables, where a second SSIS process (or group of processes) picks them up and transforms them into facts. Let’s start by eliminating the first process. I see 2 ways to replace it with minimal impact on your application. Both methods use a view as a set-based interface into the routing code (a trigger or stored procedure). I don’t know the name of your master message table so I’ll call it Messages.
1. Using T-SQL logic to move your messages.
a. Drop your Messages table.
b. Create a view called Messages with the same schema definition as your table.
c. Create an INSTEAD OF trigger on your view. This is where your message routing code will live. The view itself will never contain rows; it’s just an interface into the code tied to the trigger.
d. Place logic inside the trigger that reads the incoming messages, determines the message type, and inserts each message into the corresponding message table (see the sketch after the pros and cons below).
PROS:
-Eliminate SSIS package and maintenance surrounding it. This is not really an ETL operation, you’re using SSIS as a messaging protocol.
-Automatic routing of messages into their appropriate queue tables.
-Eliminate the dependency of the second SSIS package on the first. It can be run at any time without having to ensure that the first package has completed.
-By using the view/trigger option, you will not need to re-code any of the messaging code in your application. It will go about business as usual and ‘think’ that it is putting messages into the messages table.
-You are 1 step closer to a Service Oriented Database Architecture!
CONS:
-Getting buy-in from peers. Whoever came up with the SSIS solution may not like the idea of it being replaced.
-Throughput? I’m not sure how many messages you need to send through this process. Since there’ll be additional code running behind the ‘insert’ process, it may not scale as well as you need it to.
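To make option 1 concrete, here’s a rough sketch. I don’t know your schema, so every name below is a placeholder, the shell table is just one way to give the view a shape, and you’d add one routing branch per message type:

-- assumes an empty shell table dbo.Messages_Shell with the original Messages schema
CREATE VIEW dbo.Messages
AS
    SELECT MessageId, MessageBody FROM dbo.Messages_Shell;
GO
CREATE TRIGGER dbo.trg_Messages_Route ON dbo.Messages
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- stage the incoming rows and tag each one with its XML root element name
    DECLARE @incoming TABLE (MessageBody xml, MsgType nvarchar(128));
    INSERT INTO @incoming (MessageBody)
    SELECT MessageBody FROM inserted;

    UPDATE @incoming
    SET    MsgType = MessageBody.value('local-name(/*[1])', 'nvarchar(128)');

    -- one INSERT per message type (placeholder element/table names)
    INSERT INTO dbo.Staging_AssetMoved (MessageBody)
    SELECT MessageBody FROM @incoming WHERE MsgType = 'AssetMoved';

    INSERT INTO dbo.Staging_InventoryChanged (MessageBody)
    SELECT MessageBody FROM @incoming WHERE MsgType = 'InventoryChanged';
END;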
2. Using Service Broker to move your messages
a. Drop your Messages table.
b. Create a view called Messages with the same schema definition as your table.
c. Create an INSTEAD OF trigger on your view. The trigger code will drop the incoming messages into a Service Broker queue.
d. Create a stored procedure to hold your message routing code.
e. Use the SP above as the activation procedure for your Service Broker queue (see the sketch after the pros and cons below).
PROS:
-Same as option 1.
-All of the advantages built into Service Broker (read BOL).
-Real-time message routing.
-Service Broker will scale up to handle high volumes. Throughput will not be an issue. If SB needs to run 10 threads of the routing code, it will.
CONS:
-Again, peer buy-in. That’s it.
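And here’s a minimal sketch of the Service Broker plumbing behind option 2. Again, every name is a placeholder, and the routing proc from step d (which does the RECEIVE and the same routing logic as the trigger in option 1) must already exist:

CREATE MESSAGE TYPE [//YourApp/EventMessage] VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [//YourApp/EventContract] ([//YourApp/EventMessage] SENT BY INITIATOR);
CREATE QUEUE dbo.EventQueue
    WITH STATUS = ON,
         ACTIVATION (
             STATUS = ON,
             PROCEDURE_NAME = dbo.usp_RouteEvent,  -- your routing proc (step d)
             MAX_QUEUE_READERS = 10,
             EXECUTE AS OWNER );
CREATE SERVICE [//YourApp/EventService] ON QUEUE dbo.EventQueue ([//YourApp/EventContract]);
GO
CREATE TRIGGER dbo.trg_Messages_Enqueue ON dbo.Messages
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @h uniqueidentifier;
    DECLARE @body xml;
    SELECT @body = MessageBody FROM inserted;   -- sketch assumes single-row inserts

    -- in real code you would reuse/end conversations and add error handling
    BEGIN DIALOG CONVERSATION @h
        FROM SERVICE [//YourApp/EventService]
        TO SERVICE '//YourApp/EventService'
        ON CONTRACT [//YourApp/EventContract]
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @h
        MESSAGE TYPE [//YourApp/EventMessage] (@body);
END;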
Both methods get you closer to a Service Oriented Database Architecture with true message processing capabilities built into the database layer. Getting the first SSIS package out of the mix will let you focus on the second package (or set of packages) that are actually handling the messages and creating the facts and dimension rows. You could eventually replace this with Service Broker queues as well and create your facts/dimension rows real-time in the database without having to rely on external processing.
Anyway, this is long-winded, but I wanted to show you what is possible with Service Broker and its messaging and scalability options. It may make sense for what you are doing.
August 12, 2009 at 10:13 am
Hi,
I'm going to take a bit of time to read your post but I definitely want to thank you for taking the time and effort to describe the process you followed and to analyze it in such detail. Truly amazing, kudos to you :). I'll read it and will post back my impressions. I just wanted to say a couple more things about the situation you described.
It's very interesting that you point out your problem of querying the reporting area in real time for events that "just" happened. This "near real time" issue popped up last week on the project I am talking about, actually. Since another developer and I knew what that meant, back when the specs of the system were put down on paper and we decided to go with SSIS, we made a point of saying that this would be for reporting purposes. Maybe we could, in near real time, have something like a 15/20 min. delay on some facts, but it was clear that mixing OLAP and OLTP data was not just a bad idea but something dangerous.
It's funny because no matter how much we had told the project people that this was to be considered a system "disconnected" from the ODS, and that this was why it could handle such load and perform real ETL processes, they still came back with "ok, in that report we could show the current inventory level in real time"... That's where I actually had a bit of a meltdown, to be honest :crying: I just told them that next would be contextual menus to interact with the OLTP system, which we could not do because our key mappings were reserved for the ETL process and could under no circumstances be used for such hacks.
In my head, that separation is the real power of the warehouse, but I think it's more of a learning process for others than a limit people should try to push, because the system just won't deliver otherwise...
August 12, 2009 at 10:27 am
Well, from what you've just described, Service Broker may be your best friend. If you need to track inventory and other things in real time or near real time, a message-based service-oriented architecture is exactly what you want to work towards.
Much of your decision, I'm sure, will be based on how difficult it will be to re-create your SSIS data flows in stored procedures. I found it to be very easy in my scenario, but I did a lot of it with XML, where I passed XML nodes as input parameters to stored procedures that operated on those nodes. The whole message processing system relied heavily on XML messages, even in the SP code. It was a bit of a learning curve for me to pick up the new XML capabilities of SQL Server 2005 (and up), but I had a great deal of XML experience in a past life and I think the time taken to figure this out in T-SQL was well spent!
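As a trivial example of the kind of proc I mean (element, column, and table names here are completely made up), one that shreds a message, looks up a dimension key, and writes the fact might look like:

CREATE PROCEDURE dbo.usp_ImportAssetMove
    @msg xml
AS
BEGIN
    SET NOCOUNT ON;

    -- shred the XML and resolve the surrogate key in one set-based statement
    INSERT INTO dbo.FactAssetMove (AssetKey, FromLocation, ToLocation, MovedAt)
    SELECT da.AssetKey,
           n.value('(FromLocation)[1]', 'nvarchar(50)'),
           n.value('(ToLocation)[1]',   'nvarchar(50)'),
           n.value('(MovedAt)[1]',      'datetime')
    FROM @msg.nodes('/AssetMoved') AS t(n)
    JOIN dbo.DimAsset AS da
      ON da.AssetCode = n.value('(AssetCode)[1]', 'nvarchar(50)');
END;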
August 12, 2009 at 11:14 am
Hello John,
Again thanks for the long reply. I have read your post in and out a few times. Here are my impressions.
I think your point regarding the message reading/staging is dead-on. There is definitely a bottleneck in using the current setup since we can only process one message at a time and perform the same type of parsing to decide where the message should go etc.
I do like your first solution but it might be problematic in that I don't want these operations performed in a synchronous fashion. The OLTP's insert into the view with the INSTEAD OF trigger would still wait for the trigger to complete (i.e. for the SP to process the message and send it to the right table) without any other "staging" area in between, if I'm not mistaken?
The second solution does sound very elegant indeed. While getting rid of all the packages isn't really possible, I'm afraid, because the ETL process works better in regular batches for us, this would be ideal for the message staging part of the process. Much like the initial plan we had with the MDB, actually... The MDB was doing exactly what the Service Broker setup you described does. We abandoned it also due to transaction issues, because some of our components are COM-based and do not support transactions (yes, this is still something that can be found...).
On a different note, I'm the original implementer of the packages so I'm not taking offense at that ;). There is an issue with this part of the process: it saved us some coding time but it does create a performance bottleneck. Back then it was used to replace the Java component and we did not have the required time to study scalability, etc., so I guess we're paying the price for it now.
I'm going to think about all this and maybe give a try to some of the things you suggested, if anything for the future. Like I said, I think the ETL process is better off in a real SSIS package: we could have coded something to do the same logic, but this way we can debug the packages in real time, provide interactive logging, etc., which is convenient for us and our customers. But the staging part, yes, it's not ETL and as such doesn't really have anything to do with SSIS...
Regards,
Greg
August 12, 2009 at 11:26 am
Thanks for the feedback.
Just another comment on using the view to get the message into a SB queue. Your concern with the first option is why I chose not to go that route. You'd end up holding your message insert transaction open while the routing logic ran. That's the major downside. The second option will fly! Inserting the message into the SB queue is extremely fast and with the SB process scalability options, you'd never end up waiting on messages getting to your staging tables, it would be real-time.
Keep me posted on your progress, either through this thread or by sending me a PM with progress or feedback. I'll be interested to see if this helps you; I think it will.
August 14, 2009 at 7:34 am
Hi John,
Thanks for the info. Yes, your comment does point out my fear so I'm pretty sure that while I might experiment with it, it won't make it into the real thing. I'm going to read up on the service broker over the next couple of weeks, the current design cannot be changed for a bit due to an imminent rollout. I'll definitely let you know of any changes I make to our current pattern to go back toward a real messaging system.
One last thing though, if I may... How is message reception/delivery guaranteed with Service Broker? Can you, for example, set a retry count, and have a COM+ application sending the message perform a rollback on its whole transaction using the COM+ transactional context (which uses DTC)? I'm gonna look it up later on anyway, but I'm curious how this compares overall to a messaging service like JMS.
Thanks,
Greg