Data Flow Task vs Stored Procedures

  • Hi,

    We recently had a Microsoft Gold partner come in to demonstrate their Business Intelligence Enterprise Framework (BIEF), which they developed. There system handles all aspects of BIEF including logging, auditing, data quality, and reporting for ETL jobs.

    Their approach was to use SSIS as job control only. All heavy lifting is done through the use of stored procedures. One of the main reasons for this was speed. I've been building ETL's for a couple years now using SSIS and have not run into a situation where a stored procedure was any faster than a properly designed data flow.

    My opinion and my experience is to use what works. Our primary tool is SSIS but when a stored procedure works better that is what I use.

    Here's another thought, from an architecture standpoint, having a dedicated SSIS Server (like we do) that runs DTSX packages that only call stored procedures on the SQL server grossly under-utilizes our SSIS Server but adds that processing to the SQL servers.

    Honestly I don't like their approach, but I am a bit biased. Looking for some other thoughts. Thanks - tom

  • Hi there,

    This is a very interesting question that I was actually about to post... I am currently in a situation where I could either use a data flow task to load staging tables to have my updates staged and perform a set-based update in an execute sql task back in my control flow, or use the OleDB Command in the data flow to perform the update right away.

    This brings the question of where the load should go. Same thing when talking about finding a MIN, MAX, SUM etc. etc. These are all available in SSIS but I'm not sure exactly how they compare with SQL Server itself. Surely if they are available in transforms they must be of some use. I can understand how in certain situations, it's better to stage data and then do one update vs. 1000 individual updates but when you just don't know beforehand how you're supposed to handle things, I have a hard time figuring out when to use SSIS and when to use SQL stored procedures etc.

    So indeed, this is a very interesting question. Currently, I have a mix of both, depending on what needs to be done, but for example, lookups etc. can be done with joins very easily in a query as well. My concern is that the choice is not necessarily always based on a solid assumption that this is better done by SQL or by SSIS.

    Right now, I'm playing with a fact table loading package that needs to update fact rows or insert new ones... I just don't know if I solely rely on the data flow for that or stored procedures inside a transaction-controlled sequence container in the control flow. What I'm thinking of doing is to use the data flow to load a staging table and have an update performed at the end via a SQL task in the control flow. This allows me to take control of buffers etc. in the data flow but in the end, I will throw an update statement to SQL and I have not found a perfect way to do this without impacting SQL Server.

    Anyways, I'm just rambling around here but your question is very interesting and I'm curious to see what people do or where they draw the line.

    Regards,

    Greg

  • I'm not sure if my luck will run out or not but, so far, every time I've rewritten a DTS package to be done all in T-SQL, I've beaten the pants off the DTS package for performance. I don't believe the same will hold true with SSIS but I haven't had to do any conversions yet... so I can't say for sure.

    I will say that's it's a bit refreshing to see a company that uses some of the "old school" but very reliable methods of doing high speed ETL.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ducon,

    I think we're in the same boat on this one. I agree with your approach on updates, the OLE CMD in my opinion is nice to have but I've taken it out of every package unless the number of updates never exceeds 100 or so. I batch my updates in a temp table and run a set based TSQL command to perform the update.

    I've had data flows where I was doing many merge joins and lookups, then switched it to a stored procedure and switched it back because the performance wasn't any better. Part of this may be due to the fact I'm not very good at optimizing a large query and couldn't get the performance from the TSQL.

    Like I mentioned in my post, I am biased. We don't have a large budget so our SQL Servers are crowded and I really had to push to get my dedicated SSIS server. I'd hate to see all the processing happening on the SQL server and all the resources of my SSIS server just go to waste.

    The part that bothers me with their approach is that Microsoft best practice doesn't do it that way. Of course they wrote it so they're going to use all the components that are included.

  • Hi,

    Well, Microsoft will always do things in many different ways based on what they're trying to show/prove. For instance, when they try to show benchmarks etc. (e.g. the 1TB in 30 minute "achievement"), they will tweak things, but often these are not desirable in a production environment. So I tend to take these with a grain of salt. Another thing I've found is that often, some Microsoft reps are not the ones designing the packages and could not justify why one way was chosen over the other... Chances are, the people implementing the package did the fastest way and already had stored procedures etc. available to they just put them in SQL tasks in a nice transaction-controller container, without too much thought regarding better ways to achieve this.

    Like you said, I'm trying too to use SSIS not to replace SQL but to reduce the load as much as possible so that SSIS takes some of it. But in other cases, using transactions, try/catch blocks and stored procedures, a load of ETL tasks could be achieved in pure TSQL, and I'm not sure performance would be so much worse. But the SSIS components should be allowed to run slower, at off-peak times only, while a production SQL Server should not or does not have off peak hours. So in these cases, the option of processing as much as possible in SSIS and using fast load/set based updates is interesting.

    There are many things I like about SSIS but others, when using transactions especially, that render it a bit cumbersome. For instance, lookups with early arriving facts, fast loads etc. are not always usable and so the data flow task loses most of its appeal. When that is gone, the main thing left is TSQL in the control flow. Although that's not true if you're dealing with heterogeneous data sources. But from SQL tables to SQL tables (e.g. from an ODS to a DW), the decisions are not always easy to make I find.

    Regards,

    Greg

  • Jeff,

    I figured I'd hear from you on this one. Like I mentioned in my prev post I've had data flows that I switched to stored procedures, I was in the phase where I didn't like SSIS yet and felt much better running the sprocs. After seeing the pref problems, with the large queries, I started switching them back.

    I've only used DTS a few times and that was just for some ad-hoc table loads before SQL 2005 so I don't have a comparison to SSIS for speed.

    In our environment we only have 1 transactional system that uses SQL Server. The rest of data comes from Pervasive, DB2, Excel, and ascii files. SSIS data flows are very useful in extracting data from those systems.

    Not really sure what answer I'm looking for, guess I was just surprised by their approach and wanted some other opinions - Thanks and happy Thanksgiving

  • Greg,

    Good points, also one thing I need to remember is they are a consulting company and developed this approach to make deployment to customers more streamlined. Also they probably have more people on staff that can write quickly write TSQL than put together SSIS packages. Also from the standpoint of source control stored procedures will be easier to manage, similar to C# or VB where they can easily merge code.

    I contradicted myself earlier by saying we only have 1 transactional system on SQL but have crowded servers. We have other systems one of them being SharePoint which makes it very busy.

    2 of our larger data marts get updated every 15 minutes and every 30 minutes. Being an airline the business needs are very unique. Things that happen throughout the day can impact operational decisions later in the day. We cube up the data as the day progresses since monthly goals are impacted by every flight, especially as we approach the end of the month. Another reason to keep this daily processing off the SQL server and have it run on the SSIS server.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply