Boss wants me to avoid "Execute SQL" tasks

  • I am re-designing existing SSIS packages.

    The schema is being changed, along with some of the validation logic.

    Existing packages have lots of "Execute SQL" SSIS tasks where they do all kinds of updates and validations, including calls to stored procedures where validation is done.

    My boss doesn't like these "Execute SQL" tasks. He thinks the SQL is hard for the operator to troubleshoot.

    He's actually asking me if we can completely replace these "Execute SQL" tasks and do all updates and validation inside the SSIS package. I've been thinking about it, and it looks like in order to implement the same logic I will have to use a lot of Derived Columns, Lookups, Conditional Splits, Conversions, and Expressions, and only after that load into the final staging table.

    Is it worth it? What's the best practice? Is it OK to call stored procedures from SSIS, or should I use SSIS tasks to do the validation?

  • If your Execute SQL Task is not in a loop and does not contain loops or cursors, then I'd say your performance will be best using that task. If you are doing looping, then a Data Flow task with all the steps in it might be worthwhile.

  • No looping.

    Just simple:

    UPDATE t1 SET VALID = 'Y'
    FROM t1
    INNER JOIN t2 ON t1.ID = t2.ID

    and so on....

    I talked to SSIS developers in one of the big banks here. The guys said they write all the logic using SSIS tasks. Their IT Director wanted to get rid of SQL as well.

    Why do they hate T-SQL so much?

  • They don't like it because they don't really understand it. Sure, you can do it all in SSIS Data Flows and it will work, but updating 1,000 or more rows on a row-by-row basis (Data Flow or SSIS task) will take seconds, while updating with T-SQL will take milliseconds. I'd suggest re-writing just one fairly simple Execute SQL Task in SSIS, then showing my manager the difference in performance and complexity. If they still won't listen to logic, do what you're told, but keep documentation, at least an email showing that you wanted to do it one way and the boss said another. It's called CYA, and unfortunately it's usually necessary in these situations, when load time goes from one hour to two.
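    The gap described here can be sketched in T-SQL. This is a hypothetical illustration (tables t1 and t2 are not from any post in this thread): the cursor loop is what a row-by-row approach effectively does, while the single UPDATE is what an Execute SQL Task or stored procedure would run.

    ```sql
    -- Row-by-row (RBAR): one UPDATE per matching row, as a per-row
    -- SSIS component would issue. Hypothetical tables t1(ID, VALID), t2(ID).
    DECLARE @id INT;
    DECLARE c CURSOR FOR
        SELECT t1.ID FROM t1 INNER JOIN t2 ON t1.ID = t2.ID;
    OPEN c;
    FETCH NEXT FROM c INTO @id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        UPDATE t1 SET VALID = 'Y' WHERE ID = @id;  -- one statement per row
        FETCH NEXT FROM c INTO @id;
    END;
    CLOSE c;
    DEALLOCATE c;

    -- Set-based: the same result in a single statement.
    UPDATE t1
    SET    VALID = 'Y'
    FROM   t1
    INNER JOIN t2 ON t1.ID = t2.ID;
    ```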

  • I understand your boss's problem and here's the simple solution to maintain performance and to make him happy.

    Take all the code in those Execute SQL Tasks and make them stored procedures on the DB. This way, you don't have to update the package to update the code, but you keep the performance capabilities of the SSIS package.

    We have the same issue due to SOX compliance and this solved all our problems. Now we don't have to take packages back through the SDLC process, we just have to take the procs through it. Which makes life MUCH happier for all concerned. @=)

    Brandie Tarvin, MCITP Database Administrator. LiveJournal blog: http://brandietarvin.livejournal.com/. On LinkedIn, Google+, and Twitter. Freelance writer: Shadowrun. Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie,

    The boss won't like it either.

    It's not just that he doesn't want to see too many "Execute SQL" tasks in the SSIS package.

    He doesn't want to call stored procedures from SSIS at all.

    I have to give him really good reasons why we need to call stored procedures from SSIS.

  • I'm going to risk being controversial here, but I'd probably agree with your boss...

    The execute SQL task has its place and is indeed the fastest way to perform a large update. Other good uses for it are package control types of operations...

    This all seems obvious; however, I have seen so many times developers using the SSIS control canvas as a T-SQL execution platform. This is usually because they've heard a very general statement that T-SQL is much faster, and therefore they refuse to adopt SSIS as an integration tool the way it's meant to be used. It might also be because they are familiar with T-SQL and not so much with SSIS, and therefore aren't using the right tool for the right job.

    I might perhaps agree with you in your situation (I don't know your requirements); however, in your boss's shoes, I'd probably say the same thing to my developers: I don't want to see Execute SQL tasks unless there is a good reason for them.

    Flame away 😉

    Kindest Regards,

    Frank Bazan

  • No flames here. @=)

    I don't know how it works in 2008, but I have done tests in 2005 where an Execute SQL task has proved 5 times faster than doing a Data Flow task with Merge Joins and Conditional Splits.

    And an Update proc works MUCH faster than using the OLE DB Command task because my proc is using Set based logic instead of RBAR.

    On the other hand, there are other situations where you can't do what you need by using an Execute SQL Task. I always look at both situations first and choose the one that works best for me.

    FYI, Andy Warren & Brian Knight both recommend using T-SQL in SSIS wherever possible because they also have noticed performance gains. That's not to say they're always right on everything, but they seem to know what they're talking about in this instance. @=)


  • But isn't a faster execution of code a good enough reason?

    Plus you don't have to browse through tens of different SSIS task objects like "Conversion, Split, Merge, Lookup", etc.

    It's all in one nice stored procedure.

    I might be wrong, but isn't it easier to troubleshoot a stored procedure than a collection of different SSIS objects?

    I'm not against having more logic implemented inside SSIS.

    I'm just trying to foresee which way is better from the troubleshooting point of view.

    That was the biggest complaint about the existing SSIS packages.

    "Extremely hard to troubleshoot. Total frustration when the package fails and does not give any clue where the problem is exactly."

  • Brandie,

    You wrote:

    On the other hand, there are other situations where you can't do what you need by using an Execute SQL Task. I always look at both situations first and choose the one that works best for me.

    What is it exactly that you can't do using "Execute SQL"?

    I've never had this situation. I'm curious.

  • riga1966 (8/26/2008)


    Brandie,

    You wrote:

    On the other hand, there are other situations where you can't do what you need by using an Execute SQL Task. I always look at both situations first and choose the one that works best for me.

    What is it exactly that you can't do using "Execute SQL"?

    I've never had this situation. I'm curious.

    I've never been able to come up with good T-SQL to do a row count on a text file, for instance. Or to loop through a group of files on the server and do different actions depending on what the file name is (you can do this in SQL, but it's really hideous and faster to use the SSIS ForEach Container).
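    A sketch of the kind of workaround being called hideous here, assuming a hypothetical file path and table name: bulk-loading the whole file into a throwaway single-column table just to count its lines.

    ```sql
    -- Hideous workaround: count the lines of a text file in T-SQL.
    -- The path and table name are hypothetical, and this falls over if
    -- the file contains the default field terminator (a tab).
    CREATE TABLE #lines (line NVARCHAR(MAX));

    BULK INSERT #lines
    FROM 'C:\files\example.txt'
    WITH (ROWTERMINATOR = '\n');

    SELECT COUNT(*) AS line_count FROM #lines;

    DROP TABLE #lines;
    ```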

    There are a few other things I haven't been able to do with an Execute SQL task, most of them related to data transformation, but I can't remember them off the top of my head. If I do, I'll post them.

    Execute SQL does work for a majority of things, though, if you know how to code.


  • I have to agree with both points above, which may leave you wondering why I said what I said. Apologies in advance for the long ramble...

    I agree that Execute SQL tasks tend to perform faster than SSIS dataflow transforms in any set-based operation you can think of: aggregations, sorting, joining, etc.

    I also agree that a stored procedure is easier to troubleshoot for most SQL developers (although maybe not for drag-and-drop specialists ;-)).

    However, my agreement with the boss comes from a slightly different perspective... If your ETL package stages your data at all, then there is a good chance that you're writing updates (faster though they may be) where you didn't need to in the first place.

    This might be fairly trivial in a table of 100k records, but when you're writing to a table with tens of millions of records, you're going to be far better off if you can perform all your integration without the need for any updates at all.

    I'll try to elaborate a bit here using an example I came across at a client site... I was asked to optimise a slow-running SSIS package that was written to provide a report for the business every day. The table was approximately 20 million records by 120 columns. The package itself used about 8 Execute SQL tasks, which held the SQL to update the imported temporary table (about 2 million records), with various trivial dataflows in between. It took 2 hours to complete.

    The advantage came from removing all but two of the Execute SQL tasks (one to get the date parameters and one to insert the audit records) and instead performing the same logic in one dataflow. I eventually got the load time down to 20 minutes.

    From Source to Destination, or CSV to reporting table, a dataflow task was able to improve vastly on the overall time it took.

    Bearing various factors in mind, it's not hard on this occasion to see why:

    As many organisations do, we had SSIS on its own dedicated server, reducing overhead on the DW SQL Server.

    The transform operations run in memory, so resource contention is less likely.

    In the original package, the dataflow was initially used to import the CSV data into temp tables; thereafter, Execute SQL tasks were used to make several individual updates to the temp table with computed values and various looked-up records (this was bad in itself, as the updates could have been done in 2 or 3 statements instead of the 10 or so that they had used). After the updates had been performed, the temporary data was then inserted into the main reporting table.

    In the optimised package, these values were derived in SSIS, or selected out of the lookup tables from the source. No updates were ever required because the data was never staged. It was written straight into the reporting table.
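    In T-SQL terms, the before and after amount to something like this (a sketch only; the table and column names are hypothetical, not taken from the client example):

    ```sql
    -- Before: stage the import, patch it with a series of updates, then copy it on.
    UPDATE s SET customer_key = d.customer_key
    FROM #staged s
    INNER JOIN dim_customer d ON d.customer_id = s.customer_id;

    UPDATE s SET net_amount = s.gross_amount - s.tax_amount
    FROM #staged s;

    INSERT INTO rpt_sales (customer_key, net_amount)
    SELECT customer_key, net_amount FROM #staged;

    -- After: derive everything in one pass, with no staging and no updates.
    -- In the SSIS rewrite the join became a Lookup and the computed column
    -- a Derived Column, but the set-based equivalent is one INSERT ... SELECT.
    INSERT INTO rpt_sales (customer_key, net_amount)
    SELECT d.customer_key,
           s.gross_amount - s.tax_amount
    FROM   source_import s
    INNER JOIN dim_customer d ON d.customer_id = s.customer_id;
    ```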

    As you rightly said, you have to look at the picture as a whole. If you want to compare the two, then you have to measure source to transformed destination, e.g. the time it takes from loading your CSV file to having a cleaned dimension, say.

    Also, as I've said before, there are occasions when the Execute SQL task is perfect for the job (updating a large dimension, for example). However, if I were the boss, I'd be wary of developers using it everywhere, or with no real thought to the overall picture. It is for this reason that I'd say, as a general rule, don't use it unless there is a good reason to.

    Geez, I've written far more than I intended...

    It's a good discussion point you've raised, though... Should your boss be dictating how you develop packages? Is support of packages more important than how they perform? I don't know, but I'm interested to hear what others think.

    Kindest Regards,

    Frank Bazan

  • Frank raises several good points, especially about what a resource hog SSIS can be if it's on the same server as a DW or heavily used OLTP DB. Everything should always be tested, and then later on re-evaluated, to verify the package is the best it can be. But who defines what is best?

    I just love "it depends" answers. @=)

    Whether performance is worth any cost, or whether a package with little maintenance is better, really does depend on your resources, both hardware and human. In my environment, we have so many things going on that performance is much more important, along with the whole SOX / SDLC issue. But that might not always hold true. Hence the need, a year or two down the road, to re-evaluate the packages and make sure they are still fulfilling our needs.

    Anyway, that's my short ramble. @=)


  • So here is a sample of what I need to do.

    How can I do it with just SSIS, no "Execute SQL" task?

    I have loaded data from the REVADJ.txt file.

    Now I need to update the loaded data:

    UPDATE REVADJ.txt data
    SET valid = 'Y'
    FROM REVADJ.txt STG
    INNER JOIN cis_account cur
        ON cur.cur_acct_num = STG.cur_acct_num
       AND STG.cust_agrmnt_num IS NULL

    I tried to use a Lookup object in the Data Flow, but it didn't work.

    It would be very simple to just load the text file data into a staging table and execute an UPDATE, but I am trying to learn how to do this on the fly, without loading into a staging table.

  • UPDATE REVADJ.txt data
    SET valid = 'Y'
    FROM REVADJ.txt STG
    INNER JOIN cis_account cur
        ON cur.cur_acct_num = STG.cur_acct_num
       AND STG.cust_agrmnt_num IS NULL

    In this example, I'd say that your best bet is the Lookup component (although it can also be done with a Merge Join using a left outer join, which is not as quick). I'm not sure why this didn't work for you... if you post some details, we might be able to help.

    Drop the component onto the dataflow and drag the output of the flat file source onto it. Write a SELECT for the lookup with the columns you need from the [cis_account] table (cur_acct_num & cust_agrmnt_num, from the look of it). Ensure that if other columns are required for any downstream components, you include these in the SELECT. Check the columns you require and create the join between cur_acct_num in your pipeline and cur_acct_num in your lookup data.

    The output I'd expect here is the cust_agrmnt_num for any cur_acct_num that exists in your database table, and NULL for those that don't.

    If the join isn't working, it's likely to be because your datatypes aren't the same. You can fix this by casting in the SELECT statement or by adding a Data Conversion component between the flat file and your Lookup component.

    Then drag a Derived Column component onto the canvas and pull the output from your Lookup component to it.

    Create a new column [IsValid] and derive it based on an expression something like ISNULL(cust_agrmnt_num) ? "Y" : "N"

    You can then insert straight into the destination table.
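    For comparison with the staging-table route, the dataflow described above is roughly equivalent to this set-based query (a sketch; it assumes the flat-file rows have been landed in a hypothetical staging table stg_revadj, which is exactly the step the dataflow avoids):

    ```sql
    -- Set-based equivalent of the Lookup + Derived Column dataflow.
    -- stg_revadj is a hypothetical staging table holding the REVADJ.txt rows.
    SELECT STG.cur_acct_num,
           CASE WHEN cur.cur_acct_num IS NOT NULL
                THEN 'Y' ELSE 'N'
           END AS valid
    FROM   stg_revadj STG
    LEFT JOIN cis_account cur
           ON cur.cur_acct_num = STG.cur_acct_num
          AND STG.cust_agrmnt_num IS NULL;
    ```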

    HTH

    Kindest Regards,

    Frank Bazan
