SSIS Data Flow is very slow when Data Access Mode is SQL command

  • I have an SSIS package where, when I do a data flow from an entire table to a flat file, it's quite fast - a handful of minutes to output 50 million records. But when I use the data access mode SQL command from variable, it's unbelievably slow - like hours and hours.

    I really need to use this second method (SQL command from variable) because I am using a T-SQL method to chunk the 50-million-row dataset into chunks of whatever the user inputs as the variable (in my case 7 million), so the whole thing has to be dynamic.

    Is there anything I can do to make the SQL command from variable run a lot faster - like the table-to-flat-file extract did - or am I screwed?

  • I would bet it isn't the SQL command from variable that is causing the problem - it almost certainly is the query you have created that is causing the problem.

    There are actually known performance issues using the table options directly - where using a SQL command or command from variable is much faster.

    When you use the table directly, SQL Server performs a table scan (clustered index scan) to read the table and can start streaming the results immediately. With a command (query), SQL Server has to parse the query and filter the results, which means it will have a different execution plan. And since you are filtering the results, it is going to take longer to process.

    Review the generated query and optimize it.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Okay, point taken, and thank you for teaching me that. However, this query is very simple. It is literally just a SELECT of column 1, column 2 and column 3 from the table, and then there is an ORDER BY, an OFFSET and a FETCH. That's it. The point was to essentially output all the records, but chunked into 7 million at a time.
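
    Roughly this shape, in other words (table and column names changed here - the real ones don't matter; @Offset and @ChunkSize come from the package variables):

    DECLARE @ChunkSize bigint = 7000000;   -- user-supplied chunk size
    DECLARE @Offset    bigint = 0;         -- advanced by @ChunkSize for each pass

    SELECT  Column1, Column2, Column3
    FROM    dbo.BigTable
    ORDER BY Id
    OFFSET  @Offset ROWS
    FETCH NEXT @ChunkSize ROWS ONLY;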

  • I suggest that you run one or two of these 'dynamic' queries in SSMS and take a look at the actual execution plan. I agree with Jeffrey in that it is more likely to be the query which needs tuning than a problem with data access mode.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • This was removed by the editor as SPAM

  • This was removed by the editor as SPAM

  • nobab23 wrote:

    However, this query is very simple. It is literally just a SELECT of column 1, column 2 and column 3 from the table, and then there is an ORDER BY, an OFFSET and a FETCH. That's it. The point was to essentially output all the records, but chunked into 7 million at a time.

    Are you the same person as the OP? If so, why did you create a new account?

    What seems like a simple query to you may be giving the query engine a headache because your table is not indexed in a way which supports the query. This would explain why the partial extraction is slower than extracting the whole table and also explains why I asked about the execution plan.
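
    For instance (guessing at names here, since I have not seen the table), if the extract is ordered by an identity column, an index along these lines would let the engine read the rows already in order instead of sorting 50 million of them for every chunk:

    -- Hypothetical names: Id is the ORDER BY column, Column1-3 are the SELECT list
    CREATE NONCLUSTERED INDEX IX_BigTable_Id
        ON dbo.BigTable (Id)
        INCLUDE (Column1, Column2, Column3);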

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • This was removed by the editor as SPAM

  • isaac_654135 wrote:

    Okay, point taken, and thank you for teaching me that. However, this query is very simple. It is literally just a SELECT of column 1, column 2 and column 3 from the table, and then there is an ORDER BY, an OFFSET and a FETCH. That's it. The point was to essentially output all the records, but chunked into 7 million at a time.

    If you don't have indexes to support the filtering, ordering and fetching then that query will absolutely take longer than a simple table scan.

    If the goal is to chunk out the results into different files, there are better ways.  You can use a conditional split in SSIS - basing the split on either an existing column or a calculation.  A simple method would be to add a row number - either in the query or in SSIS - and then conditionally split the results based on that number using modulo (e.g. rownum MOD 7 - splitting on result = 0 through 6).
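
    Something like this in the source query, for example (table and column names are just placeholders):

    -- Bucket comes out as 0 through 6, so a conditional split with seven outputs
    -- can route each row to its own flat file destination.
    SELECT  Column1,
            Column2,
            Column3,
            ROW_NUMBER() OVER (ORDER BY Id) % 7 AS Bucket
    FROM    dbo.BigTable;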

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • HECK NO! I am not the same person as that weirdo spammer posting things about hair removal after my SQL Server post.

    FYI!

    I am a real person, as is pretty easily Googleable - my last and first name being my username.

  • isaac_654135 wrote:

    HECK NO! I am not the same person as that weirdo spammer posting things about hair removal after my SQL Server post.

    FYI!

    I am a real person, as is pretty easily Googleable - my last and first name being my username.

    No offence was intended.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Phil, sorry if I came across as overly "shout"-y - I just felt a sudden surge of worry that I might be mistaken for a spammer and given the boot, ha ha.

    Jeff, thanks - that seems brilliant, except I'm trying to create a template that can be reused super easily, where a person only has to change the source table and the input chunk size (or something fairly close to that).

    For the conditional split, wouldn't I have to create a hard, fixed number of data flow outputs, or are you thinking of something more dynamic?

    I am going to ponder your advice a bit more on my end as well. Thanks.

  • No worries.

    If your intention is to make this completely dynamic, such that the user inputs the table name and the desired number of chunks as parameters, you are going to run into query optimisation issues, as previously suggested.

    You will not easily be able to use dataflows and conditional splits, because these require the structure of the table (column names & datatypes) to be known at design time. I say 'easily', because it is possible to create a more-complex query which concatenates all of a table's columns into a single, delimited string and feeds that to a dataflow - SSIS can deal with that.
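
    A rough sketch of what I mean (table name and delimiter are made up, and it assumes SQL Server 2017+ for STRING_AGG and CONCAT_WS):

    -- Build a query which returns every row of @TableName as a single
    -- pipe-delimited string column, so the dataflow only ever sees one column.
    DECLARE @TableName sysname = N'dbo.BigTable';   -- supplied at run time
    DECLARE @cols nvarchar(max), @sql nvarchar(max);

    SELECT @cols = STRING_AGG(CONVERT(nvarchar(max), QUOTENAME(name)), N', ')
                       WITHIN GROUP (ORDER BY column_id)
    FROM   sys.columns
    WHERE  object_id = OBJECT_ID(@TableName);

    -- Note: CONCAT_WS skips NULLs, so NULL columns lose their delimiter.
    SET @sql = N'SELECT CONCAT_WS(''|'', ' + @cols + N') AS DelimitedRow FROM '
             + @TableName + N';';

    The @sql string is then what you would hand to the 'SQL command from variable' source.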

    If you are prepared to compromise and allow only a defined number of chunks - anything between one and five, for example - then a row number/conditional split solution should be viable.

    Out of interest, how are you intending for the user to pass these parameters into the package at run time?

    Another possibility would be to write out the table to a single file and then split it afterwards. This feels a bit dirty, because the same data gets written twice, but it is possible.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • I am a little open-minded as to which elements I'd like to be dynamic and which ones I'm OK with having the user put some design work into on a case-by-case basis, I suppose. I just could hardly believe my eyes at the rather extreme time difference (MANY hours as opposed to 5 minutes) incurred by changing from "output the whole table into a flat file" to "output a query using an order by, offset and fetch next". I guess it just makes offset and fetch not seem quite as nifty as they first seemed. (I realize that statement may be an overgeneralization on my part, but if it's taking SQL Server THAT long to order and find the right pages, it seems about as useful as an RBAR loop that loops through almost every darn row.)

    I actually was VERY interested in a solution that splits the file after the fact and, indeed, have written no fewer than two so far.

    So my benchmark is a Python script written by someone else that takes 2 hours to grab the records in 'chunks' and output them to text files. The very simple VBScript I wrote myself using FileSystemObject TextStreams already beats that by 25%, coming in at 1.5 hours.

    I then moved to VB.NET, reading the whole thing into an array using File.ReadAllLines, after which point I'm comfortable streaming line by line to text files - and I have a feeling this method would beat the best method so far by a long shot, maybe coming in at under an hour, not sure - maybe even better. My problem there is that when I read a 50-million-line text file into a VB.NET string array, it blows the memory and I get a System.OutOfMemoryException after about 20 seconds of reading it into the string array variable. The sysadmin tells me I should have plenty of memory, but they were going to get back to me about how to solve this.

    Yes, it's dirty, but hey - if it's 5x faster, I don't care how redundant it is on paper. The nice thing about splitting it after the fact is that the method could, of course, be extremely portable and reusable.

  • Sounds like you're having a lot of fun with this one!

    If you know the following two pieces of information:

    a) How many chunks are required

    b) How many rows are in the 'big file'

    it would be possible to calculate the chunk size (i.e., the number of rows per chunk).

    With this information, is it not possible to stream the file in and then out again in its chunks, rather than reading the whole thing into memory?

    This sort of thing is what I mean (adapted from a Stack Overflow post):

    // Stream the big file one line at a time instead of loading it all into memory
    using (StreamReader sr = File.OpenText(fileName))
    {
        string s;
        while ((s = sr.ReadLine()) != null)
        {
            // do your stuff here, e.g. write the line to the current chunk's output file
        }
    }

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel
