SSIS inserts are very slow

  • Stefan Krzywicki (11/4/2011)


    alexander.suprun (11/4/2011)


    Stefan Krzywicki (11/3/2011)


    Brandie Tarvin (11/3/2011)


    I forgot to mention something.

    If push comes to shove and you can't use an Execute SQL Task, and absolutely must have a Data Flow Task instead, use a stored procedure as your OLE DB data source instead of a table. It is still a lot faster than pulling your data from a table, view, or plain query in the Source Transformation.

    (Change the data access mode to SQL command, and call the proc with EXECUTE <myprocname>)

    Interesting. I'll keep it in mind. Fortunately, there shouldn't be anything keeping me from changing to an Execute SQL Task.

    There is NO big difference in performance between reading directly from a table and using a stored procedure. Why should it be even 1% faster? It still has to read the whole table, either from cache or from disk. View, table, stored procedure, or ad-hoc query - it doesn't matter unless you use parameters.

    You know, I remember hearing at one of the SQL Server seminars I've gone to that having SSIS read from a specified table is slower than writing the query yourself, but I don't remember why. I can believe that the stored procedure would be faster than either, simply because it could have a stored query plan and the other methods would not.

    A plan for "select * from Table" will be saved in the procedure cache as well, so there is no difference there. SSIS uses the same OLE DB data providers as any other client; it doesn't utilize anything fancy to get the data. So I am really eager to see an example where a stored procedure will outperform a table.
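    Alex's claim can be checked directly against the plan cache. A minimal sketch (the table name is a placeholder), assuming you have VIEW SERVER STATE permission:

    ```sql
    -- Run an ad-hoc query first, e.g.:  SELECT * FROM dbo.MyTable;
    -- Then look for its cached plan alongside any procedure plans:
    SELECT cp.objtype,   -- 'Adhoc' for plain queries, 'Proc' for stored procedures
           cp.usecounts, -- how often the cached plan has been reused
           st.text
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    WHERE st.text LIKE '%MyTable%';
    ```

    Both kinds of entries show up with a use count, which is the point: ad-hoc plans are cached and reused too.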


    Alex Suprun

  • alexander.suprun (11/4/2011)


    Stefan Krzywicki (11/4/2011)


    alexander.suprun (11/4/2011)


    Stefan Krzywicki (11/3/2011)


    Brandie Tarvin (11/3/2011)


    I forgot to mention something.

    If push comes to shove and you can't use an Execute SQL Task, and absolutely must have a Data Flow Task instead, use a stored procedure as your OLE DB data source instead of a table. It is still a lot faster than pulling your data from a table, view, or plain query in the Source Transformation.

    (Change the data access mode to SQL command, and call the proc with EXECUTE <myprocname>)

    Interesting. I'll keep it in mind. Fortunately, there shouldn't be anything keeping me from changing to an Execute SQL Task.

    There is NO big difference in performance between reading directly from a table and using a stored procedure. Why should it be even 1% faster? It still has to read the whole table, either from cache or from disk. View, table, stored procedure, or ad-hoc query - it doesn't matter unless you use parameters.

    You know, I remember hearing at one of the SQL Server seminars I've gone to that having SSIS read from a specified table is slower than writing the query yourself, but I don't remember why. I can believe that the stored procedure would be faster than either, simply because it could have a stored query plan and the other methods would not.

    A plan for "select * from Table" will be saved in the procedure cache as well, so there is no difference there. SSIS uses the same OLE DB data providers as any other client; it doesn't utilize anything fancy to get the data. So I am really eager to see an example where a stored procedure will outperform a table.

    So give it a try. As I mentioned before, I'm not going to have an environment set up to test this for a day or so.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Stefan Krzywicki (11/3/2011)


    Each has a GUID primary key.

    Unless you defined otherwise, that's likely your problem. Most people define PKs with the default settings, and the default settings make the PK a Clustered Index.

    If that's true, here's what happens at a 30,000 foot level...

    Tables are empty, so there's nothing to page split. The first insert comes along and the data is sorted during the insert, so no page splitting occurs, which allows it to run quite fast.

    Then the second set of inserts comes along. A GUID is nothing more than a random number. Most people also make the mistake of using a "0" FillFactor on such a PK. Where do you think additional random numbers (GUIDs) will fit on each page? The answer is, they won't until a page split occurs.

    And, because of the nature of your data, your tables will spend millions of times more resources and time doing page splits than it will actually doing the inserts.

    You would think that the fix is to create an IDENTITY column and use that as a Clustered Index to keep the post-initial inserts from spending most of their time fragging your table with page splits, and to put a unique non-clustered index on the GUID PK column. Maybe, but probably not... the non-clustered index will also split, but instead of doing mere page splits, it will do extent splits.

    Yes... if there are no FKs pointing to the GUID PK, you could simply drop the constraint before a batch insert and rebuild it, but even that will take a fair bit of time.

    It's things like this that cause me to find alternates for GUIDs. 😉
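    One alternative Jeff hints at, sketched as hedged DDL (all names here are hypothetical): cluster on an ever-increasing key and, if a GUID must stay, generate it with NEWSEQUENTIALID() so even the non-clustered index receives inserts in order rather than at random:

    ```sql
    CREATE TABLE dbo.MyTable
    (
        RowID   INT IDENTITY(1,1) NOT NULL,
        RowGuid UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT DF_MyTable_RowGuid DEFAULT NEWSEQUENTIALID(),
        Payload VARCHAR(100) NULL,
        CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (RowID),        -- inserts append; no splits
        CONSTRAINT UQ_MyTable_RowGuid UNIQUE NONCLUSTERED (RowGuid) -- sequential, so mostly append too
    );
    ```

    Note the default only helps for rows the server generates; GUIDs arriving from an outside feed are still random from the index's point of view.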

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Brandie Tarvin (11/4/2011)


    Evil Kraig F (11/3/2011)


    I should learn not to post when I'm typing quickly.

    Or when you have the lint ball monster from H3CK watching over your shoulder with its evil googly eyes. 😀

    "Lint ball monster"? Heh... you should see my dust bunnies. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • In the MCTS book on SQL 2005 BI, there is a paragraph on when SSIS can and cannot actually perform a fast load (bulk insert) style operation, even when the option is specified. Finding that information again is proving difficult (I don't have the book anymore).

    However, MSDN does have some useful information on the topic of bulk inserts in general:

    "If a non-empty table has indexes, bulk imports are fully logged, even under the bulk-logged recovery model. To decide whether to remove the indexes, consider whether the benefit of bulk importing into a table with no indexes is greater than the cost of dropping and re-creating them.

    If you bulk import data into an empty table with indexes and you specify the batch size, the table becomes non-empty after the first batch. Starting with the second batch, the data is fully-logged. For empty indexed tables, consider performing the bulk import in a single batch."

    There are several MSDN / books online pages about optimizing bulk inserts containing this and other information.

    I certainly agree with Jeff... your clustered GUIDs are killing you. But this may be another important point in terms of helping to explain what you're seeing, even if you can't really do anything about the fact that data is already in the table. On the other hand, I suppose you could go down the more advanced route of inserting into a new, empty table and then partition-switching the data in, though that's a lot of extra work.

    Additionally, if the data is already sorted by the destination clustered index before it actually lands in the table, make sure you tell SSIS that this is the case, and set the commit size to zero. If it is not already sorted, consider sorting it before inserting and setting the commit size to zero. This can be an order-of-magnitude improvement.
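    For comparison, the same ideas expressed in plain T-SQL rather than SSIS properties (file path, table, and key names are placeholders):

    ```sql
    -- ORDER declares the file is already sorted by the clustered key, so the
    -- engine can skip the sort; TABLOCK enables bulk-optimized locking/logging.
    -- Omitting BATCHSIZE loads the whole file as a single batch, matching the
    -- "single batch for empty indexed tables" advice quoted above.
    BULK INSERT dbo.MyTable
    FROM 'C:\loads\mytable.dat'
    WITH
    (
        ORDER (RowID ASC),
        TABLOCK
    );
    ```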

  • Jeff Moden (11/4/2011)


    Stefan Krzywicki (11/3/2011)


    Each has a GUID primary key.

    Unless you defined otherwise, that's likely your problem. Most people define PKs with the default settings, and the default settings make the PK a Clustered Index.

    If that's true, here's what happens at a 30,000 foot level...

    Tables are empty, so there's nothing to page split. The first insert comes along and the data is sorted during the insert, so no page splitting occurs, which allows it to run quite fast.

    Then the second set of inserts comes along. A GUID is nothing more than a random number. Most people also make the mistake of using a "0" FillFactor on such a PK. Where do you think additional random numbers (GUIDs) will fit on each page? The answer is, they won't until a page split occurs.

    And, because of the nature of your data, your tables will spend millions of times more resources and time doing page splits than it will actually doing the inserts.

    You would think that the fix is to create an IDENTITY column and use that as a Clustered Index to keep the post-initial inserts from spending most of their time fragging your table with page splits, and to put a unique non-clustered index on the GUID PK column. Maybe, but probably not... the non-clustered index will also split, but instead of doing mere page splits, it will do extent splits.

    Yes... if there are no FKs pointing to the GUID PK, you could simply drop the constraint before a batch insert and rebuild it, but even that will take a fair bit of time.

    It's things like this that cause me to find alternates for GUIDs. 😉

    Thanks, I had a feeling the GUIDs were at the very least a big part of it. It is one of the things I want to test in the Dev environment, as I'd need really compelling information to get them to switch. Since we get the data in with GUIDs, we'd have to change it on every record. Fortunately, I wouldn't have to keep the GUIDs in the table after that, so no index on a GUID, clustered or not, but it is still a hard argument to make.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • allmhuran (11/4/2011)


    In the MCTS book on SQL 2005 BI, there is a paragraph on when SSIS can and cannot actually perform a fast load (bulk insert) style operation, even when the option is specified. Finding that information again is proving difficult (I don't have the book anymore).

    However, MSDN does have some useful information on the topic of bulk inserts in general:

    "If a non-empty table has indexes, bulk imports are fully logged, even under the bulk-logged recovery model. To decide whether to remove the indexes, consider whether the benefit of bulk importing into a table with no indexes is greater than the cost of dropping and re-creating them.

    If you bulk import data into an empty table with indexes and you specify the batch size, the table becomes non-empty after the first batch. Starting with the second batch, the data is fully-logged. For empty indexed tables, consider performing the bulk import in a single batch."

    There are several MSDN / books online pages about optimizing bulk inserts containing this and other information.

    I certainly agree with Jeff... your clustered GUIDs are killing you. But this may be another important point in terms of helping to explain what you're seeing, even if you can't really do anything about the fact that data is already in the table. On the other hand, I suppose you could go down the more advanced route of inserting into a new, empty table and then partition-switching the data in, though that's a lot of extra work.

    Additionally, if the data is already sorted by the destination clustered index before it actually lands in the table, make sure you tell SSIS that this is the case, and set the commit size to zero. If it is not already sorted, consider sorting it before inserting and setting the commit size to zero. This can be an order-of-magnitude improvement.

    We already have the table partitioned and those partitions are quite large.

    I'll test sorting the GUIDs before insert as well, thanks to everyone for all the testing ideas.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Alexander Suprun (11/4/2011)


    Stefan Krzywicki (11/3/2011)


    Brandie Tarvin (11/3/2011)


    I forgot to mention something.

    If push comes to shove and you can't use an Execute SQL Task, and absolutely must have a Data Flow Task instead, use a stored procedure as your OLE DB data source instead of a table. It is still a lot faster than pulling your data from a table, view, or plain query in the Source Transformation.

    (Change the data access mode to SQL command, and call the proc with EXECUTE <myprocname>)

    Interesting. I'll keep it in mind. Fortunately, there shouldn't be anything keeping me from changing to an Execute SQL Task.

    There is NO big difference in performance between reading directly from a table and using a stored procedure. Why should it be even 1% faster? It still has to read the whole table, either from cache or from disk. View, table, stored procedure, or ad-hoc query - it doesn't matter unless you use parameters.

    Not true. It's where the read is happening that makes all the difference. And usually, when people read from a table in SSIS, they're reading all the columns of the table instead of just what they need. Even unchecking columns in SSIS doesn't mean it won't read those columns initially.

    So stored procedures can be faster unless you're choosing a "SELECT *" in the sproc.
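    A sketch of the contrast being discussed, with hypothetical names. Either form keeps the source down to the needed columns; the point is avoiding the "Table or view" access mode:

    ```sql
    -- As the OLE DB source's "SQL command", either use the query directly:
    --     SELECT CustomerID, OrderDate, Amount FROM dbo.Orders;
    -- or wrap the same projection in a procedure and call EXECUTE dbo.GetOrderFeed;
    CREATE PROCEDURE dbo.GetOrderFeed
    AS
    BEGIN
        SET NOCOUNT ON;
        SELECT CustomerID, OrderDate, Amount   -- only the needed columns, never SELECT *
        FROM dbo.Orders;
    END;
    ```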

    Brandie Tarvin, MCITP Database Administrator
    LiveJournal Blog: http://brandietarvin.livejournal.com/
    On LinkedIn!, Google+, and Twitter.
    Freelance Writer: Shadowrun
    Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (11/7/2011)


    Alexander Suprun (11/4/2011)


    Stefan Krzywicki (11/3/2011)


    Brandie Tarvin (11/3/2011)


    I forgot to mention something.

    If push comes to shove and you can't use an Execute SQL Task, and absolutely must have a Data Flow Task instead, use a stored procedure as your OLE DB data source instead of a table. It is still a lot faster than pulling your data from a table, view, or plain query in the Source Transformation.

    (Change the data access mode to SQL command, and call the proc with EXECUTE <myprocname>)

    Interesting. I'll keep it in mind. Fortunately, there shouldn't be anything keeping me from changing to an Execute SQL Task.

    There is NO big difference in performance between reading directly from a table and using a stored procedure. Why should it be even 1% faster? It still has to read the whole table, either from cache or from disk. View, table, stored procedure, or ad-hoc query - it doesn't matter unless you use parameters.

    Not true. It's where the read is happening that makes all the difference. And usually, when people read from a table in SSIS, they're reading all the columns of the table instead of just what they need. Even unchecking columns in SSIS doesn't mean it won't read those columns initially.

    So stored procedures can be faster unless you're choosing a "SELECT *" in the sproc.

    So it has nothing to do with stored procedures. It's not about what you use but how you use it. One can use a query or view inside the OLE DB data source, selecting only the columns that are needed, and get the same good effect as with a stored procedure. There is no magic here.

    I just don't want anyone to be confused by the statement that a stored procedure is a lot faster than a query, without any further clarification. Merely switching from a query/table/view to a stored procedure will not, by itself, improve performance.


    Alex Suprun

  • I will let you argue the point with the man I learned this from, Brian Knight. Talk to him.

    Brandie Tarvin, MCITP Database Administrator
    LiveJournal Blog: http://brandietarvin.livejournal.com/
    On LinkedIn!, Google+, and Twitter.
    Freelance Writer: Shadowrun
    Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Alexander Suprun (11/7/2011)


    Brandie Tarvin (11/7/2011)


    Alexander Suprun (11/4/2011)


    Stefan Krzywicki (11/3/2011)


    Brandie Tarvin (11/3/2011)


    I forgot to mention something.

    If push comes to shove and you can't use an Execute SQL Task, and absolutely must have a Data Flow Task instead, use a stored procedure as your OLE DB data source instead of a table. It is still a lot faster than pulling your data from a table, view, or plain query in the Source Transformation.

    (Change the data access mode to SQL command, and call the proc with EXECUTE <myprocname>)

    Interesting. I'll keep it in mind. Fortunately, there shouldn't be anything keeping me from changing to an Execute SQL Task.

    There is NO big difference in performance between reading directly from a table and using a stored procedure. Why should it be even 1% faster? It still has to read the whole table, either from cache or from disk. View, table, stored procedure, or ad-hoc query - it doesn't matter unless you use parameters.

    Not true. It's where the read is happening that makes all the difference. And usually, when people read from a table in SSIS, they're reading all the columns of the table instead of just what they need. Even unchecking columns in SSIS doesn't mean it won't read those columns initially.

    So stored procedures can be faster unless you're choosing a "SELECT *" in the sproc.

    So it has nothing to do with stored procedures. It's not about what you use but how you use it. One can use a query or view inside the OLE DB data source, selecting only the columns that are needed, and get the same good effect as with a stored procedure. There is no magic here.

    I just don't want anyone to be confused by the statement that a stored procedure is a lot faster than a query, without any further clarification. Merely switching from a query/table/view to a stored procedure will not, by itself, improve performance.

    While you might not get a benefit (or much of one; I'd need to test it to find out) from switching to a stored procedure from a SELECT of specific columns in the package, you will get a benefit from not using the "Table or view" access mode when selecting your data.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams
