Creating indexes on a very large table takes over 5 hours

  • (I originally used the wrong quote with my comments here, so I've repeated them below but with the correct quote first.)

    • This reply was modified 3 years, 11 months ago by  ScottPletcher.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • The ORDER referred to is when using BULK INSERT, for example loading a physical file into a table.

    In those cases, adding "WITH ( ORDER(clus_key1, clus_key2) )" can allow SQL to do bulk loading "knowing" that the data coming in is already sorted in clus key(s) order.  Otherwise SQL assumes the data is not in order which can result in a far less efficient load.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Jeff Moden wrote:

    @pamkagel ,Do your insert in the following form...

     INSERT INTO dbo.yourtablename WITH (TABLOCK)
    (full column list here)
    SELECT full column list here
    FROM dbo.whatever
    JOIN dbo.whatever etc
    WHERE whatever relationship is necessary to get the right data but must be SARGable
    ORDER BY list of column(s) in the exact same order as the clustered index
    OPTION (RECOMPILE) --undocumented but usually required and won't be in execution plan if you're lucky with your joins
    ;

    The WITH(TABLOCK), ORDER BY in the correct column order, and the OPTION (RECOMPILE) are absolutely essential to get the minimal logging.

    ...

    Of course, you actually have already been told partially how to do this by Scott in his post above.   The two things he left out is the ORDER BY to guarantee the inserts will "follow the rules" for "Minimal Logging" and the undocumented OPTION(RECOMPILE), which won't hurt if it's actually not necessary and helps guarantee that you'll get the "Minimal Logging".  In that dual 120 Million row row insert test I told you about above, IT was the difference between a 4 minute load/600MB log file and a 20 minutes load/40GB log file.

    The ORDER BY is useless (rather than "absolutely essential").  ORDER BY doesn't control the order of physical INSERTs into a table, only the assignment of identity values.  I've quoted Paul White's details on minimal logging to an empty clustered index below (emphasis was in the original, it was not added by me).

    I do not feel that I gave "partial" answer.  I stated what was needed to get minimal logging.  Although the RECOMPILE couldn't hurt, I can't imagine it being needed here.

    https://sqlperformance.com/2019/05/sql-performance/minimal-logging-empty-clustered ::

    Unlike other bulk loading methods, it is not possible to specify the required ORDER hint on an INSERT...SELECT statement. This hint is not the same as using an ORDER BY clause on the INSERT...SELECT statement. An ORDER BY clause on an INSERT only guarantees the way any identity values are assigned, not row insert order.

    For INSERT...SELECT, SQL Server makes its own determination whether to ensure rows are presented to the Clustered Index Insert operator in key order or not. The outcome of this assessment is visible in execution plans through the DMLRequestSort property of the Insert operator. The DMLRequestSort property must be set to true for INSERT...SELECT into an index to be minimally logged. When it is set to false, minimal logging cannot occur.

    Having DMLRequestSort set to true is the only acceptable guarantee of insert input ordering for SQL Server. One might inspect the execution plan and predict that rows should/will/must arrive in clustered index order, but without the specific internal guarantees provided by DMLRequestSort, that assessment counts for nothing.

    When DMLRequestSort is true, SQL Server may introduce an explicit Sort operator in the execution plan. If it can internally guarantee ordering in other ways, the Sort may be omitted. If both sort and no-sort alternatives are available, the optimizer will make a cost-based choice. The cost analysis does not account for minimal logging directly; it is driven by the expected benefits of sequential I/O and the avoidance of page splitting.

    DMLRequestSort Conditions

    Both of the following tests must pass for SQL Server to choose to set DMLRequestSort to true when inserting to an empty clustered index with table locking specified:

    An estimate of more than 250 rows at the input side of the Clustered Index Insert operator; and

    An estimated data size of more than 2 pages. The estimated data size is not an integer, so a result of 2.001 pages would meet this condition.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher wrote:

    The ORDER BY is useless (rather than "absolutely essential").  ORDER BY doesn't control the order of physical INSERTs into a table, only the assignment of identity values.  I've quoted Paul White's details on minimal logging to an empty clustered index below (emphasis was in the original, it was not added by me).

    I do not feel that I gave "partial" answer.  I stated what was needed to get minimal logging.  Although the RECOMPILE couldn't hurt, I can't imagine it being needed here.

    I don't blame you for your skepticism.  My experiences on these things will appear heterodoxic to most.

    The ORDER BY will be skipped if it's not actually needed.  Do you absolutely know every time it will not be needed?  It's a good safety to keep.   The same goes for the Recompile.  It probably won't be needed here, especially since since there are no runtime-populated variables used but, compared to the overall problem, it's a good safety.  It's a bit like saying to always use >= and < instead of BETWEEN for date/time criteria... yes, there are places where BETWEEN will work but if the underlying datatypes ever change, it might not continue to work as expected so I tell folks to not take the chance.  Keep the code bullet proof.  And with that, I say they are required.  It's ok for you to disagree but, to me, the code is missing some safety if those things are missing.

    Looking back a bit and just to be sure on a previous post on this thread... when I mentioned to not use the SELECT/INTO, I was talking to the OP because he used SELECT/INTO to populate the table and then add the Clustered Index.  You used SELECT/INTO the right way (rowless SELECT/INTO to create an empty table).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    ScottPletcher wrote:

    The ORDER BY is useless (rather than "absolutely essential").  ORDER BY doesn't control the order of physical INSERTs into a table, only the assignment of identity values.  I've quoted Paul White's details on minimal logging to an empty clustered index below (emphasis was in the original, it was not added by me).

    I do not feel that I gave "partial" answer.  I stated what was needed to get minimal logging.  Although the RECOMPILE couldn't hurt, I can't imagine it being needed here.

    I don't blame you for your skepticism.  My experiences on these things will appear heterodoxic to most.

    The ORDER BY will be skipped if it's not actually needed.  Do you absolutely know every time it will not be needed?  It's a good safety to keep.   The same goes for the Recompile.  It probably won't be needed here, especially since since there are no runtime-populated variables used but, compared to the overall problem, it's a good safety.  It's a bit like saying to always use >= and < instead of BETWEEN for date/time criteria... yes, there are places where BETWEEN will work but if the underlying datatypes ever change, it might not continue to work as expected so I tell folks to not take the chance.  Keep the code bullet proof.  And with that, I say they are required.  It's ok for you to disagree but, to me, the code is missing some safety if those things are missing.

    Looking back a bit and just to be sure on a previous post on this thread... when I mentioned to not use the SELECT/INTO, I was talking to the OP because he used SELECT/INTO to populate the table and then add the Clustered Index.  You used SELECT/INTO the right way (rowless SELECT/INTO to create an empty table).

    ORDER BY is never needed, in the sense that it's irrelevant to whether SQL uses minimal logging or not.  Therefore, the code doesn't need it for "safety".  And it's potentially a problem if the clustering key changes and the ORDER BY then doesn't match it.  You might cause SQL Server extra work in that case.

     

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher wrote:

    ORDER BY is never needed, in the sense that it's irrelevant to whether SQL uses minimal logging or not.  Therefore, the code doesn't need it for "safety".  And it's potentially a problem if the clustering key changes and the ORDER BY then doesn't match it.  You might cause SQL Server extra work in that case.

    I don't have a specific example to justify the ORDER BY.  Only what I've experienced in the past.  I'll see if I can come up with one or not.  I'll either prove that I'm right or that you're right. 😀

    In either case and whatever anyone believes, the key to speeding this up is as we've both done... create the table WITH the CI in place and use "Minimal Logging" to populate it instead of building the CI after the table population.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @sergiy Thank you so much for your help on updating the query to make creating the view possible. That small change to the where exists made all the difference.  No longer needing to create the table or indexes.  Appreciate it!!

    • This reply was modified 3 years, 11 months ago by  pamkagel.
  • @pamkagel ,

    Based on a phone call I got earlier today, do you work in Michigan, U.S.A.?  I ask because it basically the same question I was asked if I could help with.  I don't want to name the company here but their office is in Ann Arbor, Michigan.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @jeffModen  I do not work in Michigan.  I work in California.  So not the same company.  But thats interesting.

  • pamkagel wrote:

    @Sergiy Thank you so much for your help on updating the query to make creating the view possible. That small change to the where exists made all the difference.  No longer needing to create the table or indexes.  Appreciate it!!

    You’re welcome.

    Glad it’s worked out for you.

    Probably you’ll see more places in the database where similar approach could deliver as well.

    _____________
    Code for TallyGenerator

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply