Need some help with a large data volume optimization

  • Did you ever do anything with that missing index on those processes in the c.u.r.s.o.r.? (Or making the CI on BatchGroup?)

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Craig Farrell (12/13/2010)


    It's definitely set to true, but that's mostly for the secondary attribute columns...

    OK, I'm going to assume that you need NULLs from the input file for reasons I don't need to know about, but that you now have everything you need to ensure the output of the SSIS package is sorted in the same order as the clustered index keys on the staging table, and with column types and nullability that exactly match the schema of both the staging table and PortSecAgg.

  • WayneS (12/13/2010)


    Did you ever do anything with that missing index on those processes in the c.u.r.s.o.r.? (Or making the CI on BatchGroup?)

    I'm not dismissing the idea at all, but I think there's much more value in restructuring the data load as I have been describing. If we were to limit ourselves to optimizing the existing process, I'd agree that specific index changes, matching the schemas between stage and destination, making the clustered index on stage unique, and replacing the left join/null check combination with a more efficient NOT EXISTS...would all be good things to do. We could certainly speed things up a bit by doing these things, but we would be missing a much bigger opportunity.
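
    For reference, the LEFT JOIN/IS NULL to NOT EXISTS rewrite I mean looks something like this - a minimal sketch only, with the key columns taken from this thread and the remaining column lists abbreviated:

    -- Anti-join via NOT EXISTS instead of LEFT JOIN ... WHERE ... IS NULL
    -- (PSA_Staging is the staging table named later in this thread)
    INSERT dbo.PortSecAgg (AsOfDate, PortfolioID, SecurityID, IsShortPosition /* , ... */)
    SELECT s.AsOfDate, s.PortfolioID, s.SecurityID, s.IsShortPosition /* , ... */
    FROM dbo.PSA_Staging AS s
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM dbo.PortSecAgg AS p
        WHERE p.AsOfDate = s.AsOfDate
          AND p.PortfolioID = s.PortfolioID
          AND p.SecurityID = s.SecurityID
    );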

  • Turning to your previous long post, Craig. First up, the sorting and column data types and nullability:

    The idea of sorting the SSIS output on the clustered index keys of the staging table is to enable minimally-logged fast load (MLFL). Such loads are typically an order of magnitude (or more) faster than fully-logged loads. The conditions for MLFL are extensively documented in Books Online (I'll try to remember to include some links at the end of this post) but the important ones for your situation, on SQL Server 2005 are:

    1. Empty destination table with a clustered index (no non-clustered indexes)

    2. TABLOCK hint

    3. Data presented in clustered key order

    4. ORDER hint specified

    5. Database using the BULK_LOGGED or SIMPLE recovery models

    Typically, these are achieved by:

    1. Truncating the staging table at the start of the SSIS package (an Execute SQL task)

    2. Dropping any nonclustered indexes on the staging table (another Execute SQL task)

    3. Ticking the 'Table Lock' checkbox of the OLE DB Destination component and choosing 'Table or View - Fast Load'

    4. Sorting the data in SSIS (or otherwise arranging the flow so that the required order is guaranteed)

    5. Setting the FastLoadOptions in the Advanced Editor of the OLE DB Destination

    6. Ensuring the database is not in FULL recovery (with a valid backup chain)

    Step 5 bears a little extra explanation as it is easily overlooked. You need to open the Advanced Editor on the OLE DB Destination component, switch to the Component Properties tab, and manually edit the FastLoadOptions setting. If you ticked the Table Lock checkbox, the existing entry will be "TABLOCK". This needs to be changed to, for example, "TABLOCK, ORDER (AsOfDate, PortfolioID, SecurityID, IsShortPosition)". For clarity, the ORDER hint needs to exactly match the SSIS Sort order and the clustering keys of the target table.
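
    If it helps to see the same hints outside SSIS, this is the T-SQL equivalent using BULK INSERT - a sketch only, with a placeholder file path and terminators:

    -- Minimally-logged bulk load into the empty staging table
    -- (file path, terminators, and staging table name are placeholders)
    BULK INSERT dbo.PSA_Staging
    FROM '\\server\share\PortSecAgg_20101213.txt'
    WITH
    (
        TABLOCK,                                                    -- condition 2
        ORDER (AsOfDate, PortfolioID, SecurityID, IsShortPosition), -- conditions 3 and 4
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n'
    );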

    On the column types and nullability. Your DDL (back in post #1) included the following mismatches between stage and PortSecAgg:

    1. PortfolioCode: VARCHAR(10) versus VARCHAR(12) in staging

    2. Cusip: VARCHAR(15) versus VARCHAR(40)

    3. IsSubAccount: BIT versus VARCHAR(4)

    4. IsShortPosition: BIT versus INTEGER (computed column)

    In addition, the following columns disagree on nullability:

    1. PortfolioID

    2. SecurityID

    3. AsOfDate

    4. IsShortPosition (computed column is NULLable)

    For clarity, I am asking you to make IsShortPosition a regular BIT NOT NULL column in stage - get rid of that computed column.
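
    Something along these lines - a sketch, assuming the staging table is dbo.PSA_Staging and is empty when you run it (otherwise the NOT NULL column needs a DEFAULT):

    -- Replace the computed column with a real BIT NOT NULL column
    ALTER TABLE dbo.PSA_Staging DROP COLUMN IsShortPosition;
    ALTER TABLE dbo.PSA_Staging ADD IsShortPosition BIT NOT NULL;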

    The other schema differences which need to be fixed are:

    1. PortSecAgg has a primary key, whereas the staging table has a non-unique clustered index. Both need the same PK.

    2. The staging table has the extra columns LoadFileName and BatchGroup (these are no longer required?)

    Finally, for neatness as much as anything, I would change stage to have the same columns in the same order as PortSecAgg.

    edit: I did forget the links!

    Prerequisites for Minimal Logging in Bulk Import

    In fact, you should review all the content under Optimizing Bulk Import Performance.

    I appreciate that's a fair amount of reading, but these are important concepts that will repay the effort many times over in future work.

  • On the question of processing a single item at a time.

    It should be clear that processing complete new-day files one at a time makes sense. By one at a time, I mean that stage is cleared and loaded with a single day, which is then switched into the destination. There is nothing to stop you using a for each loop in the SSIS to load single days one after another - the key is that stage is cleared, fast bulk-loaded with a single day, and then switched into PortSecAgg. I hope that is clear, please ask if not.

    As far as deltas are concerned, as I said before, you may find that there is little benefit in using the switching strategy here. You might choose to clear stage, load all the deltas, and then use a cursor to move the combined stage deltas to the destination. Which strategy I would choose might depend on the number of rows per AsOfDate versus the number of rows in a typical delta.

    Let me be clear about the overriding concerns:

    1. You always want to use minimally-logged fast load (MLFL) from SSIS to stage. I would only consider relaxing this if it turns out that each delta is very small.

    2. Using MLFL needs an empty staging table, so you cannot achieve MLFL for any except the first load if you load deltas consecutively. If you can build your SSIS package to read all deltas, and submit them together, all at once, to an empty stage table, that's fine.

    3. Once stage is loaded with all the deltas you have, you would use a cursor or other looping construct to iterate over the AsOfDate/PortfolioID combinations. For each set, DELETE PortSecAgg WHERE AsOfDate = 'constant 1' AND PortfolioID = 'constant 2'. It is crucial that the DELETE plan shows a single Clustered Index Delete operator here.

    4. Finally, insert the new rows from stage to PortSecAgg. You can do this INSERT either inside the loop (one per AsOfDate/PortfolioID combination) or in a single step after all the deletions. Experiment to see which is more efficient. A rough sketch of steps 3 and 4 follows this list.
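
    Here is that sketch (data types are assumptions - match your schema - and I show the single-INSERT variant; the cursor could equally be a WHILE loop):

    -- Step 3: iterate the distinct AsOfDate/PortfolioID combinations in stage
    DECLARE @AsOfDate DATETIME,
            @PortfolioID INT;

    DECLARE delta CURSOR LOCAL FAST_FORWARD FOR
        SELECT DISTINCT AsOfDate, PortfolioID
        FROM dbo.PSA_Staging;

    OPEN delta;
    FETCH NEXT FROM delta INTO @AsOfDate, @PortfolioID;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- One combination at a time: the plan should show a single
        -- Clustered Index Delete (no join, sort, or spool)
        DELETE dbo.PortSecAgg
        WHERE AsOfDate = @AsOfDate
          AND PortfolioID = @PortfolioID;

        FETCH NEXT FROM delta INTO @AsOfDate, @PortfolioID;
    END;

    CLOSE delta;
    DEALLOCATE delta;

    -- Step 4 (single-step variant): insert all new rows after the deletes
    INSERT dbo.PortSecAgg (AsOfDate, PortfolioID, SecurityID, IsShortPosition /* , ... */)
    SELECT AsOfDate, PortfolioID, SecurityID, IsShortPosition /* , ... */
    FROM dbo.PSA_Staging;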

    That describes how I would look at implementing the delta loads, without using partition switching. If you agree, I would recommend keeping things as simple as possible at this stage - so don't use the partition switching for the deltas. That is a potential optimization that may or may not turn out to be worth the coding effort.

    If you can provide rough estimates for the number of rows per AsOfDate, and the number of rows in a single-file delta, that would be interesting to know, thanks.

  • Having said I think it's best to keep things simple and skip the partition-switching with the delta loads, I see you asked about determining the partition number for the T1 switching step. There are a couple of ways of doing this, but all are broadly similar. Whatever method we use to determine the correct partition number, we usually end up constructing the SWITCH command in dynamic SQL, and executing that.

    In some cases, the partition number is obvious and can be calculated directly (in my simple example, the partition number is always one more than the day component of the date). A more robust idea is to use the $PARTITION system function (http://msdn.microsoft.com/en-us/library/ms188071(v=SQL.90).aspx) for example:

    SELECT $PARTITION.PF('20101205') returns the partition number the date 5th December 2010 falls into, according to the partition function PF.

    We could also inspect the sys.partition_functions and sys.partition_range_values system views, but $PARTITION is often better.
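
    Putting that together with the dynamic SQL, the switch might look like this - a sketch, where PF is the partition function from the example above and the table names are assumptions:

    -- Determine the target partition, then build and run the SWITCH
    DECLARE @PartitionNumber INT,
            @sql NVARCHAR(500);

    SELECT @PartitionNumber = $PARTITION.PF('20101205');

    SET @sql = N'ALTER TABLE dbo.PSA_Staging '
             + N'SWITCH TO dbo.PortSecAgg PARTITION '
             + CONVERT(NVARCHAR(10), @PartitionNumber)
             + N';';

    EXEC (@sql);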

    I'm curious as to why my existing clustered index on PortSecAgg, which is based on AsOfDate/PortfolioID/hanging columns, doesn't do the same thing as a range removal. However, because each data file for historical edits holds multiple PortfolioIDs, even this way I would need to either loop-delete, or join to a derived table to find just the AsOfDate/PortfolioID combinations. I'm not sure how this avoids joining at all.

    The deletion approach in your first post tried to do too much at once, resulting in scans, hashes, sorts, spools and the like.

    Crucially, you were trying to perform a selective delete, followed by an update. The query joined PortSecAgg to your temporary table, and then left joined to stage to delete only the rows that didn't find a match in stage.

    A better approach is:

    1. Use the distinct list of AsOfDate/PortfolioID combinations in stage held in @DelPortList to drive a loop or cursor.

    2. Delete rows from PortSecAgg for the single AsOfDate/PortfolioID combination.

    3. Rinse and repeat.

    This avoids the huge join from PortSecAgg to @DelPortList to stage. Because step 2 does not read from PortSecAgg, only writes, Halloween protection is not required - so no spool.

    Assuming no non-clustered indexes on PortSecAgg, the plan for step 2 will show a single Clustered Index Delete operator. This uses the clustered index to seek the rows matching the explicit single values for AsOfDate and PortfolioID, and deletes them. You can't do a logged deletion any faster than that.

    Let me be clear about the underlying reasons: even if you changed your method to just join @DelPortList to PortSecAgg and delete all the rows that match, you're still joining the two tables before the delete. SQL Server will generally not choose a nested loops join, preferring a parallel hash join. You'll need a sort to present the matched rows in clustered index order, and a spool to separate the reading part of the plan from the writing.

    The loop-and-delete-single values alternative avoids the join, sort, and spool. Try it.

    I think I'm going to tackle these one piece at a time. As strong as the partitioning structure sounds like it may be, I want to get baselines before I tackle that side of things, and there's plenty of other cleanup to do here. There's also the problem that I'm a contractor: if I can't explain it, I can't hand it off when I'm done.

    That's absolutely fine. For my part, I'm just answering a forum question, with the best method I am aware of. It's a very powerful technique, in common usage at large sites, so even if you don't end up using it here, you will benefit from becoming familiar with it for the future.

    Two more things that will help with your broad understanding of this area:

    How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005 (TechNet)

    Partitioned Tables and Indexes in SQL Server 2005 (MSDN - Kimberly Tripp)

  • Gyeah!!!! Hold on Paul! 😀

    My word man, you make me look like I never type anything! 😛 Though, I appreciate the extensive discussion. Don't misunderstand.

    Going to tackle this in order....

    The idea of sorting the SSIS output on the clustered index keys of the staging table is to enable minimally-logged fast load (MLFL). Such loads are typically an order of magnitude (or more) faster than fully-logged loads. The conditions for MLFL are extensively documented in Books Online (I'll try to remember to include some links at the end of this post) but the important ones for your situation, on SQL Server 2005 are:

    1. Empty destination table with a clustered index (no non-clustered indexes)

    2. TABLOCK hint

    3. Data presented in clustered key order

    4. ORDER hint specified

    5. Database using the BULK_LOGGED or SIMPLE recovery models

    I'm actually dropping the clustered index before the insert and then rebuilding it after. I get a two-for-one that way: updated statistics and, in theory, faster loads.

    Offhand, I'm in simple recovery. This doesn't need point-in-time recovery.

    The disparities between the tables are cured. That was part of today's cleanup exercises.

    There's going to be a problem with having the same PK: BatchGroup will be required to avoid collisions (unless I do a single-file load at a time, which is not a current option). The deltas need to be loaded, cleaned, washed, pressed, starched, and then shipped to the reporting DB. This process will basically be run twice, just with all the associations ready to go.

    I use LoadFileName to find bad data. Our proprietary reporting system has a habit of doing something stupid.

    I need to read those links, and will try to go through them tomorrow night when I get home. As usual, the more interesting a problem I have, the more other crap piles up in my workbin. 🙂

    I do agree that, if I didn't have to ship the delta-load to another db, a single day of processing at a time would make perfect sense, and would be incredibly fast. I'm debating a stream split: one path to create the delta to ship, the other to do the local loads in the one-day-at-a-time methodology. The only problem with this is multiple batches.

    Example: I receive batches 1, 2, and 3. All batches can have the same PortfolioID/AsOfDate combo. Batch 3 needs to win the argument. Unless I conceive of a way to make sure batches are loaded in the correct order (which I didn't do, initially expecting an all-at-once scenario), I can't guarantee this result. It would require me skipping the built-in for-each loop and doing a roll-my-own scenario, ordering filenames and the like.

    There are expected NC indexes on the PortSecAgg table for optimization. There are later processes that use this data internal to the process, and I haven't finished with them yet. What I can do to help with the MLFL is to remove them before I start and add them back in after. However, that doesn't fix the PK issue I have with needing all the deltas loaded to do priority comparisons.

    The average current-day AsOfDate file contains ~180k rows. The average delta file will contain between 60k and 120k rows.

    Thanks for the information about the partition switch identification. I'll look into that after I read the first two doorstoppers. 😎

    I see what you mean about the deletion problem, and you're right. One of the things I've done is completely removed the whole lookback to the PSA_Staging table after building the deletion #tmp reference. I've destroyed the entire Delete/Update/Insert methodology which required the anti-joins, and am doing a straight delete/insert by batch now. That much at least penetrated my thick skull.

    The clustered index on the #tmp matches the same order as the PortSecAgg table, so it should straight merge to find the data. If I have to, I'll beat it over the head and hint it.

    The loop-and-delete-single values alternative avoids the join, sort, and spool. Try it.

    I will. I want to be able to compare the two results with benchmarks, so when someone asks me down the line why, I can tell them the answers, with numbers. 🙂

    That's absolutely fine. For my part, I'm just answering a forum question, with the best method I am aware of. It's a very powerful technique, in common usage at large sites, so even if you don't end up using it here, you will benefit from becoming familiar with it for the future.

    I agree fully. I'm not knocking the idea, nor am I against implementing it per se. I fear overcoding it to the level that my production DBAs are in fear of me ever writing another piece of SQL they have to support at 3AM. I'm also playing catch-up on enough of the other items you've presented that I'm feeling a little sheepish.

    Thanks again. Let me catch up on my reading with the articles you recommended, and get the rest of the cleanup and benchmarks done, and hopefully I can grok all of this without nodding my head and pretending to follow along. 😀


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • No worries. Have fun with it! I've enjoyed this thread, but I have some writing to do now. I'll stay subscribed to the thread just in case - thanks again for the interesting topic 🙂

  • Paul White NZ (12/13/2010)


    WayneS (12/13/2010)


    Did you ever do anything with that missing index on those processes in the c.u.r.s.o.r.? (Or making the CI on BatchGroup?)

    I'm not dismissing the idea at all, but I think there's much more value in restructuring the data load as I have been describing. If we were to limit ourselves to optimizing the existing process, I'd agree that specific index changes, matching the schemas between stage and destination, making the clustered index on stage unique, and replacing the left join/null check combination with a more efficient NOT EXISTS...would all be good things to do. We could certainly speed things up a bit by doing these things, but we would be missing a much bigger opportunity.

    Paul, I agree with this. I was just curious to see if it had been checked out or not, since it seems that it had slipped by several others also. However, it didn't affect the big time chunk of the data load, and this partitioning seems like it would handle things really sweet - I've not worked with partitioning before, and I definitely want to see how this progresses!

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Paul White NZ (12/13/2010)


    Typically, these are achieved by:

    1. Truncating the staging table at the start of the SSIS package (an Execute SQL task)

    2. Dropping any nonclustered indexes on the staging table (another Execute SQL task)

    3. Ticking the 'Table Lock' checkbox of the OLE DB Destination component and choosing 'Table or View - Fast Load'

    4. Sorting the data in SSIS (or otherwise arranging the flow so that the required order is guaranteed)

    5. Setting the FastLoadOptions in the Advanced Editor of the OLE DB Destination

    6. Ensuring the database is not in FULL recovery (with a valid backup chain)

    Would there be any issues with building the staging table (dropping it first if it existed) in tempdb? This is a database that you can ensure will be in simple recovery mode. Dropping/creating the table will ensure that it is both empty, and only has the indexes built on it by the process.

    I think that the SSIS package might be able to work with a #temp table by specifying to retain an open connection in the connection manager. Would this be a viable option here also?

    Or is it not worth the above considerations?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (12/14/2010)


    Would there be any issues with building the staging table (dropping it first if it existed) in tempdb? This is a database that you can ensure will be in simple recovery mode. Dropping/creating the table will ensure that it is both empty, and only has the indexes built on it by the process.

    I think that the SSIS package might be able to work with a #temp table by specifying to retain an open connection in the connection manager. Would this be a viable option here also?

    Or is it not worth the above considerations?

    I try to avoid unnecessary load on tempdb, especially stuff that is going to hit the allocation structures hard. Adding a check for FULL recovery to the SSIS package, and switching to BULK_LOGGED if required, is my preference. I also prefer an explicit TRUNCATE / non-clustered index drop in the SSIS script to make it self-contained, but that may be just habit.
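
    The check itself is a one-liner in an Execute SQL task, something like this (the database name is a placeholder; remember to switch back and take a log backup afterwards if you do this):

    -- If the database is in FULL recovery, switch to BULK_LOGGED for the load
    IF EXISTS (SELECT 1 FROM sys.databases
               WHERE name = N'YourDatabase'
               AND recovery_model_desc = N'FULL')
        ALTER DATABASE YourDatabase SET RECOVERY BULK_LOGGED;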

  • Paul White NZ (12/14/2010)


    WayneS (12/14/2010)


    Would there be any issues with building the staging table (dropping it first if it existed) in tempdb? This is a database that you can ensure will be in simple recovery mode. Dropping/creating the table will ensure that it is both empty, and only has the indexes built on it by the process.

    I think that the SSIS package might be able to work with a #temp table by specifying to retain an open connection in the connection manager. Would this be a viable option here also?

    Or is it not worth the above considerations?

    I try to avoid unnecessary load on tempdb, especially stuff that is going to hit the allocation structures hard. Adding a check for FULL recovery to the SSIS package, and switching to BULK_LOGGED if required, is my preference. I also prefer an explicit TRUNCATE / non-clustered index drop in the SSIS script to make it self-contained, but that may be just habit.

    Any reason why you don't disable instead of drop, to avoid the hassle of keeping that script synchronized?

  • Ninja's_RGR'us (12/14/2010)


    Any reason why you don't disable instead of drop, to avoid the hassle of keeping that script synchronized?

    I've written it both ways in different scenarios.

    Wouldn't it be nice if there was a form of ALTER INDEX ALL ON (object) DISABLE command that didn't affect the clustered index? Loading data into a table with a disabled clustered index is a very short operation indeed 😉
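
    For completeness, on 2005 the non-clustered indexes have to be disabled by name to keep the clustered index usable - for example (the index name is a placeholder):

    -- Disable a non-clustered index before the load, rebuild it after;
    -- ALTER INDEX ALL ... DISABLE would take the clustered index down too
    ALTER INDEX IX_PortSecAgg_Cusip ON dbo.PortSecAgg DISABLE;
    -- ... bulk load ...
    ALTER INDEX IX_PortSecAgg_Cusip ON dbo.PortSecAgg REBUILD;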
