Long Running Transaction when inserting large amounts of data

  • I have a problem where it's taking a long time (90 minutes) to commit a transaction. The transaction involves inserting a large number of records from several data files into several tables. The total size of all data files averages 60MB and can be as high as 300MB; the 90 minutes is for the 60MB case. Due to business reasons, the inserted rows need to be accessible at any time in an unaltered state (LOAD STATE) for reporting purposes, while the same data can have operations performed on it (CALCULATION STATE). The application uses integrated security, and regular users do not have access to perform bulk inserts.

    The following occurs:

    1. Start transaction from client application.

    2. Bulk insert several files into several tables which I'll classify as BULK INSERT tables (B). The purpose of this step is to get the data into a database.

    a) The B tables have no indexes or constraints.

    b) The B tables have an identity column defined as INT NOT NULL IDENTITY(1,1).

    c) The maximum size of each data file is 10MB, but there could be as many as 20 data files being inserted into one table (>100,000 rows).

    d) Step c can be repeated for other tables.

    e) For security reasons, regular users do not have access to perform bulk insert operations, so data files are uploaded to the server from the client and then bulk inserted under another account via .NET Remoting.

    3. Copy data from the B tables into LOAD (L) tables. The purpose of this step is to record the data in its load state.

    a) The L tables have a clustered index on an internal record number plus non-clustered indexes.

    b) The copying of data is performed using INSERT INTO xxx (,,,) SELECT ... FROM yyy.

    4. Copy data from L tables into CALCULATION (C) tables. The purpose of this step is to make the loaded data available for calculations.

    a) The C tables have triggers which update a hierarchy column. This column is used to trace back to the original row as calculations are performed on the tables. For example, when row 3000 is copied and becomes row 3001, the hierarchy column of row 3001 is set to "3000.3001" and row 3000 is deactivated (its bit column is set from 1 to 0). A rough sketch of such a trigger follows the step list below.

    b) Step a is required to view the calculations at any point in time for tracing purposes.

    c) The copying of data is performed using INSERT INTO xxx (,,,) SELECT ... FROM yyy.

    d) The C tables also have a clustered index on an internal record number plus non-clustered indexes.

    5. Commit transaction.
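
    To illustrate step 4a, a minimal sketch of such a trigger could look like the following. All object names (dbo.Calc, RecordNo, ParentRecordNo, Hierarchy, Active) are placeholders, since the real schema is not shown here.

        -- Sketch only: assumes dbo.Calc has RecordNo INT IDENTITY, ParentRecordNo INT NULL,
        -- Hierarchy VARCHAR(900) and Active BIT (all hypothetical names).
        CREATE TRIGGER trg_Calc_Hierarchy ON dbo.Calc
        AFTER INSERT
        AS
        BEGIN
            SET NOCOUNT ON

            -- For copied rows, record the lineage as 'parent.child', e.g. '3000.3001'.
            UPDATE c
            SET    c.Hierarchy = CAST(i.ParentRecordNo AS VARCHAR(10)) + '.' +
                                 CAST(c.RecordNo AS VARCHAR(10))
            FROM   dbo.Calc c
                   JOIN inserted i ON i.RecordNo = c.RecordNo
            WHERE  i.ParentRecordNo IS NOT NULL

            -- Deactivate the original rows that were copied.
            UPDATE p
            SET    p.Active = 0
            FROM   dbo.Calc p
                   JOIN inserted i ON i.ParentRecordNo = p.RecordNo
        END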

    The performance problem occurs at steps 3 and 4 above. At first I thought it was caused by an insufficient amount of RAM and a single drive, so I boosted the RAM to 4GB and dedicated 2GB to SQL Server. In addition, I added a second drive which is used for log files only. After profiling, I discovered the cache hit ratio was above 85% and SQL Server was using much less than the assigned 2GB. Physical disk I/O was very high, but the modified timestamps on the data and log files did not change.

    I don't understand why there are high physical disk writes alongside a high cache hit ratio and less than 1GB of RAM used. Shouldn't SQL Server use all the assigned RAM? And if there is high disk I/O, why didn't the timestamps on the files change?

    The database is configured as follows:

    1. The database and log files are created with ample size on different disks, so auto growth is not required and data and log I/O do not hit the same drive.

    2. Auto close and auto shrink are disabled.

    3. The simple recovery model is used (see the settings sketch after this list).

    4. TRUNCATE is used to remove data from the bulk insert tables.

    5. The database is not configured for single user mode but only one user will be accessing the database.

    6. WinXP, SQL Server 2000 with latest SP and patches.
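
    In T-SQL terms, the configuration above corresponds roughly to the following settings. MyDB and dbo.BulkStage are placeholder names, not the real objects.

        -- Rough equivalent of the configuration listed above; names are placeholders.
        ALTER DATABASE MyDB SET AUTO_CLOSE OFF
        ALTER DATABASE MyDB SET AUTO_SHRINK OFF
        ALTER DATABASE MyDB SET RECOVERY SIMPLE

        -- Staging (bulk insert) tables are emptied with a minimally logged truncate.
        TRUNCATE TABLE dbo.BulkStage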

    Thanks in advance,

    srv

  • There are a few things you can check.

    Check the maximum memory allocated to SQL Server (in Enterprise Manager, look at the server properties).
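
    If you prefer T-SQL to Enterprise Manager, something like the following shows and sets the cap (the 2048 MB value is only an example):

        -- Show and (optionally) change the memory cap; the value is illustrative.
        EXEC sp_configure 'show advanced options', 1
        RECONFIGURE
        EXEC sp_configure 'max server memory (MB)'          -- current setting
        EXEC sp_configure 'max server memory (MB)', 2048    -- example cap of 2 GB
        RECONFIGURE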

    Updating the tables with triggers will definitely add more work for the server. Think about disabling the trigger and manually updating the hierarchy.
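
    A rough sketch of that approach (the trigger, table and column names are placeholders):

        -- Placeholders throughout: dbo.Calc, trg_Calc_Hierarchy, Hierarchy, RecordNo.
        ALTER TABLE dbo.Calc DISABLE TRIGGER trg_Calc_Hierarchy

        -- ... bulk copy the rows into the calculation table here ...

        -- Then, for example, initialise the hierarchy of the newly loaded rows
        -- to their own record number in one set-based pass.
        UPDATE dbo.Calc
        SET    Hierarchy = CAST(RecordNo AS VARCHAR(10))
        WHERE  Hierarchy IS NULL

        ALTER TABLE dbo.Calc ENABLE TRIGGER trg_Calc_Hierarchy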

    It is better to remove the indexes and re-create them after the data has been loaded.
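
    For instance (index, table and column names are placeholders; the clustered index can usually be left in place):

        -- Drop the non-clustered indexes on the load table before the copy ...
        DROP INDEX LoadTable.IX_LoadTable_Customer

        -- ... perform the INSERT INTO ... SELECT here ...

        -- ... then rebuild them afterwards.
        CREATE NONCLUSTERED INDEX IX_LoadTable_Customer
            ON dbo.LoadTable (CustomerID)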

    Another thing you can consider is breaking the INSERT statement into batches of a certain number of records. This will reduce some of the locking.
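
    One way to do that in SQL Server 2000 is to walk the staging table's identity column in ranges. All names and the 10,000-row batch size below are placeholders.

        DECLARE @BatchSize INT, @From INT, @MaxId INT

        SELECT @BatchSize = 10000,
               @From      = MIN(RecordNo),
               @MaxId     = MAX(RecordNo)
        FROM   dbo.BulkStage                         -- hypothetical staging table

        WHILE @From <= @MaxId
        BEGIN
            INSERT INTO dbo.LoadTable (Col1, Col2)   -- hypothetical target/columns
            SELECT Col1, Col2
            FROM   dbo.BulkStage
            WHERE  RecordNo BETWEEN @From AND @From + @BatchSize - 1

            SET @From = @From + @BatchSize
        END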

    Best Wishes!

    Cheers,
    Prithiviraj Kulasingham

    http://preethiviraj.blogspot.com/

  • Swap steps 1 and 2. Since you state that you are using staging tables (emptied by TRUNCATE TABLE) for bulk loading, move the BEGIN TRAN after the bulk loading.

    Even then, you will see an improvement if you move all of the remaining steps into a server-side operation such as a script or stored procedure, especially the transaction handling.
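
    A rough outline of such a procedure, using SQL Server 2000-style @@ERROR handling (all object names are placeholders):

        CREATE PROCEDURE dbo.usp_LoadAndCalc
        AS
        BEGIN
            SET NOCOUNT ON

            BEGIN TRAN

            -- Step 3: staging -> load tables
            INSERT INTO dbo.LoadTable (Col1, Col2)
            SELECT Col1, Col2 FROM dbo.BulkStage

            IF @@ERROR <> 0
            BEGIN
                ROLLBACK TRAN
                RETURN 1
            END

            -- Step 4: load -> calculation tables
            INSERT INTO dbo.CalcTable (Col1, Col2)
            SELECT Col1, Col2 FROM dbo.LoadTable

            IF @@ERROR <> 0
            BEGIN
                ROLLBACK TRAN
                RETURN 1
            END

            COMMIT TRAN
            RETURN 0
        END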

    You may also want to increase the size of the log file before this transaction. Even though you are in simple recovery mode, the log file will grow very large recording the transaction until it is committed or rolled back.
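
    For example (MyDB and MyDB_Log are placeholders for the real database and logical log file names, and 2 GB is just an illustration):

        -- EXEC sp_helpfile reports the logical file names for the current database.
        ALTER DATABASE MyDB
        MODIFY FILE (NAME = MyDB_Log, SIZE = 2048MB)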

    Andy

  • I just noticed your OS is XP. It is best to use a Server OS to take advantage of the /3GB switch in BOOT.INI; SQL Server 2000 Developer Edition will only use 2GB of RAM on XP, and XP is probably configured for "Programs".

    See: My Computer | Properties | Advanced | Performance | Settings | Advanced.

    Andy

  • Couple of things.

    a) Do the clustered indices help support user queries, or are they just there because of a historical decision? If the data is not being loaded in an order that minimises page splits, etc., then the clustered indices could be a hindrance.

    b) I presume your SELECT ... FROM yyy query is performing optimally, utilising any and all indices.

    c) I'd break the whole operation into 3 transactions - bulk insert, bulk load and calculation load - with a 'programmed backout script' to be used if a failure is detected in the 'overall business transaction'. Using the inbuilt SQL transaction functionality to cover the full action has an 'insurance' cost that seems to be too high. Trading a lower daily data-loading cost for a higher 'failure recovery' cost is a choice that has to be evaluated; a rough sketch follows.
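
    A very rough sketch of that shape, assuming a LoadBatchID column is added to tag each load so it can be backed out. All object names below are placeholders.

        DECLARE @LoadBatchID INT
        SET @LoadBatchID = 42                    -- example batch identifier

        -- Transaction 1 is the bulk insert into the staging tables (not shown).

        -- Transaction 2: staging -> load tables.
        BEGIN TRAN
            INSERT INTO dbo.LoadTable (LoadBatchID, Col1)
            SELECT @LoadBatchID, Col1 FROM dbo.BulkStage
        COMMIT TRAN

        -- Transaction 3: load -> calculation tables.
        BEGIN TRAN
            INSERT INTO dbo.CalcTable (LoadBatchID, Col1)
            SELECT LoadBatchID, Col1 FROM dbo.LoadTable
            WHERE  LoadBatchID = @LoadBatchID
        COMMIT TRAN

        -- Programmed backout script, run only if the overall business transaction fails:
        DELETE FROM dbo.CalcTable WHERE LoadBatchID = @LoadBatchID
        DELETE FROM dbo.LoadTable WHERE LoadBatchID = @LoadBatchID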

    d) Breaking steps 3 and 4 into multiple mini-steps, so that each commit point is smaller, is a suggestion I would concur with.

    e) It's important that your user understands the cost of 'staying as is' versus the cost of 'moving to a new data load method'; if the new process is proven to be faster (95-99% of the time), then moving to that option may be palatable.

    f) How often does a rollback happen at the moment?
