Introduction
Most enterprise-class systems, with the help of their interfacing systems, tend to accumulate huge amounts of data over time. Unchecked data growth and distribution increase risk as well as storage and management costs. Keeping years of production data within a single system also impacts service levels and disaster recovery initiatives. [1]
This leads to the unquestionable need to purge old data after archiving it for future use and/or for compliance purposes. Purging data from any system can be done in multiple ways, but not all of them are optimal. Recently, I was given an assignment to help a team at work that was struggling to stay within a pre-defined purging window for an inherited enterprise system undergoing enhancements. The exercise was very interesting, and I thought it would be worthwhile to share my research with the community and get feedback on the methods that you have found successful over the years. Today, I will demonstrate the approach I found most suitable for the assignment I had been given.
Delete Scenarios
Deletion of data is a very serious operation; once deleted, chances are the data is gone forever. Names, phone numbers, addresses, order information, stock status, and more will be lost as soon as the transaction is committed. We can classify all delete operations into the following two categories:
- Instantaneous delete - immediate; generally used when deleting a single record (for example, one created by a user error or a similar scenario)
- Batch delete - asynchronous; generally used when deleting a subset of data
Each of these can further be classified into:
- Soft delete - no permanent data removal, easily recoverable from user errors
- Hard delete - permanent data removal, not recoverable without DBA intervention (provided the recovery model supports it)
Over the years, a lot has been written and debated about the merits and demerits of soft vs. hard deletes, and hence I will not address either of them in this article.
My assignment required me to come up with solutions for deleting data from both a clustered table and a heap (a table without a clustered index). Another important constraint was that the database size must not increase as a result of the purge.
Building the Scenario
The basic scenario is that we have an external system that pumps in a huge amount of data - about 85,000 records/day, or approximately one record per second. Over time, we have gathered more than 75,000,000 rows, of which about 10,000,000 rows need to be purged.
The records to be removed may or may not be in sequence. This is determined by a number of factors, one of which is whether a particular set of records has already been archived, which in turn depends upon how frequently the records are used.
For this particular article, I will divide the exercise into two parts:
- I will use a random set for demonstrating deletes from a table and,
- A sequential set for demonstrating deletes from a heap
The Environment
As mentioned earlier, we had inherited this database. One of the first things that we noticed was that the database had default values for initial file size and growth. Running frequent, large-batch updates/deletes/inserts impacts overall database performance because:
- For all files, the free space initialization and allocation time increases with each new growth request
- For log files, a large number of VLF (Virtual Log Files) are created (http://www.sqlskills.com/BLOGS/KIMBERLY/post/Transaction-Log-VLFs-too-many-or-too-few.aspx)
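As a quick aside, one way to see how many VLFs the log currently has is the (undocumented but widely used) DBCC LOGINFO command; this is only a diagnostic sketch against the sample database used later in this article and is not part of the purge scripts.

--Diagnostic sketch only: DBCC LOGINFO returns one row per VLF,
--so the number of rows returned equals the number of VLFs in the current database's log
USE [DeleteRows]
GO
DBCC LOGINFO;
GO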
To better handle this, and to ensure that SQL Server spends less time requesting more disk space from the OS, we changed the initial file size and the growth increments.
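On an existing (inherited) database, this can be done without recreating the files. A minimal sketch is shown below; the logical file names and the size/growth values are assumptions borrowed from the test database created later in this article, so adjust them to your own environment.

--Sketch only: resize the existing files and set new growth increments on an inherited database
--(logical file names DeleteRows / DeleteRows_log are assumed)
USE [master]
GO
ALTER DATABASE [DeleteRows]
    MODIFY FILE (NAME = N'DeleteRows', SIZE = 1065984KB, FILEGROWTH = 1024KB);
ALTER DATABASE [DeleteRows]
    MODIFY FILE (NAME = N'DeleteRows_log', SIZE = 894336KB, FILEGROWTH = 25%);
GO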
Because the server that I will be running today's tests on is not a fancy one (in fact, it is one of my development boxes and does not even qualify as a production-grade machine), the initial file size and growth values will be set to the following:
- Data File size - around 1GB initial, 1MB growth increment
- Log File size - around 800MB initial, 2GB growth increment
Parameter | Value |
---|---|
Server OS | Windows Server 2008 (Standard) |
SQL Server version | SQL 2008, SQL 11 ("Denali") |
Server Type | Virtual Machine |
RAM | 1.5 GB |
Processor | Intel Core 2 Duo @ 2.80GHz |
Hard-drive type | Virtual ATA Drives |
Deleting Random Data from a Table
First of all, let's generate the test data required for our little experiment. Please note that throughout this article, the scripts mentioned have been tested on SQL Server 2008 and above (R2 and "Denali").
The steps that we will follow are:
- Create a test database
- Create a test table - note that because we expect a large number of records, we have the key value set as a BIGINT
- Create the test data - for my test, I generated about 75,000,000 rows
--Create a test database
USE [master]
GO
CREATE DATABASE [DeleteRows] ON PRIMARY
( NAME = N'DeleteRows',
  FILENAME = N'E:\Databases\DeleteRows\DeleteRowsData.mdf',
  SIZE = 1065984KB,          --Use an initial size higher than the default
  MAXSIZE = UNLIMITED,
  FILEGROWTH = 1024KB )
LOG ON
( NAME = N'DeleteRows_log',
  FILENAME = N'E:\Databases\DeleteRows\DeleteRowsLog.ldf',
  SIZE = 894336KB,           --Use an initial size higher than the default
  MAXSIZE = 2048GB,
  FILEGROWTH = 25% )
GO

--Create a test table
USE [DeleteRows]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[DeleteRowsTbl](
    [UnixDate] [BIGINT] IDENTITY(315532800,1) NOT NULL,
    [SomeName] [VARCHAR](100) NULL,
    PRIMARY KEY CLUSTERED
    (
        [UnixDate] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
            ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

--Generate some test data
USE [DeleteRows]
GO
INSERT INTO [dbo].[DeleteRowsTbl]
SELECT msc.name + CONVERT(VARCHAR(10), ROUND(RAND()*1000, 0))
FROM msdb.sys.objects mso (NOLOCK)
CROSS JOIN msdb.sys.columns msc (NOLOCK)
GO 40 -- GO is used as a batch iterator

--Check the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]

--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
       Name                 AS 'Logical File Name',
       Physical_Name        AS 'Physical File Name',
       (size*8)/1024        AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO
At the end of this exercise, I checked the size of my database. Please note the size statistics below:
Parameter | Value |
---|---|
# of records generated | 77,495,040
Size of Data file | 2694 MB |
Size of Log file | 1705 MB |
The Purge Strategy - Using Lookup tables
In order to delete random values, many novice database programmers would simply start dynamically generating a huge WHERE clause, which is overkill for the server. Working with clustered tables allows us to leverage the indexes defined on them, which means we can use a lookup table to drive the deletes.
The proposed concept is quite simple. The assumption is that the records to be purged are collected or "marked" over time and then the purge happens maybe once a week or so. The primary key for the records to be purged can be inserted into an indexed lookup table, which can then be used during the delete.
Let us create our lookup table, and populate it with some data, simulating the collection of records being marked for purge.
USE [DeleteRows]
GO
SET NOCOUNT ON

CREATE TABLE MyDeleteLookup (UnixDateToDelete BIGINT)
GO

BEGIN TRANSACTION
GO
CREATE NONCLUSTERED INDEX IX_MyDeleteLookup
    ON dbo.MyDeleteLookup ( UnixDateToDelete )
    WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
          ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
ALTER TABLE dbo.MyDeleteLookup SET (LOCK_ESCALATION = TABLE)
GO
COMMIT

--Populate the lookup table
INSERT INTO MyDeleteLookup
SELECT UnixDate
FROM DeleteRowsTbl
WHERE UnixDate LIKE '%10%'

--Check how many records we will be purging
SELECT COUNT(*) FROM dbo.MyDeleteLookup
For my test, the lookup table was populated with 4,591,769 rows (roughly 6% of the available data). Now, let's use this lookup table to purge the respective rows from the parent table.
Running the Purge
Purging data from the parent table is now a simple matter of joining it with the lookup table and executing the delete.
--Run the Purge
USE [DeleteRows]
GO
SELECT GETDATE() AS 'Start DateTime'

DELETE Mytbl
FROM [dbo].[DeleteRowsTbl] Mytbl
INNER JOIN MyDeleteLookup Mylkp
    ON Mytbl.UnixDate = Mylkp.UnixDateToDelete

SELECT GETDATE() AS 'End DateTime'

--Check the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]

--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
       Name                 AS 'Logical File Name',
       Physical_Name        AS 'Physical File Name',
       (size*8)/1024        AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO
Let's take a quick look at the delete statistics:
Parameter | Value |
---|---|
Data file size (Before) | 3257 MB |
Log file size (Before) | 9254 MB |
Time taken for the delete | 10 min, 1 sec |
Data file size (After) | 3257 MB |
Log file size (After) | 9254 MB |
On the virtual hardware running my test, I believe the 10 min, 1 sec taken for the purge is quite reasonable and efficient. If we observe the execution plan, we will notice that this is only possible because of the lookup table.
Also, one thing that we notice immediately is that the "before" and "after" sizes of the database remained the same. For the log file, this is because I had specified a high initial size and growth increment, so enough contiguous free space was already available to log the delete operation. As for the data file, a delete does not release physical space back to the OS, and hence the physical file size remains the same as before.
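If you want to confirm how much of that space is now simply unused inside the files (rather than released), a quick check with sp_spaceused works well; the sketch below is a diagnostic only, run against the sample objects used in this article.

--Diagnostic sketch only: compare reserved vs. unused space after the purge
USE [DeleteRows]
GO
EXEC sp_spaceused @updateusage = N'TRUE';   --database-level summary (size, unallocated space)
EXEC sp_spaceused N'dbo.DeleteRowsTbl';     --table-level: reserved, data, index size and unused space
GO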
Deleting Data from a Heap (non-clustered table)
Sometimes, especially in the case of data coming over from non-SQL systems (or non-RDBMS sources in general), the information being exchanged arrives as a heap of data. As per MSDN, "A heap table, by definition, is a table that doesn't have any clustered indexes. Different pages of the heap-based table occupy different non-contiguous areas on a disk, and they are not linked together in any way." In our case, we had one such legacy interface sending data in as a heap. Because of multiple other dependencies, it was not possible to change the design so that we could define a clustered index on the heap.
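As a side note, you can confirm that a table really is a heap by looking at sys.indexes, where a heap appears with index_id = 0; the query below is only an illustrative sketch against the sample table created later in this section.

--Sketch only: a heap has a row in sys.indexes with index_id = 0 (type_desc = 'HEAP')
USE [DeleteRows]
GO
SELECT OBJECT_NAME(object_id) AS 'Table Name',
       type_desc              AS 'Index Type'
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.DeleteRowsTbl')
  AND index_id = 0;
GO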
Deleting data from a heap is much more difficult because we (and even SQL Server) do not have a clustered index to take advantage of. The lookup table method demonstrated above can still be used to delete random data, and while it will definitely help, it won't be as performant as it is with a clustered table.
One of the first recommendations we made to the customer was to work with the vendor to allow us to add a primary key column to the heap. However, that will take some time, so let's not worry about it right now.
When working with a heap, what becomes most important is to free up the table at regular intervals so as not to increase the bottlenecks to an unacceptable level. The most common way to do so is to limit the number of records being processed at any given point in time.
SET ROWCOUNT - the most common solution
In order to delete a range, normally we would have used the SET ROWCOUNT number option.
This solution has one major drawback - it is a deprecated feature in SQL Server 2008 and will not affect DELETE, INSERT, and UPDATE statements in the next release of SQL Server (SQL11/"Denali"). Hence, any new solutions should NOT be using this feature.
(Reference: http://msdn.microsoft.com/en-us/library/ms188774.aspx)
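For context only, the deprecated pattern looks roughly like the sketch below; it is written against the sample table used in this article and should not be used in new code.

--For context only: the deprecated SET ROWCOUNT batching pattern (do not use in new code)
USE [DeleteRows]
GO
SET ROWCOUNT 1000000;                --limit each DELETE to 1,000,000 rows

WHILE 1 = 1
BEGIN
    DELETE FROM dbo.DeleteRowsTbl
    WHERE UnixDate BETWEEN 320000000 AND 323874752;

    IF @@ROWCOUNT = 0 BREAK;         --nothing left in the range
END

SET ROWCOUNT 0;                      --always reset so later statements are not limited
GO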
TOP - the alternative to SET ROWCOUNT
In order to limit the number of rows being processed, Microsoft has another offering - the TOP clause (with an optional PERCENT argument). You may read more about it here:
Limiting Result Sets by TOP and PERCENT
Essentially, the prime difference between TOP and ROWCOUNT is that the query optimizer can use the value of the expression in the TOP clause when generating an execution plan for the query. Let's see how we can delete large amounts of data from a heap using the TOP clause.
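In its simplest form, a batched delete with TOP looks like the sketch below; this is only an illustration that anticipates the full parameterized script later in this section, with the batch size and range values assumed from that script.

--Sketch only: delete at most 1,000,000 rows from the heap in a single statement
USE [DeleteRows]
GO
DELETE TOP (1000000)
FROM dbo.DeleteRowsTbl
WHERE UnixDate BETWEEN 320000000 AND 323874752;
GO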
Building the test data
The entire script to start from a new database is provided below (the number of rows we will be purging is similar to the earlier test, as can be seen from the statistics after the script).
--Create the test database
USE [master]
GO
CREATE DATABASE [DeleteRows] ON PRIMARY
( NAME = N'DeleteRows',
  FILENAME = N'E:\Databases\DeleteRows\DeleteRowsData.mdf',
  SIZE = 1065984KB,          --Use an initial size higher than the default
  MAXSIZE = UNLIMITED,
  FILEGROWTH = 1024KB )
LOG ON
( NAME = N'DeleteRows_log',
  FILENAME = N'E:\Databases\DeleteRows\DeleteRowsLog.ldf',
  SIZE = 894336KB,           --Use an initial size higher than the default
  MAXSIZE = 2048GB,
  FILEGROWTH = 25% )
GO

--Create a test non-clustered table, i.e. a heap
USE [DeleteRows]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
--Note that this is a heap - no clustered index!
CREATE TABLE [dbo].[DeleteRowsTbl](
    [UnixDate] [BIGINT] IDENTITY(315532800,1) NOT NULL,
    [SomeName] [VARCHAR](100) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO

--Generate some test data
USE [DeleteRows]
GO
INSERT INTO [dbo].[DeleteRowsTbl]
SELECT msc.name + CONVERT(VARCHAR(10), ROUND(RAND()*1000, 0))
FROM msdb.sys.objects mso (NOLOCK)
CROSS JOIN msdb.sys.columns msc (NOLOCK)
GO 40 -- GO is used as a batch iterator

--Check the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]

--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
       Name                 AS 'Logical File Name',
       Physical_Name        AS 'Physical File Name',
       (size*8)/1024        AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO
Running the purge
Now, let us run the purge using the TOP clause. Note that, as mentioned at the beginning of the article, we will be purging a range of records for this example.
--Run the purge
USE [DeleteRows]
GO
PRINT N'Starting Delete at: ' + CONVERT(NVARCHAR(25), GETDATE())
GO
DECLARE @MinRangeValue BIGINT
DECLARE @MaxRangeValue BIGINT
DECLARE @tIterationLimit INT
DECLARE @tRowCount INT
DECLARE @tBatchCounter INT
DECLARE @stmnt NVARCHAR(200)
DECLARE @params NVARCHAR(200)

SET @stmnt = N'DELETE TOP (@tIterationLimit) FROM DeleteRowsTbl WHERE UnixDate BETWEEN @tMinValue AND @tMaxValue'
SET @params = N'@tIterationLimit INT, @tMinValue BIGINT, @tMaxValue BIGINT'

SET @MinRangeValue = 320000000   --Some start value (valid)
SET @MaxRangeValue = 323874752   --Some end value (valid)
SET @tIterationLimit = 1000000   --Remove only 1000000 rows at a time
SET @tRowCount = 9999999
SET @tBatchCounter = 0

WHILE (@tRowCount > 0)
BEGIN
    BEGIN TRANSACTION DeleteRecords

    EXEC sp_executesql @stmnt, @params,
         @tIterationLimit = @tIterationLimit,
         @tMinValue = @MinRangeValue,
         @tMaxValue = @MaxRangeValue;
    --Standard error handling (TRY...CATCH, etc.) must go here

    SET @tRowCount = @@ROWCOUNT
    SET @tBatchCounter += 1

    COMMIT TRANSACTION DeleteRecords

    --You may introduce a little wait here so that the heap is free for other operations.
    --In this example, we won't because nothing's waiting for this heap
    --WAITFOR DELAY '00:05'
END
GO
PRINT N'Ending Delete at: ' + CONVERT(NVARCHAR(25), GETDATE())
GO
--Determine the database size
USE [DeleteRows]
GO
--Confirm that the record count is around the number we need
SELECT COUNT(*) FROM [dbo].[DeleteRowsTbl]

--Check the DB size
SELECT DB_NAME(database_id) AS 'Database Name',
       Name                 AS 'Logical File Name',
       Physical_Name        AS 'Physical File Name',
       (size*8)/1024        AS 'Database Size (In MB)'
FROM sys.master_files
WHERE DB_NAME(database_id) = DB_NAME()
GO

--The most important step of all: clean up the environment
USE master
GO
ALTER DATABASE DeleteRows SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE DeleteRows;
GO
Parameter | Value |
---|---|
Data File Size (Before) | 2678 MB |
Log File Size (Before) | 1364 MB |
# of rows deleted | 3,874,752 |
Time taken for the purge | 6 minutes |
Data File Size (After) | 2678 MB |
Log File Size (After) | 1364 MB |
As an alternative to using a fixed TOP (n) batch size, you may also want to experiment with dynamically modifying the minimum and maximum values of the range as a sliding window (i.e., the first iteration of the delete covers records n to n+m, the next covers n+m to n+m+p, and so on).
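A minimal sketch of that sliding-window idea is shown below; the window size and the range boundaries are assumptions borrowed from the earlier script, not values from the original assignment.

--Sketch only: slide a [window start, window end) range across the key space,
--deleting one window per iteration instead of using TOP
USE [DeleteRows]
GO
DECLARE @WindowStart BIGINT = 320000000   --assumed start of the purge range
DECLARE @RangeEnd    BIGINT = 323874752   --assumed end of the purge range
DECLARE @WindowSize  BIGINT = 1000000     --keys covered per iteration (assumed)

WHILE (@WindowStart <= @RangeEnd)
BEGIN
    DELETE FROM dbo.DeleteRowsTbl
    WHERE UnixDate >= @WindowStart
      AND UnixDate < @WindowStart + @WindowSize
      AND UnixDate <= @RangeEnd;

    SET @WindowStart += @WindowSize;      --move the window forward
END
GO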
Do non-clustered indexes on a heap help?
As I was researching this and discussing it with some of my friends, one of them thought out loud - we cannot have a clustered index on our heap, but can we have a non-clustered index that would help speed up the delete?
What better way to answer the question than to run a simple test? I started from scratch, and just after Step #2 (creation of the table), I applied the following change and then ran the rest of the delete script to find out.
--Create a non-clustered index
--(ADDITIONAL TEST ONLY)
IF EXISTS (SELECT name
           FROM sys.indexes
           WHERE name = N'Idx_NC_DeleteRowsTbl'
             AND object_id = OBJECT_ID(N'dbo.DeleteRowsTbl'))
    DROP INDEX Idx_NC_DeleteRowsTbl ON dbo.DeleteRowsTbl;
GO
CREATE NONCLUSTERED INDEX Idx_NC_DeleteRowsTbl
    ON [dbo].[DeleteRowsTbl] (UnixDate)
After the non-clustered index was set up and the test data generated, we ran the purge. The time taken for the purge was now 4 minutes!
Conclusion
For a clustered table, the lookup table approach is tried and tested - our code has now been running in production for the last couple of months, and the purging window has dropped by more than half of what it was previously.
As for the heap, besides the obvious storage needs for the index and the associated maintenance costs, the addition of a non-clustered index definitely helps when deleting records. A combination of a non-clustered index and the TOP clause makes deletions from heaps a lot faster, and that is what we went ahead with.
References
[1] Christy Pettey, Ben Tudor, "Data Growth as the Largest Data Center Infrastructure Challenge", Gartner Survey, http://www.gartner.com/it/page.jsp?id=1460213
[2] Robert Sheldon, "The DELETE statement in SQL Server", Simple Talk Publishing, http://www.simple-talk.com/sql/learn-sql-server/the-delete-statement-in-sql-server/
Before we say good-bye
Before we say good-bye, I would welcome your feedback on this article. More importantly, do you face any need to delete large quantities of data from a VLDB? Do leave your inputs in the discussion associated with this article.
If you liked this one, I would also request you to visit http://www.sqlservercentral.com/articles/Connection+Strings/71669/ to read my introduction to connection strings and their importance. In that article, I also provide a simple quick-reference sheet that you can use to write your own connection strings. I hope you find it useful.
Until we meet next time,
Be courteous. Drive responsibly.