I have a very large table and I'm looking to do a cleanup based upon a non-keyed field. This table many concurrent insert transactions running against it, and I can't afford downtime to create an Index this field. Looking for some suggestions using the PK(ID) to help speed up the deletes and minimize Locking on the table.
Right now driver for cleanup is something like this:
delete from Image_Classification_Master
where
convert(date,SpoolStartDt) = '02/22/2022'
Thanks.
CREATE TABLE [dbo].[Image_Classification_Master](
[Id] [int] IDENTITY(1,1) NOT NULL,
[ImageName] [nvarchar](500) NULL,
[ImageType] [nvarchar](200) NULL,
[ImageSource] [image] NULL,
[ReceivedDateTime] [datetime] NULL,
[ImagePath] [nvarchar](500) NULL,
[Site] [varchar](3) NULL,
[MachineNbr] [int] NULL,
[LineNbr] [int] NULL,
[TakeUpNbr] [int] NULL,
[SpoolNbr] [int] NULL,
[ImageIndex] [int] NULL,
[CameraNbr] [varchar](2) NULL,
[SpoolStartDt] [datetime] NULL,
[SpoolStartTime] [time](7) NULL,
[DefectDate] [datetime] NULL,
[DefectTime] [time](7) NULL,
[DefectNbr] [int] NULL,
[DefectClass] [varchar](100) NULL,
[Reviewer] [nvarchar](50) NULL,
[UserDefectInput] [nvarchar](10) NULL,
CONSTRAINT [PK_Image_Classification_Master] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
ALTER TABLE [dbo].[Image_Classification_Master] ADD CONSTRAINT [DF_Image_Classification_Master_ReceivedDateTime] DEFAULT (getdate()) FOR [ReceivedDateTime]
GO
November 15, 2024 at 3:31 pm
Oof. Even if you put an index on the SpoolSrtDt column, because you have a function to convert the value you're going to get scans anyway. Why do that? It's a datetime column. Compare it to a datetime value. Then, an index could help. Otherwise, you're just looking at scans and no way around that. Because the clustered index is ID, that's what must be used to delete values. No getting around it. So, they're either found through a scan of the clustered index, or, you build an index on the appropriate column (which will absolutely have some affect on the system, unless you're running Enterprise, then you can do an online index creation) and pay that cost so it doesn't have to do the table scan. No magic way around this really.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
November 15, 2024 at 5:48 pm
@ Bruin,
Which edition of 2016 do you have? Standard or Enterprise?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 15, 2024 at 5:52 pm
@ Bruin,
Which edition of 2016 do you have? Standard or Enterprise?
Also, are there any FKs that are pointing AT this table?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 15, 2024 at 6:12 pm
IF you're on Enterprise Edition, you can do an online build of an index for SpoolStartDt. Then you can use that index to do the DELETEs. If you're not on Enterprise, the index build would lock up the table and you should not try to create the index offline.
Also, you don't have to do a whole day at a time, if the query plan shows a scan or a full day takes too long to process, with all the other activity on the table. For example, here's an approach that would delete 4 hours' worth of old activity at a time.
CREATE UNIQUE NONCLUSTERED INDEX [IX_Image_Classification_Master] ON dbo.Image_Classification_Master ( SpoolStartDt, Id )
WITH ( DATA_COMPRESSION = ROW, FILLFACTOR = 95, ONLINE = ON, SORT_IN_TEMPDB = ON );
DELETE FROM dbo.Image_Classification_Master
WHERE SpoolStartDt >= '20220222 00:00' AND SpoolStartDt < '20220222 04:00'
DELETE FROM dbo.Image_Classification_Master
WHERE SpoolStartDt >= '20220222 04:00' AND SpoolStartDt < '20220222 08:00'
DELETE FROM dbo.Image_Classification_Master
WHERE SpoolStartDt >= '20220222 08:00' AND SpoolStartDt < '20220222 12:00'
DELETE FROM dbo.Image_Classification_Master
WHERE SpoolStartDt >= '20220222 12:00' AND SpoolStartDt < '20220222 16:00'
DELETE FROM dbo.Image_Classification_Master
WHERE SpoolStartDt >= '20220222 16:00' AND SpoolStartDt < '20220222 20:00'
DELETE FROM dbo.Image_Classification_Master
WHERE SpoolStartDt >= '20220222 20:00' AND SpoolStartDt < '20220223 00:00'
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
November 15, 2024 at 7:11 pm
When performing deletions on large tables, it is essential to delete rows using the clustered index (in this case, PK(Id)) and process the deletions in small batches. This approach ensures:
Efficiency: The clustered index provides a natural order for the rows in the table, making lookups and deletions faster. Deleting rows directly on the clustered index leverages this ordering, avoiding costly table scans and reducing overall I/O operations.
Minimised Locking: Deleting rows in small batches (e.g., 500 or 1000 rows at a time) ensures that locks on the table are held for shorter durations. This reduces the risk of blocking concurrent transactions that insert or update rows in the same table.
Reduced Transaction Log Impact: Large deletions in a single transaction can overwhelm the transaction log, especially if the table contains millions of rows. Batch processing limits the size of each transaction, keeping the transaction log manageable and ensuring that log backups and recovery operations remain efficient.
System Performance: Small batches prevent long-running transactions, which can degrade system performance and increase the likelihood of deadlocks. By committing each batch independently, the system can continue to process other workloads smoothly.
Always delete rows using the PK(Id) or another clustered index to optimise performance.
Limit the batch size to 500-2000 rows depending on the system's workload and the size of the table.
Monitor the process to ensure that the deletion batches are completing within acceptable time limits and adjust the batch size if necessary.
-- Declare batch size
DECLARE @BatchSize INT = 1000;
DECLARE @RowsAffected INT = 1;
-- Temporary table to hold IDs of rows to be deleted
DROP TABLE IF EXISTS #ToDelete;
CREATE TABLE #ToDelete (Id INT PRIMARY KEY CLUSTERED);
-- Temporary table to hold batch IDs
DROP TABLE IF EXISTS #BatchToDelete;
CREATE TABLE #BatchToDelete (Id INT PRIMARY KEY CLUSTERED);
-- Step 1: Insert IDs of rows to be deleted into the temporary table
INSERT INTO #ToDelete (Id)
SELECT Id
FROM dbo.Image_Classification_Master WITH(NOLOCK)
WHERE CONVERT(DATE, SpoolStartDt) = '2022-02-22'
OPTION (MAXDOP 1);
-- Step 2: Loop to delete rows in batches
WHILE @RowsAffected > 0 BEGIN
-- Begin transaction to ensure consistency
BEGIN TRANSACTION;
-- Populate the batch table with the next set of IDs to delete
;WITH cte AS
(
SELECT TOP (@BatchSize) ID
FROM #ToDelete
)
DELETE FROM cte
OUTPUT deleted.ID
INTO #BatchToDelete(ID);
-- Perform the deletion in the main table using EXISTS
DELETE dbo.Image_Classification_Master
WHERE EXISTS (SELECT 1
FROM #BatchToDelete b
WHERE b.Id = Image_Classification_Master.Id);
-- Get the number of rows affected in this batch
SET @RowsAffected = @@ROWCOUNT;
-- Clear the batch table
DELETE FROM #BatchToDelete;
-- Commit transaction
COMMIT TRANSACTION;
-- Optional: wait for delay (100ms) to give other processes using the system a chance
WAITFOR DELAY '00:00:00:100'
-- Optional: Print progress
RAISERROR('Deleted %d rows in this batch.', 0, 1, @RowsAffected) WITH NOWAIT;
END
-- Cleanup
DROP TABLE IF EXISTS #ToDelete, #BatchToDelete;
PRINT 'Deletion completed.';
November 15, 2024 at 7:15 pm
Looks like I need both of these fields for startdt and starttime
[SpoolStartDt] [datetime] NULL,
[SpoolStartTime] [time](7) NULL,
example:
2022-02-23 00:00:00.000
16:41:00.0000000
November 15, 2024 at 8:29 pm
Jonathan can your example include 2 fields I mention above? so I could use a 4hr window as also suggested.
Thanks.
Certainly! Simply include the criteria in the WHERE clause of the SELECT statement that retrieves the IDs.
INSERT INTO #ToDelete (Id)
SELECT Id
FROM dbo.Image_Classification_Master
WHERE CONVERT(DATE, SpoolStartDt) = '2022-02-22';
AND SpoolStartTime BETWEEN '16:00:00.0000000' AND '20:00:00.0000000';
How many rows do you need to delete?
November 16, 2024 at 1:15 pm
This is throwing an error:
-- Populate the batch table with the next set of IDs to delete
DELETE TOP (@BatchSize)
OUTPUT DELETED.Id INTO #BatchToDelete
FROM #ToDelete;
Msg 102, Level 15, State 1, Line 27
Incorrect syntax near 'DELETED'.
This is throwing an error:
-- Populate the batch table with the next set of IDs to delete DELETE TOP (@BatchSize) OUTPUT DELETED.Id INTO #BatchToDelete FROM #ToDelete;
Msg 102, Level 15, State 1, Line 27 Incorrect syntax near 'DELETED'.
Use this instead:
;WITH cte AS
(
SELECT TOP (@BatchSize) ID
FROM #ToDelete
)
DELETE
FROM cte
OUTPUT deleted.ID
INTO #BatchToDelete(ID);
November 16, 2024 at 2:22 pm
Perfect .. works great ..
Many Thanks...
November 16, 2024 at 2:40 pm
Perfect .. works great ..
Many Thanks...
I’ve revised my original response to correct the delete statement - perhaps it could serve as an example.
I’ve also added WITH (NOLOCK)
on the ID selection to make the operation even less intrusive on the production environment.
Using this method, you could process a day at a time. Be mindful of the batch size to ensure that each batch runs efficiently, ideally completing within 1 to 3 seconds. To further minimise any potential impact on online users, you might consider adding a small WAITFOR DELAY
between batches, such as:
WAITFOR DELAY '00:00:00:100'
This introduces a 100 ms pause between batches, giving online application users more of a chance to work without being affected by the archiving process.
Jeff raised a good point about foreign keys on other tables pointing to the one you’re deleting from. It’s important to check the behaviour configured for the foreign keys:
Additionally, ensure appropriate indexes exist on the foreign key columns to optimise performance and avoid issues during deletion.
November 16, 2024 at 5:32 pm
Thanks again this table is standalone ..
November 16, 2024 at 6:04 pm
How many rows are added to the table each day? And how many rows are currently in the table?
Viewing 15 posts - 1 through 15 (of 21 total)
You must be logged in to reply to this topic. Login to reply