Deleting Millions of Rows from Multiple Tables

Question

Post reply

Deleting Millions of Rows from Multiple Tables

homebrew01

SSC Guru

Points: 55548
More actions
May 1, 2009 at 10:19 am

#129755

We need to archive data from our production database. The data to be archived is driven by a campaign code, and there are about 4000 campaign codes to be removed. Each campaign code has associated records in 6 other tables, a few of which are huge. I expect a total of 100,000,000 records to be deleted among all the campaigns. The vendor has a stored procedure written that is designed to delete just 1 campaign at a time via a parameter of 1 campaign code, and join to each related table and delete the appropriate records.
I'm wondering what the best way is to utilize this code, and use it to delete all 4000 campaings. I have a work table with the 4000 codes in it. I'd also like the ability to cancel the job, then pick up where I left off next time, so I'll need to set a flag in the work table when a campaign has been deleted.
One idea that comes to mind is to use a cursor to read through the 4000 campaings one at a time, and pass each campaign code to the stored procedure for deletion, set the "completed" flag, then get the next campaign. I suspect there's a better way that will avoid the evil cursor !
I'll also have to consider the impact on the transaction log.
TIA

Viewing 15 posts - 1 through 15 (of 32 total)

You must be logged in to reply to this topic. Login to reply

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 1

Can you post the generalized schema of the tables, indexes, RI (if any), some sample data (readily consumable)?

This looks like fun.

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 2

Any additional criteria for the delete would also be helpful.

GSquared SSC Guru Points: 260824 More actions · Answer 3

For something like that, I probably would use a cursor to step through the codes.

To avoid blowing up the tran log, I'd probably add a log backup step after every X iterations, though you'd have to make sure you had disk space for the backups if you do that.

That would allow for "stop and pick up again" type operations.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

homebrew01 SSC Guru Points: 55548 More actions · Answer 4

Here are the main table schemas (sorry for all the bandwidth - I tried to de-clutter it). The stored procedure is written to use a cursor for the table deletes in "nolock" mode, or to delete in blocks of "top 500" and loop in "FAST" mode. I can post data too, but it might be a bit wide.... or should it be an attachnemt ?

/****** Object: Table Campaigns -- 6,843 Total Rows ******/

.

CREATE TABLE [Campaigns](

[ID] [uniqueidentifier] ROWGUIDCOL NOT NULL CONSTRAINT [DF_Campaigns_ID] DEFAULT (newid()),

[Code] [char](20) NOT NULL,

[Name] [varchar](50) NULL,

[Type] [char](1) NOT NULL,

[CustomData] [varchar](1024) NOT NULL,

[PhoneField] [char](25) NOT NULL CONSTRAINT [DF_Campaigns_PhoneField] DEFAULT ('Phone'),

[ExternalIdField] [char](25) NULL CONSTRAINT [DF_Campaigns_ExternalIdField] DEFAULT ('IntCode'),

[DateFrom] [datetime] NULL,

[DateTo] [datetime] NULL,

[TsrLayout] [char](10) NULL,

[ScriptId] [char](10) NULL,

[VerLayout] [char](10) NULL,

[CallRulesId] [uniqueidentifier] NOT NULL,

[CustomerId] [uniqueidentifier] NOT NULL,

[InventoryField] [char](25) NULL CONSTRAINT [DF_Campaigns_InventoryField] DEFAULT ('State'),

[VerificationTo] [datetime] NULL,

[NameField] [varchar](255) NULL CONSTRAINT [DF_Campaigns_NameField] DEFAULT ('Name'),

[PredictionId] [uniqueidentifier] NULL,

.

CONSTRAINT [PK_Campaigns] PRIMARY KEY NONCLUSTERED

([ID] ASC) ON [PRIMARY]) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_Campaigns_Code] ON [Campaigns]

([Code] ASC) ON [INDEXES]

.

/****** Object: ForeignKey [FK_Campaigns_CallRules] ******/

ALTER TABLE [Campaigns] WITH NOCHECK ADD CONSTRAINT [FK_Campaigns_CallRules] FOREIGN KEY([CallRulesId])

REFERENCES [CallRules] ([ID])

ALTER TABLE [Campaigns] NOCHECK CONSTRAINT [FK_Campaigns_CallRules]

.

/****** Object: Table CallLogDetail 87,673,640 Total Rows ******/

CREATE TABLE [CallLogDetail](

[CallLog_FK_ID] [uniqueidentifier] NOT NULL,

[Start] [datetime] NOT NULL,

[Finish] [datetime] NOT NULL,

[Agent] [char](38) NOT NULL,

[Action] [char](3) NOT NULL,

[ActionFailed] [char](1) NOT NULL,

[Params] [varchar](1024) NULL,

[Notes] [varchar](1024) NULL

) ON [PRIMARY]

.

CREATE NONCLUSTERED INDEX [IX_CallLogDetail] ON [CallLogDetail]

([CallLog_FK_ID] ASC) ON [INDEXES]

CREATE NONCLUSTERED INDEX [IX_CallLogDetail_Start] ON [CallLogDetail]

([Start] ASC) ON [INDEXES]

.

/****** Object: Table CallMaster -- 70,292,576 Total Rows ******/

CREATE TABLE [CallMaster](

[Campaign_ID_FK] [uniqueidentifier] NOT NULL,

[Customer_ID] [char](20) NOT NULL,

[DoNotCall] [char](1) NOT NULL CONSTRAINT [DF_CallMaster_DoNotCall] DEFAULT ('N'),

[CallAfter] [datetime] NOT NULL CONSTRAINT [DF_CallMaster_CallAfter] DEFAULT (getdate()),

[LastResult] [uniqueidentifier] NULL,

[LastCount] [int] NOT NULL CONSTRAINT [DF_CallMaster_LastCount] DEFAULT (0),

[LastCalled] [datetime] NOT NULL CONSTRAINT [DF_CallMaster_LastCalled] DEFAULT (getdate()),

[Attempts] [int] NOT NULL CONSTRAINT [DF_CallMaster_Attempts] DEFAULT (0),

[TimeZone] [char](10) NOT NULL,

[Name] [char](50) NULL,

[Version] [timestamp] NOT NULL,

[LastVerified] [datetime] NULL,

[DialingLaws] [varchar](50) NULL,

[Callback] [char](1) NULL,

[ReserveCSR] [char](1) NULL,

[TargetCSR] [varchar](38) NULL,

[InCache] [datetime] NULL,

[WhatNumber] [int] NULL,

[PrefDialOrder] [varchar](50) NULL,

CONSTRAINT [PK_CallMaster] PRIMARY KEY NONCLUSTERED

([Campaign_ID_FK] ASC,

[Customer_ID] ASC) ON [INDEXES]) ON [PRIMARY]

.

CREATE NONCLUSTERED INDEX [IX_CallMaster_Attempts] ON [CallMaster]

([Campaign_ID_FK] ASC,

[Attempts] ASC) ON [INDEXES]

CREATE NONCLUSTERED INDEX [IX_CallMaster_CallAfter] ON [CallMaster]

([Campaign_ID_FK] ASC,

[DoNotCall] ASC,

[Callback] ASC,

[CallAfter] ASC) ON [INDEXES]

CREATE NONCLUSTERED INDEX [IX_CallMaster_CustomerID] ON [CallMaster]

([Customer_ID] ASC) ON [PRIMARY]

.

/****** Object: Table CallbackActivity -- 12,026,593 Total Rows ******/

CREATE TABLE [CallbackActivity](

[CallLogId] [uniqueidentifier] NOT NULL,

[Start] [datetime] NOT NULL,

[Resolved] [char](1) NOT NULL,

[GivenUp] [char](1) NOT NULL,

[TargetTime] [datetime] NOT NULL,

[TargetCSR] [varchar](38) NULL,

CONSTRAINT [PK_CallbackActivity] PRIMARY KEY CLUSTERED

([CallLogId] ASC) ON [PRIMARY]) ON [PRIMARY]

.

/****** Object: Table StatCell -- 9,659,422 Total Rows ******/

CREATE TABLE [StatCell](

[Id] [uniqueidentifier] ROWGUIDCOL NOT NULL,

[Start] [datetime] NOT NULL,

[CampaignId] [uniqueidentifier] NULL,

[QueueId] [uniqueidentifier] NULL,

[AgentId] [varchar](38) NULL,

[Dnis] [varchar](50) NULL,

[CustomField] [varchar](100) NULL,

[CallType] [char](1) NULL,

[CallSubType] [char](1) NULL,

[Tier] [int] NULL,

[TierPos] [char](1) NULL,

[OrgAgentId] [varchar](38) NULL,

[Skill] [varchar](50) NULL,

[End] [datetime] NULL,

CONSTRAINT [PK_StatCell] PRIMARY KEY CLUSTERED

([Id] ASC) ON [PRIMARY]) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [IX_StatCell_Start] ON [StatCell]

([Start] ASC,

[CampaignId] ASC,

[AgentId] ASC,

[CallType] ASC,

[CallSubType] ASC,

[Tier] ASC) ON [PRIMARY]

.

/****** Object: Table CallLog --145,531,505 Total Rows ******/

CREATE TABLE [CallLog](

[ID] [uniqueidentifier] ROWGUIDCOL NOT NULL CONSTRAINT [DF_CallLog_ID] DEFAULT (newid()),

[Campaign_ID] [uniqueidentifier] NOT NULL,

[Customer_ID] [char](20) NOT NULL,

[InitialQueue] [uniqueidentifier] NOT NULL,

[Start] [datetime] NOT NULL,

[VoiceStart] [datetime] NOT NULL,

[VoiceEnd] [datetime] NOT NULL,

[Released] [datetime] NOT NULL,

[DialerCode] [char](5) NOT NULL,

[Disposition] [uniqueidentifier] NULL,

[CreditTo] [char](38) NOT NULL,

[Recording] [image] NULL,

[WaitTime] [datetime] NULL,

[SeqNumber] [int] NULL,

[CallType] [char](1) NULL,

[CallSubType] [char](1) NULL,

CONSTRAINT [PK_CallLog] PRIMARY KEY NONCLUSTERED

([ID] ASC) ON [INDEXES]) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

CREATE NONCLUSTERED INDEX [Customer_ID] ON [CallLog]

([Customer_ID] ASC) ON [INDEXES]

CREATE NONCLUSTERED INDEX [IX_CallLog_CreditTo] ON [CallLog]

([Campaign_ID] ASC,

[CreditTo] ASC,

[Start] ASC) ON [INDEXES]

CREATE NONCLUSTERED INDEX [IX_CallLog_Disposition] ON [CallLog]

([Campaign_ID] ASC,

[Disposition] ASC,

[Start] ASC) ON [INDEXES]

CREATE NONCLUSTERED INDEX [IX_CallLog_FK] ON [CallLog]

([Campaign_ID] ASC,

[Customer_ID] ASC) ON [INDEXES]

CREATE NONCLUSTERED INDEX [IX_CallLog_Start] ON [CallLog]

([Start] ASC) ON [INDEXES]

David Benoit SSC-Dedicated Points: 34562 More actions · Answer 5

This definitely seems like a "case for cursors". 🙂 They're not always bad, especially for maintenance tasks. As GSquared stated you could execute log backups periodically if needed to avoid log blowup. Using dbcc sqlperf (logspace) and looking at % log space used periodically in your cursor could provide you with an intelligent method of doing that as well.

When you start a campaign set a status flag as in progress and after deleting the last batch set it to completed. Doing that should allow you to kick off the script and stop it at any time and pick up where you left off.

Just a thought too, don't forget to at least update stats during this process and as soon as possible rebuild indexes as appropriate so that you don't cause the application performance to drop off.

David

@SQLTentmaker

“He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

Michael Valentine Jones SSC Guru Points: 64818 More actions · Answer 6

Since most of your tables are heaps (no clustered indexes) you may run into severe performance issues with you application that you cannot fix without either reloading the table completely or creating a clustered index.

Florian Reischl SSC-Dedicated Points: 37301 More actions · Answer 7

Hi

I have to confirm Michael, clustered indexes should be essential.

As second I'm not sure if you forgot to script all foreign keys or not. If not it is possible that your referential integrity is broken in some cases. You should first remove the zombie rows (if exist)

Your table StatCell seems to be related to your campaign table but there is no index which starts with. I don't know how many data within this table but this might become a performance problem for purging.

I would not use a cursor to delete the data. We had one to handle our archiving and had many problems with locks. I would create a temp table, insert a batch of IDs of your campaign table, delete all related data, delete the campaign data. Use small transactions to be commited.

Just some suggestions from my side...

Greets

Flo

george sibbald SSC Guru Points: 104210 More actions · Answer 8

remove the need to monitor the log space used from the procedure by setting up a performance alert in SQLAgent to backup the log when it reaches a certain percentage space used. An initial run of x campaigns should give you a good idea of how fast log is used up so you can set threshold accordingly (remember it needs time to back up the inactive log before the threshold is reached again).

This of course is presuming recovery mode is not simple.

---------------------------------------------------------------------

David Benoit SSC-Dedicated Points: 34562 More actions · Answer 9

george sibbald (5/1/2009)
remove the need to monitor the log space used from the procedure by setting up a performance alert in SQLAgent to backup the log when it reaches a certain percentage space used. An initial run of x campaigns should give you a good idea of how fast log is used up so you can set threshold accordingly (remember it needs time to back up the inactive log before the threshold is reached again).
This of course is presuming recovery mode is not simple.

Very nice. Forgot about doing that too. 🙂

David

@SQLTentmaker

“He is no fool who gives what he cannot keep to gain that which he cannot lose” - Jim Elliot

GSquared SSC Guru Points: 260824 More actions · Answer 10

I think I'd go with something at the beginning of the loop in the proc that checks the log and backs it up as needed.

That way, you don't have the concurrency issues that you might otherwise run into.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

homebrew01 SSC Guru Points: 55548 More actions · Answer 11

george sibbald (5/1/2009)
remove the need to monitor the log space used from the procedure by setting up a performance alert in SQLAgent to backup the log when it reaches a certain percentage space used. An initial run of x campaigns should give you a good idea of how fast log is used up so you can set threshold accordingly (remember it needs time to back up the inactive log before the threshold is reached again).
This of course is presuming recovery mode is not simple.

Good idea, but how does log percentage work ? Since the log can grow as needed, what is the number that the percentage is calculated from ?

Thanks for all the other great responses !

This is a 3rd party application, and the lack of clustered indexes seems strange, but I don't think I should alter their tables.

george sibbald SSC Guru Points: 104210 More actions · Answer 12

GSquared (5/1/2009)
I think I'd go with something at the beginning of the loop in the proc that checks the log and backs it up as needed.
That way, you don't have the concurrency issues that you might otherwise run into.

could you expand on that gus? do you mean avoid backing up the log at the same time you are writing to it?

You do need to be careful where you set the percentage full but its a damn good failsafe option.

---------------------------------------------------------------------

george sibbald SSC Guru Points: 104210 More actions · Answer 13

homebrew01 (5/1/2009)
george sibbald (5/1/2009)
remove the need to monitor the log space used from the procedure by setting up a performance alert in SQLAgent to backup the log when it reaches a certain percentage space used. An initial run of x campaigns should give you a good idea of how fast log is used up so you can set threshold accordingly (remember it needs time to back up the inactive log before the threshold is reached again).
This of course is presuming recovery mode is not simple.
Good idea, but how does log percentage work ? Since the log can grow as needed, what is the number that the percentage is calculated from ?
Thanks for all the other great responses !
This is a 3rd party application, and the lack of clustered indexes seems strange, but I don't think I should alter their tables.

a straight forward percentage used of the current tran log file size, same as dbcc sqlperf(logspace) would give you. will help prevent unnecessary growth in the size of your tran log file.

---------------------------------------------------------------------

Steve Jones - SSC Editor SSC Guru Points: 736255 More actions · Answer 14

there shouldn't be an issue with backing up the log while writing to it. One is reading, one writing, but SQL and the OS should manage that fine. There could be a performance issue if you are writing to the same drive for logging and backup.