November 9, 2006 at 3:38 pm
I know you guys will have some great ideas for this:
I've got a table with just over 100,000,000 rows. Each row can be as wide as 158 bytes. I need to delete around 70,000,000 of these rows and I'm seeking the fastest solution.
Some things to keep in mind:
Concurrency is an issue. Off-peak, SELECTs run 3-5/min, UPDATEs 5-8/min, and INSERTs 0. For this reason, we have to keep blocking to an absolute minimum. If we decide to SELECT the 30M rows out of the table rather than DELETE the 70M, we can SELECT WITH (NOLOCK) and clean up the drift later.
The indexes are based on typical access patterns. There is a clustered index on cust_num, order_id; order_id is also a nonclustered PK. There are two columns we can use to determine which rows to delete or keep: order_id is a VARCHAR(55) whose first 8 bytes are the date the record was created, and there is also a datestamp column.
Here is the basic table structure:
CREATE TABLE [dbo].[orders] (
    [order_id] [varchar] (55) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
    [cust_num] [varchar] (16) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [creation_date] [datetime] NULL ,
    --....
    CONSTRAINT [PKN__orders__order_id] PRIMARY KEY NONCLUSTERED
    (
        [order_id]
    ) ON [INDEXES]
) ON [PRIMARY]
GO
CREATE UNIQUE CLUSTERED INDEX [IXC__orders__cust_num] ON [dbo].[orders]([cust_num]) ON [PRIMARY]
GO
An order_id looks like this:
20060719121929-9af44fee-33bd-48f3-b286-efd23f8aaa77
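Since the leading 8 characters of order_id are the creation date in yyyymmdd form, a plain string-range comparison on order_id stays sargable against the nonclustered PK. A minimal sketch (the cutoff date is an assumption for illustration, not from the thread):
-- Count rows created before a hypothetical cutoff. Because order_id
-- starts with yyyymmdd, an ordinary range predicate can seek on the
-- nonclustered PK; no LEFT() or SUBSTRING() is needed.
SELECT COUNT(*)
FROM dbo.orders WITH (NOLOCK)  -- the thread accepts NOLOCK drift
WHERE order_id < '20060801'    -- hypothetical cutoff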
I'm very interested to hear your response.
Thanks!
Robert Cary
SQL guy and Houston Magician
November 9, 2006 at 6:05 pm
Robert,
Shifting gears a bit, how's the performance of inserts of new records? How about for a returning customer with a new order? Your clustered index is VERY contrary to the Primary Key and doesn't seem to have all that you indicated... according to what you posted, a customer may only have 1 entry in the table, because the unique key is on cust_num alone and not the combination of columns you indicated in your write-up. Of course, maybe you assign a new cust_num even for returning customers (in which case, you may have an even bigger problem with the design). This is from your first posting...
CREATE UNIQUE CLUSTERED INDEX [IXC__orders__cust_num] ON [dbo].[orders]([cust_num]) ON [PRIMARY]
Also, Sergiy's code works great for something like this, but you need to be aware of the simple math here... if you are trying to delete 70,000,000 rows 500 at a time with a 2-second delay, that's 140,000 batches and roughly 78 hours in delays alone... don't stop the job just because it seems to be taking a long time...
Also, I'd have to say things have gotten pretty much out of hand in that table. Once you get it cleaned up, you might want to consider running Sergiy's code as a scheduled job at least once a day.
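(Sergiy's actual code isn't quoted in this thread; the following is only a minimal sketch of a batched delete in that spirit, with the batch size, delay, and cutoff taken from the numbers above rather than from his post.)
SET ROWCOUNT 500                      -- limit each DELETE to 500 rows
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.orders
    WHERE order_id < '20060801'       -- hypothetical cutoff
    IF @@ROWCOUNT = 0 BREAK           -- nothing left in range
    WAITFOR DELAY '00:00:02'          -- yield to concurrent work
END
SET ROWCOUNT 0                        -- restore the default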
--Jeff Moden
Change is inevitable... Change for the better is not.
November 10, 2006 at 7:03 am
Have you thought about archiving this data on another server? I doubt that this type of information is to be deleted... ever!
November 10, 2006 at 10:06 am
It looks like I made a mistake in the schema def. The unique clustered key is actually on (cust_num, order_id).
Almost all selects will join cust_num with another table also clustered on cust_num.
It's also worth mentioning that the orders table does not contain orders in the conventional sense of the word. For this reason, we only need to keep three months' worth of data.
I like Sergiy's approach and have something similar to that in place as an ongoing nightly process. But I'm working on the big cleanup right now.
The access pattern typically returns all (or the top 100) orders for a customer. Occasionally we look up a single order by order_id. Again, the table is clustered on (cust_num, order_id).
SQL guy and Houston Magician
November 10, 2006 at 10:14 am
It's worth pointing out that I have about a 5-hour window to run this, so it may need to span multiple days. One approach I had considered would be to SELECT the 30,000,000 keeper rows into a new table, switch the tables, and clean up any drift. I fear, though, that an insert in a single transaction would be huge (several GB), and multiple inserts would require multiple scans.
Looking at the index, my colleague has suggested building a table of rows to delete, keyed on the clustered index (cust_num, order_id). That would make for much smaller transactions and would enable index seeks on the rows to delete (see the sketch below). This may be the best option, but I am very interested to hear what the SQL Server intelligentsia have to add.
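As a hedged sketch of that idea (the table names, batch size, and cutoff are all assumptions, not my colleague's actual code):
-- Stage the clustered keys of the doomed rows once...
SELECT cust_num, order_id
INTO dbo.orders_to_delete
FROM dbo.orders WITH (NOLOCK)
WHERE order_id < '20060801'           -- hypothetical cutoff

CREATE CLUSTERED INDEX IXC_otd ON dbo.orders_to_delete (cust_num, order_id)

-- ...then delete in keyed batches that can seek on the clustered index.
SET ROWCOUNT 10000
WHILE 1 = 1
BEGIN
    DELETE o
    FROM dbo.orders o
    INNER JOIN dbo.orders_to_delete d
        ON o.cust_num = d.cust_num
       AND o.order_id = d.order_id
    IF @@ROWCOUNT = 0 BREAK
    WAITFOR DELAY '00:00:02'
END
SET ROWCOUNT 0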
Thanks so much!
SQL guy and Houston Magician
November 10, 2006 at 6:59 pm
If you run Sergiy's method, you can let it "crawl" all day... it will not interfere with normal daily processes...
--Jeff Moden
Change is inevitable... Change for the better is not.
November 13, 2006 at 10:06 am
SELECT ... INTO <new table> FROM <old table> WHERE <rows to keep>
DROP TABLE <old table>
EXEC sp_rename '<new table>', '<old table>'
barring any "constraints"
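Made concrete against the posted schema (the three-month keep window comes from Robert's earlier post; the column list and table names are assumptions):
-- Build the replacement table from the keeper rows, then swap names.
SELECT *
INTO dbo.orders_new
FROM dbo.orders
WHERE creation_date >= DATEADD(month, -3, GETDATE())  -- rows to keep

DROP TABLE dbo.orders
EXEC sp_rename 'dbo.orders_new', 'orders'
-- SELECT...INTO copies no indexes or constraints, so the PK and the
-- clustered index have to be recreated afterward.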