October 31, 2011 at 7:26 am
I'm predominantly from an Oracle background, so not particularly familiar with SQL Server, though there will be similarities.
Not sure where the best place to post this is, but I am a SQL Server newbie - please move if it's more appropriate elsewhere.
We have an audit system (it is actually part of the Business Objects software, however the issue is more related to SQL Server than Business Objects) which is running on SQL Server 2005 (9.00.4035.00, SP3).
However, it has no built-in system for tidying up data, and the data volume is growing on a daily basis.
The two tables have grown to millions of records (one has 13 million, the other 80 million).
This causes a couple of problems:
1) Running audits is taking longer, and the daily report sometimes fails.
2) The server team are worried about disk space, as the SQL Server cluster also runs other systems.
So, what I am looking for is a way to do a quick one-off data cull. We don't need to go back to day 1 - I want to keep some of the existing data but remove the records that are, say, over 6 months old.
Being an Oracle person, I'm using the SQL Developer app (with a SQL Developer Java driver installed) to run queries against the system. Running queries this way takes ages because there is so much data - even getting a count on one of the tables takes about half an hour. Therefore I need a way to cull the data without it taking hours (I'm guessing the database is generating rollbacks or something in the background when using deletes).
The query on the larger table is fairly simple:
delete
from audit_event
where start_timestamp < '01-may-2011'
The query on the smaller table will be more involved and needs rewriting to work in SQL Server, but will be of this form:
delete
from audit_detail d
where not exists (select 'x'
from audit_event e
where e.event_id = d.event_id
and e.server_cuid = d.server_cuid)
October 31, 2011 at 7:57 am
The second query should work as written in T-SQL. Are you getting an error message from it, or just assuming it needs to be rewritten?
The usual way to cut down the locking and such on large deletes is to loop with a row limit on it.
select 1 as Col1 into #T; -- primes @@ROWCOUNT so the loop body runs at least once
while @@rowcount > 0
delete top (1000) from MyTable where MyColumn = MyCriteria; -- MyTable/MyColumn/MyCriteria are placeholders
That's a generic version of it. With that, it won't matter much how long it takes to run, since it will take smaller, shorter locks, and interfere less with other processes.
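Applied to the audit_event table from the earlier post, that pattern might look something like the sketch below (table and column names are taken from the original queries; the unambiguous yyyymmdd date literal is my addition, used to avoid dependence on the session's language/date-format settings):

```sql
-- Prime @@ROWCOUNT so the loop body executes at least once
SELECT 1 AS Col1 INTO #T;

WHILE @@ROWCOUNT > 0
BEGIN
    -- Each pass deletes at most 1,000 old rows in its own short
    -- transaction, keeping locks brief and log growth manageable
    DELETE TOP (1000)
    FROM audit_event
    WHERE start_timestamp < '20110501';
END;
```

If the database is in SIMPLE recovery, the log space from each batch can be reused as the loop runs; in FULL recovery, regular log backups during the purge serve the same purpose.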
The other thing you'll want to look into is partitioning the tables. At least on the bigger one, you could partition based on the date column, and that would give you the ability to very rapidly drop the data you no longer need.
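As a rough illustration of date-based partitioning on the larger table, something like the following could work (the function/scheme names, column types, and single-filegroup layout here are my assumptions, not taken from the original system; note that on SQL Server 2005 table partitioning requires Enterprise Edition):

```sql
-- Monthly boundaries; RANGE RIGHT places each boundary date
-- in the partition to its right
CREATE PARTITION FUNCTION pf_audit_monthly (datetime)
AS RANGE RIGHT FOR VALUES ('20110401', '20110501', '20110601');

-- Map every partition to PRIMARY for simplicity; a production
-- layout might spread partitions across filegroups
CREATE PARTITION SCHEME ps_audit_monthly
AS PARTITION pf_audit_monthly ALL TO ([PRIMARY]);

-- A table built on the scheme; an old month can then be removed
-- almost instantly by switching its partition out to an empty
-- staging table and dropping that table
CREATE TABLE audit_event_partitioned
(
    event_id        int      NOT NULL,
    server_cuid     varchar(50) NOT NULL,
    start_timestamp datetime NOT NULL
)
ON ps_audit_monthly (start_timestamp);
```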
Does that help?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
November 2, 2011 at 4:45 am
Thanks for that, and apologies for the delay in replying.
I will give those tips a try - they look like what I was after.
Partitioning is probably something we will look at in the future, but I think in the meantime we'll just clear out the data that is not needed.
I have not got as far as running the second query; I just assumed there would be a better way to write it in T-SQL.
November 2, 2011 at 6:27 am
There are other ways to write it in T-SQL. "Better" is situational on all of the variations.
You'll want to take a look at Gail Shaw's blog entry on the subject: http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/
It lists a few other articles at the end, and they're all useful material.
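For reference, the LEFT OUTER JOIN variant that blog entry compares against NOT EXISTS, applied to the delete from the original post, would look something like this (semantically equivalent to the NOT EXISTS form when the join columns are not nullable):

```sql
-- Delete detail rows that have no matching parent event
DELETE d
FROM audit_detail AS d
LEFT OUTER JOIN audit_event AS e
    ON  e.event_id    = d.event_id
    AND e.server_cuid = d.server_cuid
WHERE e.event_id IS NULL;  -- NULL here means no match was found
```

As the blog discusses, the optimizer often produces the same anti-semi-join plan for both forms, so the choice is largely one of readability.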
November 3, 2011 at 2:14 pm
laurence.pulle (10/31/2011)
The query on the larger table is fairly simple:
delete
from audit_event
where start_timestamp < '01-may-2011'
The query on the smaller table will be more involved and needs rewriting to work in sql server but will be of this form:
delete
from audit_detail d
where not exists (select 'x'
from audit_event e
where e.event_id = d.event_id
and e.server_cuid = d.server_cuid)
I'm going to suggest a few things. If they seem too simplistic for someone with your experience, I apologize. I just want to make sure that the right things are being done to give you good performance:
1. Make sure that the "start_timestamp" column is indexed for efficient query execution.
2. Take Gus's advice on a loop to delete a few records at a time. It's important for many reasons. And stay below 5000 row deletions at a time: 5000 or more record locks can trigger escalation to a table lock.
3. Instead of using COUNT to determine the number of records in a table, consider using MAX(id) if the table has an IDENTITY column named "id" (or something else). If there are no gaps in the IDENTITY column's numerical sequence, this will give you the record count. If records have been deleted from the beginning of the table, but not from the middle, you should be able to compute MAX(id) minus MIN(id) plus 1 to get an accurate record count.
4. If you can avoid using the reserved word "NOT" (as in NOT EXISTS), you will in general get faster performance. NOTs cause scans.
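A minimal sketch of points 1 and 3 above, using the names from the queries in this thread (whether event_id is actually an IDENTITY column here is an assumption):

```sql
-- Point 1: support the date-ranged delete with an index on start_timestamp
CREATE INDEX ix_audit_event_start_timestamp
    ON audit_event (start_timestamp);

-- Point 3: estimate the row count from the identity range instead of COUNT(*)
-- (only accurate while the sequence has no gaps in the middle)
SELECT MAX(event_id) - MIN(event_id) + 1 AS approx_row_count
FROM audit_event;
```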
LC