February 1, 2011 at 5:23 pm
Hi All,
I have a SQL 2008 R2 Enterprise instance (named SQLUTILITY) which is monitoring six other SQL 2008 R2 instances. This instance has a single database: Sysutility_MDW. The database was created when I ran the wizard to make it a Utility Control Point. I then ran the UCP wizard on each monitored instance to point to SQLUTILITY.
I then also enabled Data Collection on each of the monitored instances to gather additional statistics (Server Activity, Query Statistics, Disk Usage). The collectors and UCP have been running on each instance just fine since I set them up on January 18th, 2011.
However, today is February 1st, 2011, and the Sysutility_MDW database is now 300GB in size. In SSMS I can see that the data file is nearly full, too, with only 1% free space. I find this a bit shocking. The monitored instances are not particularly busy, and all the collection schedules are at their default settings. I had expected a monitoring database to be a few gigabytes in size.
I see that there is a daily job to purge the database, but it runs and runs and the database never shrinks.
This is not my first implementation of MDW. We had previously set it up on our Development cluster as a clustered instance. The Purge job ran continuously and actually hammered the network because the database was on an iSCSI drive, and it saturated the switch with iSCSI disk traffic.
We shut that down and created a new UCP/MDW solution on a dedicated instance, using local storage. Now it’s not interfering with anything else, and the reports it generates are great, but the database has not stopped growing. And the Purge job is constantly running.
Things I’ve tried:
· I Googled the issue and saw that there was a database bloat issue in SQL 2008 (pre-R2), caused by orphaned records, that was supposedly corrected in SQL 2008 R2, which is what I am running (http://support.microsoft.com/kb/971823)
· I found one person who said he improved performance by adding a nonclustered index, which I went ahead and created today (http://blogs.msdn.com/b/petersad/archive/2009/04/23/sql-server-data-collector-nightly-purge-can-leave-orphaned-rows.aspx). No idea how effective it will be.
· I found a blog post (http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/) by Todd Baker who also had my problem. He solved it by modifying the stored procedure that handles purges ([Core].[SP_Purge_Data]) to use a temp table, allowing it to process records much more quickly. I copied my SP_Purge_Data to SP_Purge_Data2, cancelled my Purge_MDW SQL Agent job, and executed the sproc directly. It is still running.
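For reference, the supporting index I created is along these lines. This is a hedged sketch (the index name is mine, and the exact column list in the linked post may differ); it covers the six columns the nightly purge uses to join snapshots.query_stats to snapshots.notable_query_plan:

```sql
USE [sysutility_mdw];
GO
-- Hypothetical name; covers the six columns compared in the purge's NOT EXISTS join
CREATE NONCLUSTERED INDEX IX_query_stats_purge_helper
ON snapshots.query_stats
    ([sql_handle], plan_handle, plan_generation_num,
     statement_start_offset, statement_end_offset, creation_time);
GO
```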
I have several questions:
1. Is it supported, or advisable, to use a single MDW database for both UCP and other data collectors? Could this be introducing orphaned records?
2. Is there a recommended maximum number of instances to monitor per MDW solution? It occurs to me that the data collectors may be uploading data faster than the purge can keep up with.
3. Has anyone tried the solution suggested by Todd Baker? Does it work? Is it supportable?
4. Is there something else I am missing?
(Cross posted to TechNet forum at: http://social.msdn.microsoft.com/Forums/en-US/sqltools/thread/9c21d232-d1ea-439e-a28b-919711d706bb)
February 1, 2011 at 6:19 pm
Have you checked which table is using most of the space? I bet it is snapshots.notable_query_plan.
I had the same issue in the past with 2008 (pre-R2), and all I could do was delete the rows from that table that were older than one month, and then rebuild indexes regularly. I added the delete to the purge job, and that kept the MDW database at a controlled size.
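To confirm which tables dominate, a size query along these lines (a sketch using the standard catalog views; run it in the MDW database) will rank them:

```sql
USE [sysutility_mdw];
GO
-- Top tables by reserved space; row_count is taken from the heap/clustered index only
SELECT TOP (10)
       s.name + N'.' + t.name AS table_name,
       SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables  AS t ON t.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name
ORDER BY reserved_mb DESC;
```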
I haven't used it on R2, but I was hoping the issue was fixed. Please keep us posted if you find a proper fix.
February 2, 2011 at 10:37 am
No response from Microsoft yet, so I don't know what the official fix for this would be.
But I think I have a workaround in place. The modified stored procedure suggested in Todd Baker's blog post has worked for me. I manually ran it last night. It ran for 12 hours with the following effect:
1. The primary data file is still sized at 212GB, but it now has 34% free space (vs. 1% before). I decided to leave it at its current size.
2. The log file, which was 90GB and would not shrink (possibly due to pending transactions), went to 98% free and let me shrink it back to its original size of 10MB.
3. Apparently none of the statistical history is gone; I can still browse back to January 18th, when I first set it up.
4. Reports are rendering much, much, much faster. For some reports, it used to take 30-60 seconds (or more) for the page to load. Now it's nearly instantaneous.
5. UCP also seems to be working fine: it contains current data and it is performing faster, as well.
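To check the effect yourself, a free-space query plus a log shrink along these lines works. This is a sketch, and the log file's logical name here is an assumption (confirm it in sys.database_files first):

```sql
USE [sysutility_mdw];
GO
-- Free space per database file (size is in 8KB pages; /128 converts to MB)
SELECT name,
       size / 128 AS size_mb,
       (size - FILEPROPERTY(name, 'SpaceUsed')) / 128 AS free_mb
FROM sys.database_files;

-- Once the purge has finished and the log has cleared, shrink it back
DBCC SHRINKFILE (N'sysutility_mdw_log', 10);  -- target size in MB; logical name assumed
```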
So I think I am ready to recommend "The Baker Plan" 🙂
http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/
I was a little uncertain as to where to paste in his modified code, so I took a guess. It works for me. Note that anything you change in the business logic of the MDW database could (and probably will) be overwritten, or rendered inoperable, by a future service pack. So, caveat emptor.
Below is my modified "sp_purge_data2" sproc:
USE [sysutility_mdw]
GO
/****** Object: StoredProcedure [core].[sp_purge_data2] Script Date: 02/02/2011 08:29:15 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [core].[sp_purge_data2]
@retention_days smallint = NULL,
@instance_name sysname = NULL,
@collection_set_uid uniqueidentifier = NULL,
@duration smallint = NULL
AS
BEGIN
-- Security check (role membership)
IF (NOT (ISNULL(IS_MEMBER(N'mdw_admin'), 0) = 1) AND NOT (ISNULL(IS_SRVROLEMEMBER(N'sysadmin'), 0) = 1))
BEGIN
RAISERROR(14677, 16, -1, 'mdw_admin')
RETURN(1) -- Failure
END
-- Validate parameters
IF ((@retention_days IS NOT NULL) AND (@retention_days < 0))
BEGIN
RAISERROR(14200, -1, -1, '@retention_days')
RETURN(1) -- Failure
END
IF ((@duration IS NOT NULL) AND (@duration < 0))
BEGIN
RAISERROR(14200, -1, -1, '@duration')
RETURN(1) -- Failure
END
-- This table will contain a record if somebody requests purge to stop
-- If user requested us to purge data - we reset the content of it - and proceed with purge
-- If somebody in a different session wants purge operations to stop he adds a record
-- that we will discover while purge in progress
--
-- We don't clear this flag when we exit, since multiple purge operations with different
-- filters may proceed, and we want all of them to stop.
DELETE FROM [core].[purge_info_internal]
SET @instance_name = NULLIF(LTRIM(RTRIM(@instance_name)), N'')
-- Calculate the time when the operation should stop (NULL otherwise)
DECLARE @end_time datetime
IF (@duration IS NOT NULL)
BEGIN
SET @end_time = DATEADD(minute, @duration, GETUTCDATE())
END
-- Declare table that will be used to find what are the valid
-- candidate snapshots that could be selected for purge
DECLARE @purge_candidates table
(
snapshot_id int NOT NULL,
snapshot_time datetime NOT NULL,
instance_name sysname NOT NULL,
collection_set_uid uniqueidentifier NOT NULL
)
-- Find candidates that match the retention_days criteria (if specified)
IF (@retention_days IS NULL)
BEGIN
-- User did not specify a value for @retention_days, therefore we
-- will use the default expiration day as marked in the source info
INSERT INTO @purge_candidates
SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
FROM core.snapshots s
WHERE (GETUTCDATE() >= s.valid_through)
END
ELSE
BEGIN
-- User specified a value for @retention_days; we will use this overriding value
-- when deciding what is old enough to qualify for purge. This overrides
-- the days_until_expiration value specified in the source_info_internal table
INSERT INTO @purge_candidates
SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
FROM core.snapshots s
WHERE GETUTCDATE() >= DATEADD(DAY, @retention_days, s.snapshot_time)
END
-- Determine which is the oldest snapshot, from the list of candidates
DECLARE oldest_snapshot_cursor CURSOR FORWARD_ONLY READ_ONLY FOR
SELECT p.snapshot_id, p.instance_name, p.collection_set_uid
FROM @purge_candidates p
WHERE
((@instance_name IS NULL) or (p.instance_name = @instance_name)) AND
((@collection_set_uid IS NULL) or (p.collection_set_uid = @collection_set_uid))
ORDER BY p.snapshot_time ASC
OPEN oldest_snapshot_cursor
DECLARE @stop_purge int
DECLARE @oldest_snapshot_id int
DECLARE @oldest_instance_name sysname
DECLARE @oldest_collection_set_uid uniqueidentifier
FETCH NEXT FROM oldest_snapshot_cursor
INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid
-- As long as there are snapshots that matched the time criteria
WHILE @@FETCH_STATUS = 0
BEGIN
-- Filter out records that do not match the other filter criteria
IF ((@instance_name IS NULL) or (@oldest_instance_name = @instance_name))
BEGIN
-- There was no filter specified for instance_name or the instance matches the filter
IF ((@collection_set_uid IS NULL) or (@oldest_collection_set_uid = @collection_set_uid))
BEGIN
-- There was no filter specified for the collection_set_uid or the collection_set_uid matches the filter
BEGIN TRANSACTION tran_sp_purge_data
-- Purge data associated with this snapshot. Note: deleting this snapshot
-- triggers cascade delete in all warehouse tables based on the foreign key
-- relationship to snapshots table
-- Cascade cleanup of all data referencing the oldest snapshot
DELETE core.snapshots_internal
FROM core.snapshots_internal s
WHERE s.snapshot_id = @oldest_snapshot_id
COMMIT TRANSACTION tran_sp_purge_data
PRINT 'Snapshot #' + CONVERT(NVARCHAR(MAX),@oldest_snapshot_id) + ' purged.';
END
END
-- Check if the execution of the stored proc exceeded the @duration specified
IF (@duration IS NOT NULL)
BEGIN
IF (GETUTCDATE()>=@end_time)
BEGIN
PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
BREAK
END
END
-- Check if somebody wanted to stop the purge operation
SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
IF (@stop_purge > 0)
BEGIN
PRINT 'Stopping purge. Detected a user request to stop purge.';
BREAK
END
-- Move to next oldest snapshot
FETCH NEXT FROM oldest_snapshot_cursor
INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid
END
CLOSE oldest_snapshot_cursor
DEALLOCATE oldest_snapshot_cursor
PRINT 'Deleting orphaned rows from snapshots.notable_query_plan...'
-- Delete orphaned rows from snapshots.notable_query_plan. Query plans are not deleted by the generic purge
-- process that deletes other data (above) because query plan rows are not tied to a particular snapshot ID.
-- Purging query plans table and the smaller query text table as a special case, by looking for plans that
-- are no longer referenced by any of the rows in the snapshots.query_stats table. We need to delete these
-- rows in small chunks, since deleting many GB in a single delete statement would cause lock escalation and
-- an explosion in the size of the transaction log (individual query plans can be 10-50MB).
DECLARE @delete_batch_size bigint;
DECLARE @rows_affected int;
SET @delete_batch_size = 500;
SET @rows_affected = 500;
-- Note: with the new code below, this outer WHILE executes only once
-- (the inner loop drives @rows_affected to 0 before control returns here);
-- kept from the original proc
WHILE (@rows_affected = @delete_batch_size)
BEGIN
-- DELETE TOP (@delete_batch_size) snapshots.notable_query_plan
-- FROM snapshots.notable_query_plan AS qp
-- WHERE NOT EXISTS (
-- SELECT snapshot_id
-- FROM snapshots.query_stats AS qs
-- WHERE qs.[sql_handle] = qp.[sql_handle] AND qs.plan_handle = qp.plan_handle
-- AND qs.plan_generation_num = qp.plan_generation_num
-- AND qs.statement_start_offset = qp.statement_start_offset
-- AND qs.statement_end_offset = qp.statement_end_offset
-- AND qs.creation_time = qp.creation_time);
-- SET @rows_affected = @@ROWCOUNT;
-- IF(@rows_affected > 0)
-- BEGIN
-- RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1, @rows_affected) WITH NOWAIT;
-- END
------------------------------------------------------
----------------BEGIN NEW CODE
-- Pasted from http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/
------------------------------------------------------
SELECT
    sql_handle,
    plan_handle,
    plan_generation_num,
    statement_start_offset,
    statement_end_offset,
    creation_time
INTO #nqp
FROM snapshots.notable_query_plan AS qp
WHERE NOT EXISTS (
SELECT *
FROM snapshots.query_stats AS qs
WHERE qs.[sql_handle] = qp.[sql_handle]
AND qs.plan_handle = qp.plan_handle
AND qs.plan_generation_num = qp.plan_generation_num
AND qs.statement_start_offset = qp.statement_start_offset
AND qs.statement_end_offset = qp.statement_end_offset
AND qs.creation_time = qp.creation_time)
SET @rows_affected=1
WHILE (@rows_affected>0)
BEGIN
DELETE TOP (@delete_batch_size)
FROM snapshots.notable_query_plan
FROM #nqp n
WHERE
n.sql_handle=notable_query_plan.sql_handle
AND notable_query_plan.plan_handle = n.plan_handle
AND notable_query_plan.plan_generation_num=n.plan_generation_num
AND notable_query_plan.statement_end_offset=n.statement_end_offset
AND notable_query_plan.statement_start_offset=n.statement_start_offset
AND notable_query_plan.creation_time=n.creation_time
SET @rows_affected = @@ROWCOUNT;
IF(@rows_affected > 0)
BEGIN
RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1,
@rows_affected) WITH NOWAIT;
END
END
DROP TABLE #nqp
------------------------------------------------------
----------------END NEW CODE
------------------------------------------------------
-- Check if the execution of the stored proc exceeded the @duration specified
IF (@duration IS NOT NULL)
BEGIN
IF (GETUTCDATE()>=@end_time)
BEGIN
PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
BREAK
END
END
-- Check if somebody wanted to stop the purge operation
SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
IF (@stop_purge > 0)
BEGIN
PRINT 'Stopping purge. Detected a user request to stop purge.';
BREAK
END
END;
-- Do the same purge process for query text rows in the snapshots.notable_query_text table.
SET @rows_affected = 500;
WHILE (@rows_affected = @delete_batch_size)
BEGIN
DELETE TOP (@delete_batch_size) snapshots.notable_query_text
FROM snapshots.notable_query_text AS qt
WHERE NOT EXISTS (
SELECT snapshot_id
FROM snapshots.query_stats AS qs
WHERE qs.[sql_handle] = qt.[sql_handle]);
SET @rows_affected = @@ROWCOUNT;
IF(@rows_affected > 0)
BEGIN
RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_text', 0, 1, @rows_affected) WITH NOWAIT;
END
-- Check if the execution of the stored proc exceeded the @duration specified
IF (@duration IS NOT NULL)
BEGIN
IF (GETUTCDATE()>=@end_time)
BEGIN
PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
BREAK
END
END
-- Check if somebody wanted to stop the purge operation
SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
IF (@stop_purge > 0)
BEGIN
PRINT 'Stopping purge. Detected a user request to stop purge.';
BREAK
END
END;
END
GO
February 2, 2011 at 1:15 pm
Microsoft responded to my post:
Hello,
We are working on fixing this issue in next cumulative update. Please let us know if you would be willing to test this fix on your test machine. I can send you the fixed T-SQL code.
You can email me at sethu.srinivasan@microsoft.com
Thanks
Sethu Srinivasan
SQL Server team
http://blogs.msdn.com/sqlagent
So, apparently a fix is in the works, and it's up for grabs.
February 3, 2011 at 9:03 am
I STRONGLY STRONGLY encourage you to stop using the MDW and purchase a third party product for your server monitoring. MDW is NOT ready for prime time and will not be so for years (probably not ever to be honest with you) - the proposed "fixes" notwithstanding.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 15, 2011 at 6:32 am
Hi Kevin,
Can you please explain why you believe that it is such a bad idea to use MDW to collect data for monitoring reports/tools.
And do you have any recommendations on which third party tools I should use instead?
March 31, 2011 at 9:48 am
Saw your post about MDW and I must agree about it not being ready.
Meanwhile, I want a tool that will persist DMV data and allow me to identify which procs/queries were running during a given time slot (frequency, IO costs, etc. for that period), let me drill down for real-time problem analysis, and let me go back to that period later.
I assume they all do the gathering of straight metrics well but which one does this the best?
Sorry for semi-hijacking the thread. Maybe I should start another?
...Ray
March 31, 2011 at 10:24 am
I noticed issues with MDW while testing 2008; I personally wouldn't touch it at present.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
April 1, 2011 at 9:52 am
Update:
I've been using MDW and UCP together for months now, and they have been working fine. Apparently the fix I used has made a big difference. I have eight instances reporting to a single utility server (using Data Collection agents and Utility Control Point agents) and the reports have been very useful for trending and baselining performance, identifying causes of waits, and tracking disk usage. No performance issues, database bloating, disk thrashing, or slow reports. My customers have been very happy with this solution.
For what you get for free (or at least, with the SQL license), it serves me very well. So I don't know why I'm being advised to abandon it and invest thousands of dollars in a solution that may or may not achieve the same result.
The point of this solution is to give my customers some metrics they can use, and to "dumb it down" a little so they can see the impact of bad indexes, expensive queries, and pressure from memory or CPU. It gives the database designers feedback on how the changes they make have impacted performance. True, it has required me to fiddle around and make changes to an imperfect monitoring solution, but since when have any of us shied away from that?
The point is, I can never seem to convince my customers to buy tools from Idera or Redgate, even after demos and evaluation. But every time I've shown them MDW and UCP, they immediately liked it, especially now that I've stabilized it. Make of that what you will. I'm going to keep installing it.
May 4, 2011 at 3:15 pm
I have been using MDW for 3 days and my ManagementDW database is averaging 1GB/day growth.
I am manually running the purge, but I don't seem to be getting the same space reduction, and I only have 11,870 rows in notable_query_plan:
SizeMB   RowCount   TableName
433      11870      notable_query_plan
I am going to continue using MDW, as we have the Microsoft stack of products, and see where it goes.
April 3, 2014 at 1:07 am
I have copied the above stored procedure and executed it in my SQL editor window. The proc has been created in the database. What is the next step? Do we need to execute the proc to make the free space available in this MDW database?
Please help me.
April 3, 2014 at 10:10 am
Yes, I would execute it and see if it helps. What version of SQL are you using?
I would also recommend posting your question on the TechNet forums, because those are monitored by Microsoft employees and they generally respond within 24 hours. I suggest that because this is a very old thread and not many people will see your question. A new thread on TechNet is likely to get a better response.
I never did figure out why the purge job wasn't working. The fix I posted is something I saw on a blog. I tried it and it worked for me. You might see some improvement but it sounds like your situation is much worse than mine. Consider opening a ticket with Microsoft support.
Keep in mind that the change I made to Microsoft's stored procedure is not supported.
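For completeness, running it and checking the result looks roughly like this (a sketch; @duration is in minutes and is optional, per the proc's parameter list above):

```sql
USE [sysutility_mdw];
GO
-- Run the modified purge, capped at 2 hours (omit @duration to run to completion)
EXEC [core].[sp_purge_data2] @duration = 120;

-- Then check how much space the database has reclaimed internally
EXEC sp_spaceused;
```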