February 1, 2011 at 5:23 pm
Hi All,
I have a SQL 2008 R2 Enterprise instance (named SQLUTILITY) which is monitoring six other SQL 2008 R2 instances. This instance has a single database: Sysutility_MDW. The database was created when I ran the wizard to make it a Utility Control Point. I then ran the UCP wizard on each monitored instance to point to SQLUTILITY.
I then also enabled Data Collection on each of the monitored instances to gather additional statistics (Server Activity, Query Statistics, Disk Usage). The collectors and UCP have been running on each instance just fine since I set them up on January 18th, 2011.
However, today is February 1st, 2011, and the Sysutility_MDW database is now 300GB in size. In SSMS I can see that the data file is nearly full, too, with only 1% free space. I find this a bit shocking. The monitored instances are not particularly busy, and all the collection schedules are at their default settings. I had expected a monitoring database to be a few gigabytes in size.
I see that there is a daily job to purge the database, but it runs and runs and the database never shrinks.
This is not my first implementation of MDW. We had previously set it up on our Development cluster as a clustered instance. The Purge job ran continuously and actually hammered the network because the database was on an iSCSI drive, and it saturated the switch with iSCSI disk traffic.
We shut that down and created a new UCP/MDW solution on a dedicated instance, using local storage. Now it’s not interfering with anything else, and the reports it generates are great, but the database has not stopped growing. And the Purge job is constantly running.
Things I’ve tried:
· I Googled the issue and saw that there was a database bloat issue in SQL 2008 (pre-R2), caused by orphaned records, that was supposedly corrected in SQL 2008 R2, which is what I am running (http://support.microsoft.com/kb/971823)
· I found one person who said he improved performance by adding a nonclustered index, which I went ahead and created today (http://blogs.msdn.com/b/petersad/archive/2009/04/23/sql-server-data-collector-nightly-purge-can-leave-orphaned-rows.aspx). No idea how effective it will be.
· I found a blog post (http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/) by Todd Baker who also had my problem. He solved it by modifying the stored procedure that handles purges ([Core].[SP_Purge_Data]) to use a temp table, allowing it to process records much more quickly. I copied my SP_Purge_Data to SP_Purge_Data2, cancelled my Purge_MDW SQL Agent job, and executed the sproc directly. It is still running.
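For reference, the supporting index I created is along these lines. This is a hedged sketch (the index name is mine, and the exact column list in the linked post may differ); it covers the six columns the nightly purge uses to join snapshots.query_stats to snapshots.notable_query_plan:

```sql
USE [sysutility_mdw];
GO
-- Hypothetical name; covers the six columns compared in the purge's NOT EXISTS join
CREATE NONCLUSTERED INDEX IX_query_stats_purge_helper
ON snapshots.query_stats
    ([sql_handle], plan_handle, plan_generation_num,
     statement_start_offset, statement_end_offset, creation_time);
GO
```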
I have several questions:
1. Is it supported, or advisable, to use a single MDW database for both UCP and other data collectors? Could this be introducing orphaned records?
2. Is there a recommended maximum number of instances to monitor per MDW solution? It occurs to me that the data collectors may be uploading data faster than the purge can keep up with.
3. Has anyone tried the solution suggested by Todd Baker? Does it work? Is it supportable?
4. Is there something else I am missing?
(Cross posted to TechNet forum at: http://social.msdn.microsoft.com/Forums/en-US/sqltools/thread/9c21d232-d1ea-439e-a28b-919711d706bb)
February 1, 2011 at 6:19 pm
Have you checked which table is using most of the space? I bet it is snapshots.notable_query_plan.
I had the same issue in the past with 2008 (pre-R2), and all I could do was delete the rows from that table that were older than one month, and then rebuild indexes regularly. I added the delete to the purge job, and that kept the MDW database at a controlled size.
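To confirm which tables dominate, a size query along these lines (a sketch using the standard catalog views; run it in the MDW database) will rank them:

```sql
USE [sysutility_mdw];
GO
-- Top tables by reserved space; row_count is taken from the heap/clustered index only
SELECT TOP (10)
       s.name + N'.' + t.name AS table_name,
       SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables  AS t ON t.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name
ORDER BY reserved_mb DESC;
```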
I haven't used it on R2, but I was hoping the issue was fixed. Please keep us posted if you find a proper fix.
February 2, 2011 at 10:37 am
No response from Microsoft yet, so I don't know what the official fix for this would be.
But I think I have a workaround in place. The modified stored procedure suggested in Todd Baker's blog post has worked for me. I manually ran it last night. It ran for 12 hours with the following effect:
1. The primary data file is still sized at 212GB, but it now has 34% free space (vs. 1% before). I decided to leave it at its current size.
2. The log file, which was 90GB and would not shrink (possibly due to pending transactions), went to 98% free and let me shrink it back to its original size of 10MB.
3. Apparently none of the statistical history is gone; I can still browse back to January 18th, when I first set it up.
4. Reports are rendering much, much, much faster. For some reports, it used to take 30-60 seconds (or more) for the page to load. Now it's nearly instantaneous.
5. UCP also seems to be working fine: it contains current data and it is performing faster, as well.
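To check the effect yourself, a free-space query plus a log shrink along these lines works. This is a sketch, and the log file's logical name here is an assumption (confirm it in sys.database_files first):

```sql
USE [sysutility_mdw];
GO
-- Free space per database file (size is in 8KB pages; /128 converts to MB)
SELECT name,
       size / 128 AS size_mb,
       (size - FILEPROPERTY(name, 'SpaceUsed')) / 128 AS free_mb
FROM sys.database_files;

-- Once the purge has finished and the log has cleared, shrink it back
DBCC SHRINKFILE (N'sysutility_mdw_log', 10);  -- target size in MB; logical name assumed
```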
So I think I am ready to recommend "The Baker Plan" 🙂
http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/
I was a little uncertain as to where to paste in his modified code, so I took a guess. It works for me. Note that anything you change in the business logic of the MDW database could (and probably will) be overwritten, or rendered inoperable, by a future service pack. So, caveat emptor.
Below is my modified "sp_purge_data2" sproc:
USE [sysutility_mdw]
GO
/****** Object: StoredProcedure [core].[sp_purge_data2] Script Date: 02/02/2011 08:29:15 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [core].[sp_purge_data2]
@retention_days smallint = NULL,
@instance_name sysname = NULL,
@collection_set_uid uniqueidentifier = NULL,
@duration smallint = NULL
AS
BEGIN
-- Security check (role membership)
IF (NOT (ISNULL(IS_MEMBER(N'mdw_admin'), 0) = 1) AND NOT (ISNULL(IS_SRVROLEMEMBER(N'sysadmin'), 0) = 1))
BEGIN
RAISERROR(14677, 16, -1, 'mdw_admin')
RETURN(1) -- Failure
END
-- Validate parameters
IF ((@retention_days IS NOT NULL) AND (@retention_days < 0))
BEGIN
RAISERROR(14200, -1, -1, '@retention_days')
RETURN(1) -- Failure
END
IF ((@duration IS NOT NULL) AND (@duration < 0))
BEGIN
RAISERROR(14200, -1, -1, '@duration')
RETURN(1) -- Failure
END
-- This table will contain a record if somebody requests purge to stop
-- If user requested us to purge data - we reset the content of it - and proceed with purge
-- If somebody in a different session wants purge operations to stop he adds a record
-- that we will discover while purge in progress
--
-- We don't clear this flag when we exit, since multiple purge operations with different
-- filters may proceed, and we want all of them to stop.
DELETE FROM [core].[purge_info_internal]
SET @instance_name = NULLIF(LTRIM(RTRIM(@instance_name)), N'')
-- Calculate the time when the operation should stop (NULL otherwise)
DECLARE @end_time datetime
IF (@duration IS NOT NULL)
BEGIN
SET @end_time = DATEADD(minute, @duration, GETUTCDATE())
END
-- Declare table that will be used to find what are the valid
-- candidate snapshots that could be selected for purge
DECLARE @purge_candidates table
(
snapshot_id int NOT NULL,
snapshot_time datetime NOT NULL,
instance_name sysname NOT NULL,
collection_set_uid uniqueidentifier NOT NULL
)
-- Find candidates that match the retention_days criteria (if specified)
IF (@retention_days IS NULL)
BEGIN
-- User did not specify a value for @retention_days, therefore we
-- will use the default expiration day as marked in the source info
INSERT INTO @purge_candidates
SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
FROM core.snapshots s
WHERE (GETUTCDATE() >= s.valid_through)
END
ELSE
BEGIN
-- User specified a value for @retention_days; we will use this overriding value
-- when deciding what is old enough to qualify for purge. This overrides
-- the days_until_expiration value specified in the source_info_internal table
INSERT INTO @purge_candidates
SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
FROM core.snapshots s
WHERE GETUTCDATE() >= DATEADD(DAY, @retention_days, s.snapshot_time)
END
-- Determine which is the oldest snapshot, from the list of candidates
DECLARE oldest_snapshot_cursor CURSOR FORWARD_ONLY READ_ONLY FOR
SELECT p.snapshot_id, p.instance_name, p.collection_set_uid
FROM @purge_candidates p
WHERE
((@instance_name IS NULL) or (p.instance_name = @instance_name)) AND
((@collection_set_uid IS NULL) or (p.collection_set_uid = @collection_set_uid))
ORDER BY p.snapshot_time ASC
OPEN oldest_snapshot_cursor
DECLARE @stop_purge int
DECLARE @oldest_snapshot_id int
DECLARE @oldest_instance_name sysname
DECLARE @oldest_collection_set_uid uniqueidentifier
FETCH NEXT FROM oldest_snapshot_cursor
INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid
-- As long as there are snapshots that matched the time criteria
WHILE @@FETCH_STATUS = 0
BEGIN
-- Filter out records that do not match the other filter criteria
IF ((@instance_name IS NULL) or (@oldest_instance_name = @instance_name))
BEGIN
-- There was no filter specified for instance_name or the instance matches the filter
IF ((@collection_set_uid IS NULL) or (@oldest_collection_set_uid = @collection_set_uid))
BEGIN
-- There was no filter specified for the collection_set_uid or the collection_set_uid matches the filter
BEGIN TRANSACTION tran_sp_purge_data
-- Purge data associated with this snapshot. Note: deleting this snapshot
-- triggers cascade delete in all warehouse tables based on the foreign key
-- relationship to snapshots table
-- Cascade cleanup of all data referencing the oldest snapshot
DELETE core.snapshots_internal
FROM core.snapshots_internal s
WHERE s.snapshot_id = @oldest_snapshot_id
COMMIT TRANSACTION tran_sp_purge_data
PRINT 'Snapshot #' + CONVERT(NVARCHAR(MAX),@oldest_snapshot_id) + ' purged.';
END
END
-- Check if the execution of the stored proc exceeded the @duration specified
IF (@duration IS NOT NULL)
BEGIN
IF (GETUTCDATE()>=@end_time)
BEGIN
PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
BREAK
END
END
-- Check if somebody wanted to stop the purge operation
SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
IF (@stop_purge > 0)
BEGIN
PRINT 'Stopping purge. Detected a user request to stop purge.';
BREAK
END
-- Move to next oldest snapshot
FETCH NEXT FROM oldest_snapshot_cursor
INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid
END
CLOSE oldest_snapshot_cursor
DEALLOCATE oldest_snapshot_cursor
PRINT 'Deleting orphaned rows from snapshots.notable_query_plan...'
-- Delete orphaned rows from snapshots.notable_query_plan. Query plans are not deleted by the generic purge
-- process that deletes other data (above) because query plan rows are not tied to a particular snapshot ID.
-- Purging query plans table and the smaller query text table as a special case, by looking for plans that
-- are no longer referenced by any of the rows in the snapshots.query_stats table. We need to delete these
-- rows in small chunks, since deleting many GB in a single delete statement would cause lock escalation and
-- an explosion in the size of the transaction log (individual query plans can be 10-50MB).
DECLARE @delete_batch_size bigint;
DECLARE @rows_affected int;
SET @delete_batch_size = 500;
SET @rows_affected = 500;
-- Note: with the new code below, this outer WHILE executes only once
-- (the inner loop drives @rows_affected to 0 before control returns here);
-- kept from the original proc
WHILE (@rows_affected = @delete_batch_size)
BEGIN
-- DELETE TOP (@delete_batch_size) snapshots.notable_query_plan
-- FROM snapshots.notable_query_plan AS qp
-- WHERE NOT EXISTS (
-- SELECT snapshot_id
-- FROM snapshots.query_stats AS qs
-- WHERE qs.[sql_handle] = qp.[sql_handle] AND qs.plan_handle = qp.plan_handle
-- AND qs.plan_generation_num = qp.plan_generation_num
-- AND qs.statement_start_offset = qp.statement_start_offset
-- AND qs.statement_end_offset = qp.statement_end_offset
-- AND qs.creation_time = qp.creation_time);
-- SET @rows_affected = @@ROWCOUNT;
-- IF(@rows_affected > 0)
-- BEGIN
-- RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1, @rows_affected) WITH NOWAIT;
-- END
------------------------------------------------------
----------------BEGIN NEW CODE
-- Pasted from http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/
------------------------------------------------------
SELECT
    sql_handle,
    plan_handle,
    plan_generation_num,
    statement_start_offset,
    statement_end_offset,
    creation_time
INTO #nqp
FROM snapshots.notable_query_plan AS qp
WHERE NOT EXISTS (
SELECT *
FROM snapshots.query_stats AS qs
WHERE qs.[sql_handle] = qp.[sql_handle]
AND qs.plan_handle = qp.plan_handle
AND qs.plan_generation_num = qp.plan_generation_num
AND qs.statement_start_offset = qp.statement_start_offset
AND qs.statement_end_offset = qp.statement_end_offset
AND qs.creation_time = qp.creation_time)
SET @rows_affected=1
WHILE (@rows_affected>0)
BEGIN
DELETE TOP (@delete_batch_size)
FROM snapshots.notable_query_plan
FROM #nqp n
WHERE
n.sql_handle=notable_query_plan.sql_handle
AND notable_query_plan.plan_handle = n.plan_handle
AND notable_query_plan.plan_generation_num=n.plan_generation_num
AND notable_query_plan.statement_end_offset=n.statement_end_offset
AND notable_query_plan.statement_start_offset=n.statement_start_offset
AND notable_query_plan.creation_time=n.creation_time
SET @rows_affected = @@ROWCOUNT;
IF(@rows_affected > 0)
BEGIN
RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1,
@rows_affected) WITH NOWAIT;
END
END
DROP TABLE #nqp
------------------------------------------------------
----------------END NEW CODE
------------------------------------------------------
-- Check if the execution of the stored proc exceeded the @duration specified
IF (@duration IS NOT NULL)
BEGIN
IF (GETUTCDATE()>=@end_time)
BEGIN
PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
BREAK
END
END
-- Check if somebody wanted to stop the purge operation
SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
IF (@stop_purge > 0)
BEGIN
PRINT 'Stopping purge. Detected a user request to stop purge.';
BREAK
END
END;
-- Do the same purge process for query text rows in the snapshots.notable_query_text table.
SET @rows_affected = 500;
WHILE (@rows_affected = @delete_batch_size)
BEGIN
DELETE TOP (@delete_batch_size) snapshots.notable_query_text
FROM snapshots.notable_query_text AS qt
WHERE NOT EXISTS (
SELECT snapshot_id
FROM snapshots.query_stats AS qs
WHERE qs.[sql_handle] = qt.[sql_handle]);
SET @rows_affected = @@ROWCOUNT;
IF(@rows_affected > 0)
BEGIN
RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_text', 0, 1, @rows_affected) WITH NOWAIT;
END
-- Check if the execution of the stored proc exceeded the @duration specified
IF (@duration IS NOT NULL)
BEGIN
IF (GETUTCDATE()>=@end_time)
BEGIN
PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX),@duration) + ' minutes passed since the start of operation.';
BREAK
END
END
-- Check if somebody wanted to stop the purge operation
SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
IF (@stop_purge > 0)
BEGIN
PRINT 'Stopping purge. Detected a user request to stop purge.';
BREAK
END
END;
END
GO
February 2, 2011 at 1:15 pm
Microsoft responded to my post:
Hello,
We are working on fixing this issue in next cumulative update. Please let us know if you would be willing to test this fix on your test machine. I can send you the fixed T-SQL code.
You can email me at sethu.srinivasan@microsoft.com
Thanks
Sethu Srinivasan
SQL Server team
http://blogs.msdn.com/sqlagent
So, apparently a fix is in the works, and it's up for grabs.
February 3, 2011 at 9:03 am
I STRONGLY STRONGLY encourage you to stop using the MDW and purchase a third party product for your server monitoring. MDW is NOT ready for prime time and will not be so for years (probably not ever to be honest with you) - the proposed "fixes" notwithstanding.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 15, 2011 at 6:32 am
Hi Kevin,
Can you please explain why you believe that it is such a bad idea to use MDW to collect data for monitoring reports/tools.
And do you have any recommendations on which third party tools I should use instead?
March 31, 2011 at 9:48 am
Saw your post about MDW and I must agree about it not being ready.
Meanwhile, I want a tool that will persist DMV data and allow me to identify which procs/queries were running during a given time slot (frequency, IO costs, etc. for that period), let me drill down for real-time problem analysis, and let me go back to that period later.
I assume they all do the gathering of straight metrics well but which one does this the best?
Sorry for semi-hijacking the thread. Maybe I should start another?
...Ray
March 31, 2011 at 10:24 am
I noticed issues with MDW while testing 2008; I personally wouldn't touch it at present.
-----------------------------------------------------------------------------------------------------------
"Ya can't make an omelette without breaking just a few eggs" 😉
April 1, 2011 at 9:52 am
Update:
I've been using MDW and UCP together for months now, and they have been working fine. Apparently the fix I used has made a big difference. I have eight instances reporting to a single utility server (using Data Collection agents and Utility Control Point agents) and the reports have been very useful for trending and baselining performance, identifying causes of waits, and tracking disk usage. No performance issues, database bloating, disk thrashing, or slow reports. My customers have been very happy with this solution.
For what you get for free (or at least, with the SQL license), it serves me very well. So I don't know why I'm being advised to abandon it and invest thousands of dollars in a solution that may or may not achieve the same result.
The point of this solution is to give my customers some metrics they can use, and to "dumb it down" a little so they can see the impact of bad indexes, expensive queries, and pressure from memory or CPU. It gives the database designers feedback on how the changes they make have impacted performance. True, it has required me to fiddle around and make changes to an imperfect monitoring solution, but since when have any of us shied away from that?
The point is, I can never seem to convince my customers to buy tools from Idera or Redgate, even after demos and evaluation. But every time I've shown them MDW and UCP, they immediately liked it, especially now that I've stabilized it. Make of that what you will. I'm going to keep installing it.
May 4, 2011 at 3:15 pm
I have been using MDW for 3 days and my ManagementDW database is averaging 1GB/day growth.
I am manually running the purge, but I don't seem to be getting the same space reduction, and I only have 11,870 rows in notable_query_plan:
SizeMB   RowCount   TableName
433      11870      notable_query_plan
I am going to continue using MDW, as we have the Microsoft stack of products, and see where it goes.
April 3, 2014 at 1:07 am
I have copied the above stored procedure and executed it in my SQL editor window. The proc has been created in the database. What is the next step? Do we need to execute the proc to make the free space available in this MDW database?
Please help me.
April 3, 2014 at 10:10 am
Yes, I would execute it and see if it helps. What version of SQL are you using?
I would also recommend posting your question on the TechNet forums, because those are monitored by Microsoft employees and they generally respond within 24 hours. I suggest that because this is a very old thread and not many people will see your question. A new thread on TechNet is likely to get a better response.
I never did figure out why the purge job wasn't working. The fix I posted is something I saw on a blog. I tried it and it worked for me. You might see some improvement but it sounds like your situation is much worse than mine. Consider opening a ticket with Microsoft support.
Keep in mind that the change I made to Microsoft's stored procedure is not supported.
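For completeness, running it and checking the result looks roughly like this (a sketch; @duration is in minutes and is optional, per the proc's parameter list above):

```sql
USE [sysutility_mdw];
GO
-- Run the modified purge, capped at 2 hours (omit @duration to run to completion)
EXEC [core].[sp_purge_data2] @duration = 120;

-- Then check how much space the database has reclaimed internally
EXEC sp_spaceused;
```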