September 2, 2011 at 2:47 pm
Hello all,
I'm working on a system that collects data from various configuration and statistics servers and consolidates it.
What I'm trying to do right now is take statistics that are stored by hour, and total them by day and store that information in a different table for reporting purposes.
Sounds easy enough, however there are some issues that need to be addressed.
Due to various factors, occasionally the consolidation job is unable to connect to some of the statistics servers. They are designed to cover the difference the next time the job runs, but that also means that the latest smalldatetime value can differ, server to server. These servers also perform their own consolidation jobs, from smaller intervals to hourly statistics.
The second issue to overcome is to ensure that I am consolidating an entire's day worth of hourly statistics. Unfortunately, I also cannot depend upon the fact that there are 24 hours in each day. Occasionally, these servers may not HAVE statistics for a given hour.
What I can be certain of is that if I have data for, say, 2011-09-01 06:00:00, I also have the data for 2011-09-01 05:00:00 and earlier, if the data exists at all.
With these things in mind, the answer I came up with is to get the latest smalldatetime value per server and use these values for my upper limits.
This is the problem that I am having trouble getting the RBAR out of without it hurting performance.
Here is an example hourly stats table:
CREATE TABLE [dbo].[Hourly_Stats] (
[Server_Id] [tinyint] NOT NULL,
[Device_Id] [int] NOT NULL,
[Data_Timestamp] [smalldatetime] NOT NULL,
[Datapoint_A] [real] NOT NULL,
[Datapoint_B] [real] NOT NULL,
CONSTRAINT [PK_Hourly_Stats] PRIMARY KEY CLUSTERED
(
[Server_Id] ASC,
[Device_Id] ASC, --Device_Id is only unique on a per-server basis.
[Data_Timestamp] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [ix_Server_Timestamps] ON [dbo].[Hourly_Stats]
(
[Server_Id] ASC,
[Data_Timestamp] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
And here is the most efficient method I have found to get the most recent Data_Timestamp value per server:
WITH Latest_DateTime_Per_Server (Server_Id, Data_Timestamp) AS
(
--Using the ORDER BY clause on each table UNION ALL statement only works from within a CTE Declaration.
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 1
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 2
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 3
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 4
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 5
ORDER BY Data_Timestamp DESC
)
SELECT Server_Id, Data_Timestamp FROM Latest_DateTime_Per_Server
This is fast. Average Total Execution time is 16.8. In my situation, I have less than 10 servers, and that number does not change often. However, what If I wanted the latest Data_Timestamp value for each device on each server? This approach does not scale well at all.
The only alternative that I can think of is to group the data by server_id (and device_id if necessary) and get the max value of the Data_Timestamp value. Example:
SELECT Server_Id, MAX(Data_Timestamp)
FROM [dbo].[Hourly_Stats]
GROUP BY Server_Id
This takes much longer, Average Total execution time is 665.8. I cannot think of any other approach to this problem given my current knowledge of SQL. The advice of the experts would certainly be appreciated!
September 2, 2011 at 3:00 pm
how does a row number example perform?
SELECT
Server_Id,
Data_Timestamp
FROM
(SELECT
ROW_NUMBER() OVER (PARTITION BY Server_Id ORDER BY Data_Timestamp DESC) AS RW,
Server_Id,
Data_Timestamp
FROM
[dbo].[Hourly_Stats])MyAlias
WHERE
RW = 1
ORDER BY
Server_Id,
Data_Timestamp DESC
Lowell
September 2, 2011 at 3:11 pm
Fairly slow actually. 🙂 Average for two runs is 19906.0
I should also mention that as of this writing, this table contains over 18 million rows, and that is only expected to increase as time goes on.
September 2, 2011 at 3:40 pm
Can you create a nonclustered index on serverId/timestamp only, and then re-check the max function? That should cure your issue, if not, I'd like to see the execution plan of the MAX().
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA
September 2, 2011 at 3:47 pm
Yeah I've gotta say, I'm surprised your second query performs worse than the first - it's basically doing the same thing, except without the UNION's.
I'd like to see the execution plan if you could post it.
September 2, 2011 at 3:53 pm
The index already exists.
Also, keep in mind that the MAX() Method is only taking 650ms, so it's not taking an excessively long time. It's just that the UNION ALL method is orders of magnitude faster for this particular dataset. It may be that the MAX() method is the best in terms of scalability, and the RBAR method is fastest for a small number of results.
*Edit* Kraig, I read the article on how to post performance problems. Execution plans incoming.
*Edit* Execution Plans uploaded. The field names and table names ARE different, as I try to keep information as generic as possible when posting code in public...
September 2, 2011 at 4:12 pm
Does your full table contain more than just the four serverID's you specified in your example, for the UNION showplan?
This is just a theory, but maybe the reason why it seems like the UNION approach is better, is because you're not actually getting all of the values in the table, and the servers which you are using just happen to not contain very many records.
When you're using the MAX approach, it is going through about 18 000 000 records, in order to aggregate and then find the MAX datestamp.
If your sample of four serverIDs only had a fraction of those records, then it would certainly seem like it wins, but if you extend it across all serverID's, does it still perform orders of magnitude better?
September 2, 2011 at 4:23 pm
There is a Fifth ServerId, but it only has 160,000 rows of the 18 million. There are no other server ID's.
*Edit* I Added the fifth and final ServerId to my UNION ALL code, and there was no discernible effect in time.
September 2, 2011 at 4:37 pm
Have you tried
WHERE ServerId IN (1,2,3,4)
in the group by query just to see if it makes a difference?
It probably wont but worth a try 🙂
September 2, 2011 at 4:58 pm
It had no effect sadly.
September 2, 2011 at 5:14 pm
How about an index on ServerId ASC, DateCreated DESC
Again, just another punt, I don't have ssms to try it out.
September 2, 2011 at 7:16 pm
What is the average number of Device_ID's you have per server? I'm asking because I think I see some major issues with the indexes on this table but don't want to open my yapper until I've done a couple of tests and I need to know how to distribute things on a couple of million rows that I'm going to build.
--Jeff Moden
Change is inevitable... Change for the better is not.
September 3, 2011 at 10:32 am
However, what If I wanted the latest Data_Timestamp value for each device on each server? This approach does not scale well at all.
what sort of times are you currently returning when including each device?
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
September 3, 2011 at 11:16 am
I haven't even attempted it. To do so would be to execute 3600 select statements connected by UNION ALL. I cannot imagine that that would do anything good for performance.
Viewing 15 posts - 1 through 15 (of 34 total)
You must be logged in to reply to this topic. Login to reply