Looking for help taking the RBAR from this problem

Question

Post reply

Looking for help taking the RBAR from this problem

clubbavich

Default port

Points: 1449
More actions
September 2, 2011 at 2:47 pm

#256612

Hello all,
I'm working on a system that collects data from various configuration and statistics servers and consolidates it.
What I'm trying to do right now is take statistics that are stored by hour, and total them by day and store that information in a different table for reporting purposes.
Sounds easy enough, however there are some issues that need to be addressed.
Due to various factors, occasionally the consolidation job is unable to connect to some of the statistics servers. They are designed to cover the difference the next time the job runs, but that also means that the latest smalldatetime value can differ, server to server. These servers also perform their own consolidation jobs, from smaller intervals to hourly statistics.
The second issue to overcome is to ensure that I am consolidating an entire's day worth of hourly statistics. Unfortunately, I also cannot depend upon the fact that there are 24 hours in each day. Occasionally, these servers may not HAVE statistics for a given hour.
What I can be certain of is that if I have data for, say, 2011-09-01 06:00:00, I also have the data for 2011-09-01 05:00:00 and earlier, if the data exists at all.
With these things in mind, the answer I came up with is to get the latest smalldatetime value per server and use these values for my upper limits.
This is the problem that I am having trouble getting the RBAR out of without it hurting performance.
Here is an example hourly stats table:
CREATE TABLE [dbo].[Hourly_Stats] (
[Server_Id] [tinyint] NOT NULL,
[Device_Id] [int] NOT NULL,
[Data_Timestamp] [smalldatetime] NOT NULL,
[Datapoint_A] [real] NOT NULL,
[Datapoint_B] [real] NOT NULL,
CONSTRAINT [PK_Hourly_Stats] PRIMARY KEY CLUSTERED
(
[Server_Id] ASC,
[Device_Id] ASC, --Device_Id is only unique on a per-server basis.
[Data_Timestamp] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [ix_Server_Timestamps] ON [dbo].[Hourly_Stats]
(
[Server_Id] ASC,
[Data_Timestamp] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
And here is the most efficient method I have found to get the most recent Data_Timestamp value per server:
WITH Latest_DateTime_Per_Server (Server_Id, Data_Timestamp) AS
(
--Using the ORDER BY clause on each table UNION ALL statement only works from within a CTE Declaration.
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 1
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 2
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 3
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 4
ORDER BY Data_Timestamp DESC
UNION ALL
SELECT TOP 1 Server_Id, Data_Timestamp
FROM [dbo].[Hourly_Stats]
WHERE Server_Id = 5
ORDER BY Data_Timestamp DESC
)
SELECT Server_Id, Data_Timestamp FROM Latest_DateTime_Per_Server
This is fast. Average Total Execution time is 16.8. In my situation, I have less than 10 servers, and that number does not change often. However, what If I wanted the latest Data_Timestamp value for each device on each server? This approach does not scale well at all.
The only alternative that I can think of is to group the data by server_id (and device_id if necessary) and get the max value of the Data_Timestamp value. Example:
SELECT Server_Id, MAX(Data_Timestamp)
FROM [dbo].[Hourly_Stats]
GROUP BY Server_Id
This takes much longer, Average Total execution time is 665.8. I cannot think of any other approach to this problem given my current knowledge of SQL. The advice of the experts would certainly be appreciated!

Viewing 15 posts - 1 through 15 (of 34 total)

You must be logged in to reply to this topic. Login to reply

Lowell SSC Guru Points: 323518 More actions · Answer 1

how does a row number example perform?

SELECT

Server_Id,

Data_Timestamp

FROM

(SELECT

ROW_NUMBER() OVER (PARTITION BY Server_Id ORDER BY Data_Timestamp DESC) AS RW,

Server_Id,

Data_Timestamp

FROM

[dbo].[Hourly_Stats])MyAlias

WHERE

RW = 1

ORDER BY

Server_Id,

Data_Timestamp DESC

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

clubbavich Default port Points: 1449 More actions · Answer 2

Fairly slow actually. 🙂 Average for two runs is 19906.0

I should also mention that as of this writing, this table contains over 18 million rows, and that is only expected to increase as time goes on.

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 3

Can you create a nonclustered index on serverId/timestamp only, and then re-check the max function? That should cure your issue, if not, I'd like to see the execution plan of the MAX().

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

kramaswamy SSCoach Points: 18181 More actions · Answer 4

Yeah I've gotta say, I'm surprised your second query performs worse than the first - it's basically doing the same thing, except without the UNION's.

I'd like to see the execution plan if you could post it.

clubbavich Default port Points: 1449 More actions · Answer 5

The index already exists.

Also, keep in mind that the MAX() Method is only taking 650ms, so it's not taking an excessively long time. It's just that the UNION ALL method is orders of magnitude faster for this particular dataset. It may be that the MAX() method is the best in terms of scalability, and the RBAR method is fastest for a small number of results.

*Edit* Kraig, I read the article on how to post performance problems. Execution plans incoming.

*Edit* Execution Plans uploaded. The field names and table names ARE different, as I try to keep information as generic as possible when posting code in public...

kramaswamy SSCoach Points: 18181 More actions · Answer 6

Does your full table contain more than just the four serverID's you specified in your example, for the UNION showplan?

This is just a theory, but maybe the reason why it seems like the UNION approach is better, is because you're not actually getting all of the values in the table, and the servers which you are using just happen to not contain very many records.

When you're using the MAX approach, it is going through about 18 000 000 records, in order to aggregate and then find the MAX datestamp.

If your sample of four serverIDs only had a fraction of those records, then it would certainly seem like it wins, but if you extend it across all serverID's, does it still perform orders of magnitude better?

clubbavich Default port Points: 1449 More actions · Answer 7

There is a Fifth ServerId, but it only has 160,000 rows of the 18 million. There are no other server ID's.

*Edit* I Added the fifth and final ServerId to my UNION ALL code, and there was no discernible effect in time.

rob.symonds SSC Veteran Points: 206 More actions · Answer 8

Have you tried

WHERE ServerId IN (1,2,3,4)

in the group by query just to see if it makes a difference?

It probably wont but worth a try 🙂

clubbavich Default port Points: 1449 More actions · Answer 9

clubbavich

Default port

Points: 1449

September 2, 2011 at 4:58 pm

#1376720

It had no effect sadly.

rob.symonds SSC Veteran Points: 206 More actions · Answer 10

How about an index on ServerId ASC, DateCreated DESC

Again, just another punt, I don't have ssms to try it out.

Jeff Moden SSC Guru Points: 1004339 More actions · Answer 11

@clubbavich,

What is the average number of Device_ID's you have per server? I'm asking because I think I see some major issues with the indexes on this table but don't want to open my yapper until I've done a couple of tests and I need to know how to distribute things on a couple of million rows that I'm going to build.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

clubbavich Default port Points: 1449 More actions · Answer 12

@jeff

Here is how the devices are distributed.

Server_Id, Device_Id

1 has 1540 Device_Id's

2 has 742 Device_Id's

3 has 357 Device_Id's

4 has 26 Device_Id's

5 has 917 Device_Id's

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 13

However, what If I wanted the latest Data_Timestamp value for each device on each server? This approach does not scale well at all.

what sort of times are you currently returning when including each device?

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

clubbavich Default port Points: 1449 More actions · Answer 14

I haven't even attempted it. To do so would be to execute 3600 select statements connected by UNION ALL. I cannot imagine that that would do anything good for performance.