Looking for help taking the RBAR from this problem

  • Hello all,

    I'm working on a system that collects data from various configuration and statistics servers and consolidates it.

    What I'm trying to do right now is take statistics that are stored by hour, and total them by day and store that information in a different table for reporting purposes.

    Sounds easy enough, however there are some issues that need to be addressed.

    Due to various factors, occasionally the consolidation job is unable to connect to some of the statistics servers. They are designed to cover the difference the next time the job runs, but that also means that the latest smalldatetime value can differ, server to server. These servers also perform their own consolidation jobs, from smaller intervals to hourly statistics.

    The second issue to overcome is to ensure that I am consolidating an entire's day worth of hourly statistics. Unfortunately, I also cannot depend upon the fact that there are 24 hours in each day. Occasionally, these servers may not HAVE statistics for a given hour.

    What I can be certain of is that if I have data for, say, 2011-09-01 06:00:00, I also have the data for 2011-09-01 05:00:00 and earlier, if the data exists at all.

    With these things in mind, the answer I came up with is to get the latest smalldatetime value per server and use these values for my upper limits.

    This is the problem that I am having trouble getting the RBAR out of without it hurting performance.

    Here is an example hourly stats table:

    CREATE TABLE [dbo].[Hourly_Stats] (

    [Server_Id] [tinyint] NOT NULL,

    [Device_Id] [int] NOT NULL,

    [Data_Timestamp] [smalldatetime] NOT NULL,

    [Datapoint_A] [real] NOT NULL,

    [Datapoint_B] [real] NOT NULL,

    CONSTRAINT [PK_Hourly_Stats] PRIMARY KEY CLUSTERED

    (

    [Server_Id] ASC,

    [Device_Id] ASC, --Device_Id is only unique on a per-server basis.

    [Data_Timestamp] ASC

    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    CREATE NONCLUSTERED INDEX [ix_Server_Timestamps] ON [dbo].[Hourly_Stats]

    (

    [Server_Id] ASC,

    [Data_Timestamp] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    And here is the most efficient method I have found to get the most recent Data_Timestamp value per server:

    WITH Latest_DateTime_Per_Server (Server_Id, Data_Timestamp) AS

    (

    --Using the ORDER BY clause on each table UNION ALL statement only works from within a CTE Declaration.

    SELECT TOP 1 Server_Id, Data_Timestamp

    FROM [dbo].[Hourly_Stats]

    WHERE Server_Id = 1

    ORDER BY Data_Timestamp DESC

    UNION ALL

    SELECT TOP 1 Server_Id, Data_Timestamp

    FROM [dbo].[Hourly_Stats]

    WHERE Server_Id = 2

    ORDER BY Data_Timestamp DESC

    UNION ALL

    SELECT TOP 1 Server_Id, Data_Timestamp

    FROM [dbo].[Hourly_Stats]

    WHERE Server_Id = 3

    ORDER BY Data_Timestamp DESC

    UNION ALL

    SELECT TOP 1 Server_Id, Data_Timestamp

    FROM [dbo].[Hourly_Stats]

    WHERE Server_Id = 4

    ORDER BY Data_Timestamp DESC

    UNION ALL

    SELECT TOP 1 Server_Id, Data_Timestamp

    FROM [dbo].[Hourly_Stats]

    WHERE Server_Id = 5

    ORDER BY Data_Timestamp DESC

    )

    SELECT Server_Id, Data_Timestamp FROM Latest_DateTime_Per_Server

    This is fast. Average Total Execution time is 16.8. In my situation, I have less than 10 servers, and that number does not change often. However, what If I wanted the latest Data_Timestamp value for each device on each server? This approach does not scale well at all.

    The only alternative that I can think of is to group the data by server_id (and device_id if necessary) and get the max value of the Data_Timestamp value. Example:

    SELECT Server_Id, MAX(Data_Timestamp)

    FROM [dbo].[Hourly_Stats]

    GROUP BY Server_Id

    This takes much longer, Average Total execution time is 665.8. I cannot think of any other approach to this problem given my current knowledge of SQL. The advice of the experts would certainly be appreciated!

  • how does a row number example perform?

    SELECT

    Server_Id,

    Data_Timestamp

    FROM

    (SELECT

    ROW_NUMBER() OVER (PARTITION BY Server_Id ORDER BY Data_Timestamp DESC) AS RW,

    Server_Id,

    Data_Timestamp

    FROM

    [dbo].[Hourly_Stats])MyAlias

    WHERE

    RW = 1

    ORDER BY

    Server_Id,

    Data_Timestamp DESC

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Fairly slow actually. 🙂 Average for two runs is 19906.0

    I should also mention that as of this writing, this table contains over 18 million rows, and that is only expected to increase as time goes on.

  • Can you create a nonclustered index on serverId/timestamp only, and then re-check the max function? That should cure your issue, if not, I'd like to see the execution plan of the MAX().


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Yeah I've gotta say, I'm surprised your second query performs worse than the first - it's basically doing the same thing, except without the UNION's.

    I'd like to see the execution plan if you could post it.

  • The index already exists.

    Also, keep in mind that the MAX() Method is only taking 650ms, so it's not taking an excessively long time. It's just that the UNION ALL method is orders of magnitude faster for this particular dataset. It may be that the MAX() method is the best in terms of scalability, and the RBAR method is fastest for a small number of results.

    *Edit* Kraig, I read the article on how to post performance problems. Execution plans incoming.

    *Edit* Execution Plans uploaded. The field names and table names ARE different, as I try to keep information as generic as possible when posting code in public...

  • Does your full table contain more than just the four serverID's you specified in your example, for the UNION showplan?

    This is just a theory, but maybe the reason why it seems like the UNION approach is better, is because you're not actually getting all of the values in the table, and the servers which you are using just happen to not contain very many records.

    When you're using the MAX approach, it is going through about 18 000 000 records, in order to aggregate and then find the MAX datestamp.

    If your sample of four serverIDs only had a fraction of those records, then it would certainly seem like it wins, but if you extend it across all serverID's, does it still perform orders of magnitude better?

  • There is a Fifth ServerId, but it only has 160,000 rows of the 18 million. There are no other server ID's.

    *Edit* I Added the fifth and final ServerId to my UNION ALL code, and there was no discernible effect in time.

  • Have you tried

    WHERE ServerId IN (1,2,3,4)

    in the group by query just to see if it makes a difference?

    It probably wont but worth a try 🙂

  • It had no effect sadly.

  • How about an index on ServerId ASC, DateCreated DESC

    Again, just another punt, I don't have ssms to try it out.

  • @clubbavich,

    What is the average number of Device_ID's you have per server? I'm asking because I think I see some major issues with the indexes on this table but don't want to open my yapper until I've done a couple of tests and I need to know how to distribute things on a couple of million rows that I'm going to build.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @jeff

    Here is how the devices are distributed.

    Server_Id, Device_Id

    1 has 1540 Device_Id's

    2 has 742 Device_Id's

    3 has 357 Device_Id's

    4 has 26 Device_Id's

    5 has 917 Device_Id's

  • However, what If I wanted the latest Data_Timestamp value for each device on each server? This approach does not scale well at all.

    what sort of times are you currently returning when including each device?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • I haven't even attempted it. To do so would be to execute 3600 select statements connected by UNION ALL. I cannot imagine that that would do anything good for performance.

Viewing 15 posts - 1 through 15 (of 34 total)

You must be logged in to reply to this topic. Login to reply