How do I get an average of the top two values in a result set ?

  • Todd, I just realized my solution will not work if there is only one instance of any MemberID. I can't test it to be sure right now, but you should beware. I'll see if I can remedy it.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Todd,

    Did you try my correlated subquery solution? I believe it gives the results you require. If you have a large number of rows in your table and performance is an issue, an index on the columns: MemberId, LabDate might be useful.

    SELECT MLROUTER.MemberId, AVG(MLROUTER.LabValue)

    FROM MemberLabResult MLROUTER

    WHERE MLROUTER.MemberLabId IN (

    SELECT TOP 2 MLRINNER.MemberLabId

    FROM MemberLabResult MLRINNER

    WHERE (MLRINNER.LabId = 1)

    AND (MLRINNER.MemberId = MLROUTER.MemberId)

    ORDER BY MLRINNER.LabDate DESC

    )

    GROUP BY MLROUTER.MemberId

    HAVING AVG(MLROUTER.LabValue) <> 4

  • andrewd.smith (4/15/2009)


    Todd,

    Did you try my correlated subquery solution? I believe it gives the results you require. If you have a large number of rows in your table and performance is an issue, an index on the columns: MemberId, LabDate might be useful.

    SELECT MLROUTER.MemberId, AVG(MLROUTER.LabValue)

    FROM MemberLabResult MLROUTER

    WHERE MLROUTER.MemberLabId IN (

    SELECT TOP 2 MLRINNER.MemberLabId

    FROM MemberLabResult MLRINNER

    WHERE (MLRINNER.LabId = 1)

    AND (MLRINNER.MemberId = MLROUTER.MemberId)

    ORDER BY MLRINNER.LabDate DESC

    )

    GROUP BY MLROUTER.MemberId

    HAVING AVG(MLROUTER.LabValue) <> 4

    It works for me, and does not seem to care if there is only one record for any given MemberID.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Andrew -

    I couldn't get the last sample to work but the one you last

    posted works great. I'll pop that into my sproc today.

    I tossed a "having count(*) = 2" on the end to ensure that

    we only take labs with two entries and it's good to go.

    Thanks for everyone's help - this one was a bugger !!

    Todd

  • Todd, it sounds like you are all set with Andrew's code, but I did not want to leave mine out there with a serious flaw. I have fixed it so it does not matter if there is an instance with only one record. So, for what it's worth, if you want to try it, you should be good to go.

    SELECT

    MemberID,

    AVGLabValue = AVG(LabValue)

    FROM

    (

    SELECT

    l.*

    FROM #lab l,

    (

    SELECT

    ID = MIN(id),

    MemberID

    FROM #lab

    GROUP BY MemberID

    ) t1

    WHERE t1.MemberID = l.MemberID --Added this

    AND (t1.id = l.id

    OR t1.id +1 = l.id)

    ) t2

    GROUP BY t2.MemberID

    HAVING AVG(LabValue) 4

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Todd Young (4/16/2009)


    Andrew -

    I couldn't get the last sample to work but the one you last

    posted works great. I'll pop that into my sproc today.

    I tossed a "having count(*) = 2" on the end to ensure that

    we only take labs with two entries and it's good to go.

    Thanks for everyone's help - this one was a bugger !!

    Todd

    I think the reason, my query didn't work first time around was due to the brackets in the TOP (2) clause. This is not a problem in SQL Server 2005 (on which I tested it), but for SQL Server 2000 the brackets need to be removed for the query to be parsed successfully.

    Thanks for the feedback.

    --Andrew

  • Andrew, please understand that I am not picking on you or your code, but I am trying to learn a little bit about performance, since some of the tables in my database are getting large, and code that used to perform well is now tanking, and I have to re-write a lot of it. With that being said, I test both of our solutions against one of Jeff Moden's million row test tables, which can be found at http://www.sqlservercentral.com/articles/T-SQL/63681/. Once the table was created, and I named it 'test', I added the following indexes...

    ALTER TABLE dbo.Test ADD CONSTRAINT

    PK_Test PRIMARY KEY CLUSTERED

    (

    RowNum

    ) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX IX_Company ON dbo.Test

    (

    Company

    ) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX IX_Date ON dbo.Test

    (

    Date

    ) ON [PRIMARY]

    Then I altered both of our code to work on the test table, just to see what would happen. I found that my temp table solution took around 7 seconds, as opposed to the correlated sub query taking 1 minute and 18 seconds. I am still trying to figure out how to interpret the execution plan, to see where the differences are, but I thought I would post the altered code if you wanted to take a look.

    First, the correlated sub query

    SELECT

    MLROUTER.Company,

    AVG(MLROUTER.Amount)

    FROM Test MLROUTER

    WHERE MLROUTER.RowNum IN

    (

    SELECT TOP 2 MLRINNER.RowNum

    FROM Test MLRINNER

    WHERE (MLRINNER.Company = MLROUTER.Company)

    ORDER BY MLRINNER.Date DESC

    )

    GROUP BY MLROUTER.Company

    Then the temp table solution

    IF OBJECT_ID('TempDB..#test','u') IS NOT NULL

    DROP TABLE #test

    SELECT

    IDENTITY(INT,1,1) AS ID,

    Company,

    Date,

    Amount

    INTO #test

    FROM test

    ORDER BY Company,Date DESC

    SELECT

    Company,

    AVG_Val = AVG(Amount)

    FROM

    (

    SELECT

    t.*

    FROM #test t,

    (

    SELECT

    ID = MIN(id),

    Company

    FROM #test

    GROUP BY Company

    ) t1

    WHERE t1.Company = t.Company --Added this

    AND (t1.id = t.id

    OR t1.id + 1 = t.id)

    ) t2

    GROUP BY t2.Company

    Again, I'm not trying to stir anything up, just trying to learn. I can't tell you how many correlated sub queries I've had to fix, because when I wrote them they worked fine on small data sets.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Greg,

    No offense taken. I've also had a few performance issues with correlated subqueries in the past. I was intimating this by this statement in a previous post:

    If you have a large number of rows in your table and performance is an issue, an index on the columns: MemberId, LabDate might be useful.

    I might have a look at an alternative table index this evening. Of course it's the OP's choice as to whether performance is a major consideration, and which solution is more suitable.

    --Andrew

  • Greg,

    I set up some test data with 1000000 rows - 1000 rows for each of 1000 distinct values of Company column and investigated the effect of the following 3 indexing schemes on the 2 queries:

    ALTER TABLE dbo.Test ADD CONSTRAINT

    PK_Test PRIMARY KEY CLUSTERED (RowNum)

    CREATE NONCLUSTERED INDEX IX_Test_Company_Date

    ON dbo.Test(Company, Date)

    ALTER TABLE dbo.Test ADD CONSTRAINT

    PK_Test PRIMARY KEY NONCLUSTERED (RowNum)

    CREATE CLUSTERED INDEX IX_Test_Company_Date

    ON dbo.Test(Company, Date)

    ALTER TABLE dbo.Test ADD CONSTRAINT

    PK_Test PRIMARY KEY CLUSTERED (RowNum)

    CREATE NONCLUSTERED INDEX IX_Test_Company

    ON dbo.Test(Company)

    CREATE NONCLUSTERED INDEX IX_Test_Date

    ON dbo.Test(Date)

    In all cases, your query completed in between 12-16 seconds on my development machine. My query took 62 seconds using the first of the indexing schemes above, and 10 seconds using the second scheme with the clustered index on the Company and Date columns. The performance of my query using the 3rd indexing scheme was so bad that I cancelled the query after 5 minutes with no results returned.

    Therefore, a combination index on the 2 columns Company and Date benefits the correlated subquery performance much more than separate indexes on the 2 columns, and that if this index can be made the clustered index, the performance is similar or slightly better than your temporary table method.

    Clearly the performance of the correlated subquery method is highly dependent on choosing the correct indexes, whereas your method is relatively independent of this, which is not surprising since it's a temporary table solution.

  • Andrew, thanks for looking further into it. I'm not too savy at how different index schemes affect performance, but clearly this is something I need to think about. Thanks for giving me something to think about.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • andrewd.smith (4/16/2009)


    ALTER TABLE dbo.Test ADD CONSTRAINT

    PK_Test PRIMARY KEY CLUSTERED (RowNum)

    CREATE NONCLUSTERED INDEX IX_Test_Company_Date

    ON dbo.Test(Company, Date)

    ALTER TABLE dbo.Test ADD CONSTRAINT

    PK_Test PRIMARY KEY NONCLUSTERED (RowNum)

    CREATE CLUSTERED INDEX IX_Test_Company_Date

    ON dbo.Test(Company, Date)

    ALTER TABLE dbo.Test ADD CONSTRAINT

    PK_Test PRIMARY KEY CLUSTERED (RowNum)

    CREATE NONCLUSTERED INDEX IX_Test_Company

    ON dbo.Test(Company)

    CREATE NONCLUSTERED INDEX IX_Test_Date

    ON dbo.Test(Date)

    Andrew, I have two questions. I'm sitting here trying to get my head around this, and I don't understand what the benefit of having a nonclustered PK would be. Is it only to enforce the unique property of the field? Also, what would happen if you had a clustered index on Company,Date, and there was more than one instance of a combination Company/date? It seems like that would not be allowed. And, I had no idea you can put multiple columns in an index, I'm salivating at the possibilites. Thanks again.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Andrew, now I am very confused. I tried adding a clustered index on a test table on two columns, then inserted what I knew to be a duplicate record, thinking it would throw it out. However, it took it no problem. Now, I should note, that my laptop has SQL 2008, and BOL says that for duplicate entries of a clustered index, it adds a 'uniquefier', which is invisible to the user. I am still not clear on exactly what the indexes do. BOL makes me think the purpose of a clustered index is to sort the rows on the data pages, with the clustered index value serving as the row ID, and thus the restriction to only 1 clustered index. If this is the case, then what does the nonclustered index do? Any way you can explain it in laymen's terms? I will try it again tomorrow on my work server that has SQL 2000, and I am thinking it will not let me insert a duplicate record on a clustered index, but like I said, I am truly confused. Thanks.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • I'm sitting here trying to get my head around this, and I don't understand what the benefit of having a nonclustered PK would be. Is it only to enforce the unique property of the field? Also, what would happen if you had a clustered index on Company,Date, and there was more than one instance of a combination Company/date? It seems like that would not be allowed. And, I had no idea you can put multiple columns in an index

    The benefit of having a nonclustered PK is that it allows you to have a clustered index on other columns that might produce a greater benefit than a clustered primary key would. The point is that you can only have one clustered index on a table, as the clustered index defines the physical ordering of the rows. A clustered index does not have to be unique though in general it should be highly selective, i.e. not many duplicates, else you won't get much benefit from it.

    In cases where the application uses the PK to retrieve individual rows and where the order of the rows when sorted by the PK does not have much intrinsic meaning to the application, there may be a better choice of clustered index. For instance, if one of the most common queries executed by an application is to retrieve a set of rows based on a date range, then including the relevant datetime column in the clustered index may produce a significant performance improvement that outweighs the extra bookmark lookups that may be required to retrieve rows based on a nonclustered PK.

    And yes, you can have multiple columns in an index just as you can have multiple columns in a primary key. Also, if a nonclustered index contains all the columns to be retrieved for a particular query, the query optimizer can avoid the overhead of a bookmark lookup - this is the concept of a "covering index".

    There's plenty of tips on index optimization on this site and others, as well as Books Online. I certainly don't claim to be an expert on it.

  • Thank you Andrew. Your explaination made it start to stick, more than my reading BOL last night. I guess I can lurk in the performance tuning forum for more. Thanks again.

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • I know this has been solved but there's always more than one way to skin a cat and I need to keep my SQL skills up to date.

    So here's my solution..

    SELECTM1.MemberID,

    AVG(M1.Labvalue)

    FROM#MemberLabResult M1

    WHEREM1.LabID = 1

    ANDM1.Labdate IN

    (SELECTLabDate

    FROM#MemberLabResult

    WHEREMemberID = M1.MemberID

    ANDLabdate = (SELECT MAX(LabDate) FROM #MemberLabResult WHERE MemberID = M1.MemberID)

    ORLabdate = (SELECT MAX(LabDate) FROM #MemberLabResult WHERE MemberID = M1.MemberID

    AND Labdate < (SELECT MAX(LabDate) FROM #MemberLabResult WHERE MemberID = M1.MemberID))

    )

    GROUP BY

    M1.MemberID

    HAVINGAVG(M1.Labvalue) != 4

    Or...

    SELECTM1.MemberID,

    AVG(M1.Labvalue)

    FROM#MemberLabResult M1

    WHEREM1.LabID = 1

    ANDM1.Labdate IN

    (SELECTTOP 2 LabDate

    FROM#MemberLabResult

    WHEREMemberID = M1.MemberID

    ORDER BY

    labDate DESC

    )

    GROUP BY

    M1.MemberID

    HAVINGAVG(M1.Labvalue) != 4

    ..Or, if you don't want to use IN (can be inefficient) then:

    SELECTM1.MemberID,

    AVG(M1.Labvalue)

    FROM#MemberLabResult M1

    WHEREM1.LabID = 1

    AND

    (M1.Labdate =

    (SELECTMAX(LabDate)

    FROM#MemberLabResult

    WHEREMemberID = M1.MemberID

    ANDLabdate = (SELECT MAX(LabDate) FROM #MemberLabResult WHERE MemberID = M1.MemberID)

    )

    ORM1.Labdate =

    (SELECTMAX(LabDate)

    FROM#MemberLabResult

    WHEREMemberID = M1.MemberID

    ANDLabdate = (SELECT MAX(LabDate) FROM #MemberLabResult WHERE MemberID = M1.MemberID

    AND Labdate < (SELECT MAX(LabDate) FROM #MemberLabResult WHERE MemberID = M1.MemberID))

    )

    )

    GROUP BY

    M1.MemberID

    HAVINGAVG(M1.Labvalue) != 4

Viewing 15 posts - 16 through 29 (of 29 total)

You must be logged in to reply to this topic. Login to reply