best practice for returning when matching a MAX()

  • I want to return an different column for each row having a MAX(). There doesn't seem to be an intuitive way to do this. For example:

    select topicID, lastPosterID

    where postDate=MAX(postDate)

    the problem comes in grouping. SQL wants you to group by everything in the select list, and you can't just grab the other data on that row. This forces you to join back, like so:

    select topicID, lastPosterID

    from tbl

    join (

    select topicID, max(postDate)

    group by topicID

    ) "match" on topicID=topicID and postDate=postDate

    Is this how it *has to* be, or is there a better way? Can MAX() OVER() be leveraged somehow?

  • There's a method of converting everything to strings, then finding the MAX, then separating out substrings and casting back as necessary. It can be rather cumbersome for multiple columns, however. It should look something like this, I think:

    SELECT topicID, lastPosterID =

        SUBSTRING (

          MAX( CONVERT(CHAR(23), postDate, 121)

              + CAST(lastPosterID AS CHAR(10))

             ), 24, 10)

    FROM tbl

    GROUP BY topicID

     

    Tell me if it works for you. I forgot the name of the guy who told me this technique (in a $99 day seminar in 1995 about aggregating along trees...he has a book on that subject).

     

  • SELECT     d.TopicID,

               d.LastPosterID,

               d.PostDate

    FROM       (

                          SELECT     TopicID,

                                     LastPosterID,

                                     PostDate,

                                     ROW_NUMBER() OVER (PARTITION BY TopicID ORDER BY PostDate DESC) AS RecID

                          FROM       Tbl

               ) AS d

    WHERE      d.RecID = 1

    ORDER BY   d.TopicID

     


    N 56°04'39.16"
    E 12°55'05.25"

  • Aaron, that's an interesting approach. Basically, you're appending the data you want to use to the end of the field you're sorting on. Once sorted, you simply break off the piece you need. I wonder how you'd extend it to include a third field and how to sort in different directions. Sometimes "old-school" shouldn't be discounted.

    Peter, yeah, that's one way. I can't help thinking that it should be even simpler, though. I can't believe that I'm the only one to need this from time to time.

  • For multiple fields, you could retrieve the concatenated chunk in a subquery, then break it into pieces in the parent query.

    Instead of Peter's technique, I think you could use TOP 1 somehow...

    Results 1 - 10 of about 218,000,000 for top 1 for each group

    #2 http://expertanswercenter.techtarget.com/eac/knowledgebaseAnswer/0,295199,sid63_gci980115,00.html

     

  • select topicID, lastPosterID
    from tbl
    where postDate = (select max(postDate) from tbl)

    The problem is that, unless you truncate the dates when entering them into the table, you will still only return the last entry made. If you want to see every entry made on the same day as the last entry made:

    select topicID, lastPosterID
    from tbl
    where DateDiff( dd, postDate, (select max(postDate) from tbl)) = 0

    This could be a killer for efficiency. If the postDate field is not indexed, it would end up doing do two complete table scans. If it is indexed, it would still do one.

    If this is a large table and this is a query you make often, there are a couple of possibilities.

    • truncate the date when you insert it:
      insert into tbl (..., postDate, ...)
      values(..., DateAdd( DateDiff( dd, @PostDate, 0), 0), ...)

    • or, if you want to keep the time value, create another field to hold the truncated value:
      insert into tbl (..., postDate, postDay, ...)
      values(..., @PostDate, DateAdd( DateDiff( dd, @PostDate, 0), 0), ...)

      Then you can substitute postDay for postDate in my first query above.

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

  • Thanks for your reply, Tomm. But you seem to be missing the point of the GROUP BY -- the idea is if I have, say 5 topics, I want the latest one in each. The essence of your query (and most other suggestions) is a subquery, essentially doing the lookup twice. I was wondering if there's a more efficient way.

  • Maybe you could post some sample data and a sample of what you want the output to look like.

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

  • Try this, you're looking for a derived table:

    select

    topicid, lastposterid, postdate

    from tbl

    inner join (

    select topicid, max(postdate) as maxpost from tbl group by topicid) lastpost on lastpost.topicid = tbl.topicid and tbl.postdate = lastpost.maxpost

    Joe

  • Not exactly a model of efficiency, but here goes:

     

    select

    topicID, lastPosterID, postdate

    from tbl

    where convert(char(10),postDate,101) in

    (select max(convert(char(10),postDate,101)) from tbl)

  • Very interesting approach, so I put it to the test to see how efficient it really was.  Sorry to say, it wasn't. 

    I did the following:

    Table DDL:  -- FYI the table has roughtly 200k rows.

    CREATE TABLE [tblApplicationLog](

     [FromApplication] [varchar](50) NOT NULL,

     [EntryTime] [datetime] NOT NULL,

    ...

    CREATE CLUSTERED INDEX [IX_tblApplicationLog] ON [tblApplicationLog]

    (

     [FromApplication] ASC,

     [WhenEntered] ASC

    )

    --Standard Approach to the problem

    select * from

    tblApplicationLog A

    Inner Join ( select FromApplication,max(EntryTime) MET from tblApplicationLog

    group by FromApplication

    ) B

    ON B.FromApplication = A.FromApplication

    and B.MET = A.EntryTime

    -- Your Approach

    SELECT * FROM (

    SELECT ROW_NUMBER() OVER (PARTITION BY FromApplication ORDER BY EntryTime DESC) RN,

    * FROM tblApplicationLog) B

    WHERE B.RN=1

    ORDER BY B.FromApplication

    -- Alternate approach (random rows returned from another post-- This I really liked).

    SELECT * FROM (

    SELECT ROW_NUMBER() OVER (PARTITION BY FromApplication ORDER BY NEWID() DESC) RN,

    * FROM tblApplicationLog) B

    WHERE B.RN=1

    ORDER BY B.FromApplication

    Results. 

    Roughly 375ms each for the two row_number() methods, and only 6ms for the standard approach. 

    Another issue, is the actual query returns 21 rows (some rows with the same EntryTime) but your approach only returns 17 

    I still may find a use for that technique, and it is a good one to have in my bag of tricks, but the best way to do a max, don't think so.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply