Get related fields from record containing MAX()

  • I have trouble believing this isn't possible, but searches all over the net have uncovered nothing helpful.

    Succinctly, I want to group by field A, find the MAX() of field B in each 'A' group, then additionally select field C in the record that contains that MAX from each group. The first two steps are simple and work fine. The third step is driving me batty – all I've found are ways to do self-joins on the 'B' field, which drive the execution time from instant to over six minutes, on a not-very-large dataset.

    Is this really such a problem? Most of the tutorials I've located give the example of AVG() as to why mixing group and non-group fields in a SELECT are impossible. There, I can't help but agree – AVG() is a composite over the entire group and it makes no sense to try associating it with a particular record within the group. MAX(), however, is not – there IS a specific record that contains that exact value and I want that record. SQL Server already had that record in its hands once, when it found that MAX() value. Since it has that record, isn't there some simple way to get other fields from that record?

  • Have you tried http://www.sqlservercentral.com/articles/Best+Practices/61537/?

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Self joins of one type or another are generally the way I've done this. It's easy to write a 'triangular join' though.

    There are a few ways to do it - if you post your query & some DDL for the table (inc indexes) we might be able to help with the execution time 🙂

  • Thank you, Gazareth, I do know something about indexing and such. If I get stuck on that, I'll start a new post. My question was about the -CONCEPT- of whether it's really impossible to get fields from particular record once SQL Server has located it, just because the path used to locate it was through a grouping clause. Do you know if that is in fact true, and if so, why? I know there are cases of 'just because', but something this fundamental usually has at least an attempt at some sort of logic behind the design decisions.

  • ... whether it's really impossible to get fields from particular record once SQL Server has located it, just because the path used to locate it was through a grouping clause...

    Something wrong with CONCEPT here:

    1. SQL Server can easily "get fields" from record which has been found ("located")

    2. You cannot locate the record "through a grouping clause".

    That's why relevant DDL, data sample and expected output would help to understand your conceptional question...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • pdanes (5/9/2012)


    Thank you, Gazareth, I do know something about indexing and such. If I get stuck on that, I'll start a new post. My question was about the -CONCEPT- of whether it's really impossible to get fields from particular record once SQL Server has located it, just because the path used to locate it was through a grouping clause. Do you know if that is in fact true, and if so, why? I know there are cases of 'just because', but something this fundamental usually has at least an attempt at some sort of logic behind the design decisions.

    Fair point. I'm not sure exactly how SQL crunches the data, hopefully someone with more knowledge that I will be along!

    At a guess, I'd say SQL doesn't actually have the record (I'm assuming here record == row) with the max value.

    The GROUP BY clause is evaluated before the SELECT, so by that time SQL only has the value of MAX(field B), not a pointer back to the row?

    And if there's duplicate MAX values of field B, which record will SQL choose then?

    All guesswork, but I'd be interested to know the answer!

    Cheers

  • Good point about the duplicate maximums - I hadn't thought of that. It can't happen in this dataset, but in the general case it's certainly a possibility. I suppose you're right, in that SQL Server doesn't keep track of the record where it found the MAX value, just the value itself, and any associated grouping values.

    In any case, I used the self-join, played with the phrasing a little and got my response time back to instant, so I'm in business and this question is mostly academic now. Maybe someone who knows their way around the insides of the query engine will enlighten us. Thanks for the help.

  • Try this:

    IF OBJECT_ID('tempdb..#t') IS NOT NULL

    DROP TABLE #t

    CREATE TABLE #T

    (

    ColA INT

    ,ColB INT

    ,ColC INT

    )

    INSERT INTO #T

    SELECT 1 , 1, 10

    UNION ALL SELECT 1 , 2, 11

    UNION ALL SELECT 1 , 3, 12

    UNION ALL SELECT 2 , 6, 20

    UNION ALL SELECT 2 , 8, 21

    UNION ALL SELECT 2 , 7, 22

    UNION ALL SELECT 3 , 4, 31

    UNION ALL SELECT 3 , 4, 32

    UNION ALL SELECT 3 , 3, 33

    ; WITH CTE AS

    (

    SELECT T.ColA

    ,T.ColB

    ,T.ColC

    ,RN = RANK() OVER (PARTITION BY T.ColA ORDER BY T.ColB DESC )

    FROM #T T

    )

    SELECT *

    FROM CTE

    WHERE CTE.RN = 1

    From what i read from your description, this may be what you are looking for. Else, let us know with sample data, create table structure (as i have done above) along with expected results, we will work on your request.

  • Thanks, Coffee, I did run across some examples on the web using 'OVER', but didn't really understand them. Your example looks quite clear - I will study it and try it out. In the meantime, though, I have a working query, with instantaneous response time, as I wrote when I answered Gazareth, so I don't really need to burden this board with details of my setup.

    My question was more about principles of functionality anyway, i. e. -in general- getting additional fields from a record located by a grouping clause, rather than "How do I make this particular query go faster?"

    I think Gazareth hit it, when he mentioned the possibility of duplicate maximum values - the grouping clause can't be certain that it has a unique record, only that it has an INSTANCE of the maximum value. Since that value cannot guaranteed to be unique (although it is in my case), I can't expect that SQL Server will pin down any particular record and hand it back to me.

    The self-join still seems a bit clunky to me, but just changing the phrasing of the query (no additional indexes or anything like that) apparently enabled the optimizer to get a grip on what I wanted, since the response time dropped from over six minutes to essentially zero. That's good enough for me, but I'm always interested in expanding my skills, so I will definitely study and experiment with your version - thank you for that.

    ***** edit *****

    In fact, looking at (and running) your code just now, it illustrates exactly that duplicate issue - Column B has two instances of the MAX value of 4, and the OVER expression returns them both. It's not an issue for me, since my dataset has no duplicates in that field, but it's probably why 'classic' SQL doesn't allow it. Again, many thanks for the sample code.

  • pdanes (5/9/2012)


    In fact, looking at (and running) your code just now, it illustrates exactly that duplicate issue - Column B has two instances of the MAX value of 4, and the OVER expression returns them both. It's not an issue for me, since my dataset has no duplicates in that field, but it's probably why 'classic' SQL doesn't allow it. Again, many thanks for the sample code.

    You can have a look at the ROW_NUMBER() function. It is similar to RANK() function.

    In the example given ColdCoffee, replacing RANK by ROW_NUMBER will give you only one entry


    Kingston Dhasian

    How to post data/code on a forum to get the best help - Jeff Moden
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • I tried it and I do get only three rows, but what determines which one is returned by the RowNumber function? Or rather, what does the RowNumber function mean? It's obviously some sort of ordinal, but based on what? There is no pattern to it that I can see, at least not in the dataset that Coffee provided with his code.

  • pdanes (5/10/2012)


    I tried it and I do get only three rows, but what determines which one is returned by the RowNumber function? Or rather, what does the RowNumber function mean? It's obviously some sort of ordinal, but based on what? There is no pattern to it that I can see, at least not in the dataset that Coffee provided with his code.

    ,RN = RANK() OVER (PARTITION BY T.ColA ORDER BY T.ColB DESC )

    The Rank or Row_Number is based on what you provide in the OVER clause, in this case ranking will restart at each change in ColA, and will order the rows within the Rank by ColB (descending).

    BOL states:

    Determines the partitioning and ordering of a rowset before the associated window function is applied. That is, the OVER clause defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.

    [font="Courier New"]Looking for a Deadlock Victim Support Group..[/font]
  • pdanes (5/10/2012)


    I tried it and I do get only three rows, but what determines which one is returned by the RowNumber function? Or rather, what does the RowNumber function mean? It's obviously some sort of ordinal, but based on what? There is no pattern to it that I can see, at least not in the dataset that Coffee provided with his code.

    You can look up ROW_NUMBER, RANK, DENSE_RANK functions in Books Online or Google

    I am sure you will get lots of articles and examples on the same which can't be provided by us in forums like these

    ColdCoffee provided a sample script to describe your problem which may not be accurate as per your requirements

    If you provide a similar script describing your requirements, it will be easier for us to give you a tested solution

    You can check the link given in my signature to know how to provide sample data.


    Kingston Dhasian

    How to post data/code on a forum to get the best help - Jeff Moden
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • So far, in my experience, the solution which has performed quickest for this kind of operation is the self join with a proper covering index, it just always seems to perform quicker than the windowed function, largely becaus of the sort operation required when doing it that way.

    Follow me on twitter @EvoDBACheck out my blog Natural Selection DBA[/url]

  • You can look up ROW_NUMBER, RANK, DENSE_RANK functions in Books Online or Google

    I am sure you will get lots of articles and examples on the same which can't be provided by us in forums like these

    ColdCoffee provided a sample script to describe your problem which may not be accurate as per your requirements

    If you provide a similar script describing your requirements, it will be easier for us to give you a tested solution

    You can check the link given in my signature to know how to provide sample data.

    Regards,

    Kingston

    I fiddled with Coffee's code a little, and among other things, tried this:

    IF OBJECT_ID('tempdb..#t') IS NOT NULL

    DROP TABLE #t

    CREATE TABLE #T

    (

    ColA INT

    ,ColB INT

    ,ColC INT

    )

    INSERT INTO #T

    SELECT 1 , 1, 10

    UNION ALL SELECT 1 , 2, 11

    UNION ALL SELECT 1 , 3, 12

    UNION ALL SELECT 2 , 6, 20

    UNION ALL SELECT 2 , 8, 21

    UNION ALL SELECT 2 , 7, 22

    UNION ALL SELECT 3 , 4, 31

    UNION ALL SELECT 3 , 4, 32

    UNION ALL SELECT 3 , 3, 33

    ; WITH CTE AS

    (

    SELECT T.ColA

    ,T.ColB

    ,T.ColC

    ,RN = RANK() OVER (PARTITION BY T.ColA ORDER BY T.ColB DESC )

    ,RN2 = ROW_NUMBER() OVER (PARTITION BY T.ColA ORDER BY T.ColB DESC )

    FROM #T T

    )

    SELECT *

    FROM CTE

    --WHERE CTE.RN = 1

    This is the result:

    ColA ColB ColC RN RN2

    1 3 12 1 1

    1 2 11 2 2

    1 1 10 3 3

    2 8 21 1 1

    2 7 22 2 2

    2 6 20 3 3

    3 4 31 1 1

    3 4 32 1 2

    3 3 33 3 3

    As you can see, for the two identical values(4) of B, ROW_NUMBER() returns 1 once and 2 once. That's why I asked about the function, because I don't understand on what basis it computes the number it returns. But I clearly have some reading to do, so if you don't know, or do know but don't want to tell me, that's fine - I'll figure it out eventually.

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply