Selecting one row for each value in another column

  • Hello,

    Could someone help me with a query in the following situation? I'm using SQL 2000 in this case.

    I have to return one row for a course, but some -- but not all -- courses have multiple instructors in this setup and so a query joining the tables will return two rows for one course if the course has two instructors.

    Is there a way to return only one row for each course, whether or not it has multiple professors? I realize that the choice of which professor's row gets chosen will be arbitrary in this case, but that is OK for our purposes. I would just prefer to do this with a query rather than using an intermediate table.

    I've tried to use TOP 1, but that returns only the first row of the results or, say, 1 percent of rows, not one row for each value in the course id column. DISTINCT also did not work, because I need to include the instructor name in the output (so DISTINCT returns two rows in those cases).

    Here's an example (using sample code as the real code is far too convoluted):

    SELECT c.course_id, c.course_title, i.instructor_id, i.instructor_name

    FROM courses c INNER JOIN roster r on c.course_id = r.course_id

    INNER JOIN instructors i ON r.instructor_id = i.instructor_id

    Sample output:

    course_id, course_title, instructor_id, instructor_name

    009,Intermediate Math,25,John Smith

    001,Intro to Math,25,John Smith

    001,Intro to Math,31,Jane Doe

    007,Advanced Math,45,Isaac Newton

    Desired output:

    course_id, course_title, instructor_id, instructor_name

    009,Intermediate Math,25,John Smith

    001,Intro to Math,31,Jane Doe

    007,Advanced Math,45,Isaac Newton

    OR

    course_id, course_title, instructor_id, instructor_name

    009,Intermediate Math,25,John Smith

    001,Intro to Math,25,John Smith

    007,Advanced Math,45,Isaac Newton

    If there is any more information needed, just let me know.

    Thanks for any help,

    webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • Untested, and I don't have your data, so not optimized in any way.

    SELECT

     c.Course_ID

     ,c.Course_Title

     ,i.Instructor_ID

     ,i.Instructor_Name

    FROM

     (

     SELECT

      Course_ID  = r.course_id

      ,Instructor_ID = Min(r.instructor_id)

     FROM

      roster AS r

     GROUP BY

      r.course_id

      ) AS x

     

     INNER JOIN courses AS c ON

      x.course_id = c.course_id

      

     INNER JOIN instructors AS i ON

      x.instructor_id = i.instructor_id

  • Hello,

    If you prefre not to use an intermediate table, try this:

    SELECT c.course_id, c.course_title, i.instructor_id, i.instructor_name

    FROM courses c

    INNER JOIN roster r on c.course_id = r.course_id

    and r.instructor_id = (SELECT MAX(Z.instructor_id) FROM roster Z WHERE Z.course_id = c.course_id)

    INNER JOIN instructors i ON r.instructor_id = i.instructor_id

    K. Matsumura

  • Koji, I'm confused when you say "If you prefer not to use an intermediate table, try this", when we both use the "roster" table, which appears to me to be the intermediate table, from what evidence we have.

  • Actually approach from Koji is exactly what they name "hidden sursor".

    On big amount of data it will kill performance.

    _____________
    Code for TallyGenerator

  • Sorry, I was unclear in my terms. I meant "temporary" table (not "intermediate" as in linking table for many-to-many relationship).

    Thanks to everyone for the suggestions -- I'll test them out.

    Another idea is that I am storing the results in a holding table, which I guess in essence can serve as a temporary table of sorts. For example, I can load the holding table, duplicate course rows and all, and add a counter to the rows. Then I can delete all but the minimum counter row for all courses that have more than one record in the table.

    Does that sound reasonable?

    Thanks again!

    ---

    webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

  • webrunner, my method should work fine. Derived tables (which is what you are seeing in my query) are not only not typically problematic, but in many cases are the ideal solution. Run it, and if you have problems let us know.

  • Both Kofi's and David's suggestions should perform well if you have the tables indexes for those columns. The query optimizer can take some shortcuts and get you quick results.

  • When I saw this problem, I immediatly thought of two solutions.  One was the method using the subquery to return the Min (or Max) value from the Instructor table and joining it back to the Course table.  The second was to simply use the Min (or Max) function in the select list and employ a Group By clause on the other fields in the Select list.  I was wondering what the differences in the execution plan might be so I tried and example using the AdventureWorks database (trivial, I know...I didn't include subsequent joins to the Contact information).

    Here's my first Query:

    SELECT

    Stor.CustomerID, Stor.Name, Stor.SalesPersonID, Cont.ContactID

      FROM Sales.Store Stor INNER JOIN Sales.StoreContact Cont

        ON Stor.CustomerID = Cont.CustomerID

       AND Cont.ContactID = (SELECT MIN(Cont2.ContactID)

                               FROM Sales.StoreContact Cont2

                              WHERE Cont2.CustomerID = Stor.CustomerID)

    And Here's my second:

    SELECT Stor.CustomerID, Stor.Name, Stor.SalesPersonID, MIN(Cont.ContactID)

      FROM Sales.Store Stor INNER JOIN Sales.StoreContact Cont

        ON Stor.CustomerID = Cont.CustomerID

    GROUP BY Stor.CustomerID, Stor.Name, Stor.SalesPersonID

    Both queryies generate pretty comparable execution plans with one exception, The group by clause causes an aggregation to be performed but the reletive cost is minimal.  My question to all of you who are more versed in how queries are executed than me is: What are the drawbacks of the Group By approach in this case?  To my eye, it is the simpler approach but I can see why the aggregation might impact performance.

    Feel free to pile on now.

  • For the Enthusiasts:

    -- Just Updated this post 2:45PM 3/39/2007

    My Hypothesis: The first thing that came to my mind, from past experience, is that a derived table in the case of Dave's query, and the group by clause from Chris's query, would essentially be more efficient than that of Koji's when large number of rows exists in these tables.

    Here are my overall analysis and conducted controlled testing.

    Considering that the appropriate indexes are on the dbo.Course, dbo.Roster, and dbo.Instructor

    For my purposes, I did not use these exact tables, but I have followed a similar schema, with the appropriate indexes on Course_ID on both dbo.Course and dbo.Roster tables, also appropriate indexes on Instructor_ID on both dbo.Roster and dbo.Instructor tables, and finally index on Roster_ID on dbo.Roster. My plan, is to examine the execution plan of all 3 queries.

    For my analysis, here are my queries:

    -- This is similar to Koji's query

    select v.Course_ID, a.Instructor_ID, c.Roster_ID

    From dbo.Course v

    INNER JOIN dbo.Instructor a

    on a.Instructor_ID = v.Instructor_ID

    INNER Join dbo.Roster c

    on a.Instructor_ID = c.Roster_ID

    where

    c.Roster_ID = (SELECT Min(c2.Roster_ID)

    From dbo.Roster c2

    Where

    c.Roster_ID = c2.Roster_ID

    )

     

    -- Similar to Chris's second query

    select v.Course_ID,a.Instructor_ID, min(c.Roster_ID)

    From dbo.Course v

    INNER JOIN dbo.Instructor a

    on a.Instructor_ID = v.Instructor_ID

    INNER Join dbo.Roster c

    on a.Instructor_ID = c.Roster_ID

    group by v.Course_ID,a.Instructor_ID

     

    -- Similar to Dave's

    select v.Course_ID,ab.Instructor_ID, ab.Roster_ID

    From dbo.Course v

    INNER JOIN

    (

    Select a.Instructor_ID, min(c.Roster_ID) AS Roster_ID

    from dbo.Instructor a

    INNER Join dbo.Roster c

    on a.Instructor_ID = c.Roster_ID

    group by a.Instructor_ID

    ) AS ab

    ON v.Instructor_ID = ab.Instructor_ID

     

    Essentially I tested these 3 queries out, and the differences in performance were minimal. However, looking at the execution plans, Koji's method costs 51% of the batch  (running all 3 queries together), while Dave and Chris's method consume ~25% each. Dave and Chris's methods generated the same execution plan, 2 merge joins, 3 index scans. Koji's method resulted in 3 merge joins and 4 index scans.

    Conclusion, when dealing with large number of rows, please follow the derive table method (or Chris’s method, if you can group correctly), this will boost performance compare to using Where ID = (SELECT Min(ID)). Using the where clause will drag down performance when dealing with large number of rows. However, if the rows are minimal, the quickest solution would be to use Koji's method, but keep in mind, to always think/plan ahead for any future potential performance issues.

     

    If there shall be any questions and/or comments, please feel free to contact me via Instant Messaging, my email is mengus_vang@hotmail.com .

     

    Regards

     

    Meng (SQL Server DBA/Developer/Enthusiasts)

    MCTS

  • Thanks for all of the in-depth comments!

    Due to time, I decided not to work on the query side as suggested above. Instead, I decided to put the values in the holding table and then delete the duplicates afterward.

    Thanks again, though. As soon as I get a chance I will return to the query methods described above, taking into account Meng's performance analysis.

    ---

    webrunner

    -------------------
    A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
    Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply