March 28, 2007 at 3:36 pm
Hello,
Could someone help me with a query in the following situation? I'm using SQL 2000 in this case.
I have to return one row for a course, but some -- but not all -- courses have multiple instructors in this setup and so a query joining the tables will return two rows for one course if the course has two instructors.
Is there a way to return only one row for each course, whether or not it has multiple professors? I realize that the choice of which professor's row gets chosen will be arbitrary in this case, but that is OK for our purposes. I would just prefer to do this with a query rather than using an intermediate table.
I've tried to use TOP 1, but that returns only the first row of the results or, say, 1 percent of rows, not one row for each value in the course id column. DISTINCT also did not work, because I need to include the instructor name in the output (so DISTINCT returns two rows in those cases).
Here's an example (using sample code as the real code is far too convoluted):
SELECT c.course_id, c.course_title, i.instructor_id, i.instructor_name
FROM courses c INNER JOIN roster r on c.course_id = r.course_id
INNER JOIN instructors i ON r.instructor_id = i.instructor_id
Sample output:
course_id, course_title, instructor_id, instructor_name
009,Intermediate Math,25,John Smith
001,Intro to Math,25,John Smith
001,Intro to Math,31,Jane Doe
007,Advanced Math,45,Isaac Newton
Desired output:
course_id, course_title, instructor_id, instructor_name
009,Intermediate Math,25,John Smith
001,Intro to Math,31,Jane Doe
007,Advanced Math,45,Isaac Newton
OR
course_id, course_title, instructor_id, instructor_name
009,Intermediate Math,25,John Smith
001,Intro to Math,25,John Smith
007,Advanced Math,45,Isaac Newton
If there is any more information needed, just let me know.
Thanks for any help,
webrunner
-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
March 28, 2007 at 4:03 pm
Untested, and I don't have your data, so not optimized in any way.
SELECT
c.Course_ID
,c.Course_Title
,i.Instructor_ID
,i.Instructor_Name
FROM
(
SELECT
Course_ID = r.course_id
,Instructor_ID = Min(r.instructor_id)
FROM
roster AS r
GROUP BY
r.course_id
) AS x
INNER JOIN courses AS c ON
x.course_id = c.course_id
INNER JOIN instructors AS i ON
x.instructor_id = i.instructor_id
March 28, 2007 at 9:22 pm
Hello,
If you prefre not to use an intermediate table, try this:
SELECT c.course_id, c.course_title, i.instructor_id, i.instructor_name
FROM courses c
INNER JOIN roster r on c.course_id = r.course_id
and r.instructor_id = (SELECT MAX(Z.instructor_id) FROM roster Z WHERE Z.course_id = c.course_id)
INNER JOIN instructors i ON r.instructor_id = i.instructor_id
K. Matsumura
March 29, 2007 at 5:40 am
Koji, I'm confused when you say "If you prefer not to use an intermediate table, try this", when we both use the "roster" table, which appears to me to be the intermediate table, from what evidence we have.
March 29, 2007 at 6:31 am
Actually approach from Koji is exactly what they name "hidden sursor".
On big amount of data it will kill performance.
_____________
Code for TallyGenerator
March 29, 2007 at 7:37 am
Sorry, I was unclear in my terms. I meant "temporary" table (not "intermediate" as in linking table for many-to-many relationship).
Thanks to everyone for the suggestions -- I'll test them out.
Another idea is that I am storing the results in a holding table, which I guess in essence can serve as a temporary table of sorts. For example, I can load the holding table, duplicate course rows and all, and add a counter to the rows. Then I can delete all but the minimum counter row for all courses that have more than one record in the table.
Does that sound reasonable?
Thanks again!
---
webrunner
-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
March 29, 2007 at 8:19 am
webrunner, my method should work fine. Derived tables (which is what you are seeing in my query) are not only not typically problematic, but in many cases are the ideal solution. Run it, and if you have problems let us know.
March 29, 2007 at 8:40 am
Both Kofi's and David's suggestions should perform well if you have the tables indexes for those columns. The query optimizer can take some shortcuts and get you quick results.
March 29, 2007 at 9:57 am
When I saw this problem, I immediatly thought of two solutions. One was the method using the subquery to return the Min (or Max) value from the Instructor table and joining it back to the Course table. The second was to simply use the Min (or Max) function in the select list and employ a Group By clause on the other fields in the Select list. I was wondering what the differences in the execution plan might be so I tried and example using the AdventureWorks database (trivial, I know...I didn't include subsequent joins to the Contact information).
Here's my first Query:
SELECT
Stor.CustomerID, Stor.Name, Stor.SalesPersonID, Cont.ContactID
FROM Sales.Store Stor INNER JOIN Sales.StoreContact Cont
ON Stor.CustomerID = Cont.CustomerID
AND Cont.ContactID = (SELECT MIN(Cont2.ContactID)
FROM Sales.StoreContact Cont2
WHERE Cont2.CustomerID = Stor.CustomerID)
And Here's my second:
SELECT Stor.CustomerID, Stor.Name, Stor.SalesPersonID, MIN(Cont.ContactID)
FROM Sales.Store Stor INNER JOIN Sales.StoreContact Cont
ON Stor.CustomerID = Cont.CustomerID
GROUP BY Stor.CustomerID, Stor.Name, Stor.SalesPersonID
Both queryies generate pretty comparable execution plans with one exception, The group by clause causes an aggregation to be performed but the reletive cost is minimal. My question to all of you who are more versed in how queries are executed than me is: What are the drawbacks of the Group By approach in this case? To my eye, it is the simpler approach but I can see why the aggregation might impact performance.
Feel free to pile on now.
March 29, 2007 at 1:24 pm
For the Enthusiasts:
-- Just Updated this post 2:45PM 3/39/2007
My Hypothesis: The first thing that came to my mind, from past experience, is that a derived table in the case of Dave's query, and the group by clause from Chris's query, would essentially be more efficient than that of Koji's when large number of rows exists in these tables.
Here are my overall analysis and conducted controlled testing.
Considering that the appropriate indexes are on the dbo.Course, dbo.Roster, and dbo.Instructor
For my purposes, I did not use these exact tables, but I have followed a similar schema, with the appropriate indexes on Course_ID on both dbo.Course and dbo.Roster tables, also appropriate indexes on Instructor_ID on both dbo.Roster and dbo.Instructor tables, and finally index on Roster_ID on dbo.Roster. My plan, is to examine the execution plan of all 3 queries.
For my analysis, here are my queries:
-- This is similar to Koji's query
select v.Course_ID, a.Instructor_ID, c.Roster_ID
From dbo.Course v
INNER JOIN dbo.Instructor a
on a.Instructor_ID = v.Instructor_ID
INNER Join dbo.Roster c
on a.Instructor_ID = c.Roster_ID
where
c.Roster_ID = (SELECT Min(c2.Roster_ID)
From dbo.Roster c2
Where
c.Roster_ID = c2.Roster_ID
)
-- Similar to Chris's second query
select v.Course_ID,a.Instructor_ID, min(c.Roster_ID)
From dbo.Course v
INNER JOIN dbo.Instructor a
on a.Instructor_ID = v.Instructor_ID
INNER Join dbo.Roster c
on a.Instructor_ID = c.Roster_ID
group by v.Course_ID,a.Instructor_ID
-- Similar to Dave's
select v.Course_ID,ab.Instructor_ID, ab.Roster_ID
From dbo.Course v
INNER JOIN
(
Select a.Instructor_ID, min(c.Roster_ID) AS Roster_ID
from dbo.Instructor a
INNER Join dbo.Roster c
on a.Instructor_ID = c.Roster_ID
group by a.Instructor_ID
) AS ab
ON v.Instructor_ID = ab.Instructor_ID
Essentially I tested these 3 queries out, and the differences in performance were minimal. However, looking at the execution plans, Koji's method costs 51% of the batch (running all 3 queries together), while Dave and Chris's method consume ~25% each. Dave and Chris's methods generated the same execution plan, 2 merge joins, 3 index scans. Koji's method resulted in 3 merge joins and 4 index scans.
Conclusion, when dealing with large number of rows, please follow the derive table method (or Chris’s method, if you can group correctly), this will boost performance compare to using Where ID = (SELECT Min(ID)). Using the where clause will drag down performance when dealing with large number of rows. However, if the rows are minimal, the quickest solution would be to use Koji's method, but keep in mind, to always think/plan ahead for any future potential performance issues.
If there shall be any questions and/or comments, please feel free to contact me via Instant Messaging, my email is mengus_vang@hotmail.com .
Regards
Meng (SQL Server DBA/Developer/Enthusiasts)
MCTS
March 30, 2007 at 2:42 pm
Thanks for all of the in-depth comments!
Due to time, I decided not to work on the query side as suggested above. Instead, I decided to put the values in the holding table and then delete the duplicates afterward.
Thanks again, though. As soon as I get a chance I will return to the query methods described above, taking into account Meng's performance analysis.
---
webrunner
-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
Viewing 11 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply