Request help with a query

  • HowardW (9/15/2009)


    The thing is I suspect it would be slower with a large dataset. If you want to preserve both rows in the event of a tie, use rank() rather than row_number().

    It might look uglier, but it's generally the cheapest solution for this problem.

    Forgive me if I choose to not trust your suspicions. I chose the largest table in AdventureWorks, and the execution plans show the same relative costs. Admittedly, the plan for the IN query depends on a seek on the clustered index, but I'm sure we could each devise examples to support our proclivities.

    SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate

    FROM Sales.SalesOrderDetail d

    WHERE LineTotal IN

    (SELECT MIN(LineTotal)

    FROM Sales.SalesOrderDetail

    WHERE SalesOrderID = d.SalesOrderID)

    ORDER BY SalesOrderID, SalesOrderDetailID;

    ------------------------------------------

    WITH cte AS (SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate,

    RANK() OVER(PARTITION BY SalesOrderID ORDER BY LineTotal) AS SeqID

    FROM Sales.SalesOrderDetail)

    SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate

    FROM cte WHERE SeqID = 1;



    --Jonathan

  • Jonathan (9/15/2009)


    SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate

    FROM Sales.SalesOrderDetail d

    WHERE LineTotal IN

    (SELECT MIN(LineTotal)

    FROM Sales.SalesOrderDetail

    WHERE SalesOrderID = d.SalesOrderID)

    ORDER BY SalesOrderID, SalesOrderDetailID;

    ------------------------------------------

    WITH cte AS (SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate,

    RANK() OVER(PARTITION BY SalesOrderID ORDER BY LineTotal) AS SeqID

    FROM Sales.SalesOrderDetail)

    SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate

    FROM cte WHERE SeqID = 1;

    This is the output i get from set statistics io on .

    Non Cte

    Table 'SalesOrderDetail'. Scan count 31466, logical reads 103081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    CTE Version

    Table 'SalesOrderDetail'. Scan count 1, logical reads 1241, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.



    Clear Sky SQL
    My Blog[/url]

  • I do not know why such a long query has been posted.

    ;with cte as (select *,ROW_NUMBER() over(partition by vehicle order by dueHours) as seqID from @sample )

    select * from cte where seqID = 1

    compared to

    SELECT * FROM @sample s

    WHERE DueHours IN (SELECT MIN(DueHours) FROM @sample WHERE Vehicle = s.Vehicle);

    This is a joke right? There no significant difference.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • 1) It preserves information when there are ties. Showing only one row in the result set removes what could be relevant information. In fact the CTE query given will return an arbitrary result in case of ties. If ties are to be broken, include some agreed-upon business logic to do so, e.g. PARTITION BY Vehicle ORDER BY DueHours, ID.

    This is valid commentary, but it depends on whether the application anticipates/desires more than one row. As has been pointed out, RANK() can be substituted for ROW_NUMBER() when preservation of ties is desirable. Adding additional ORDER BY criteria works for either solution.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • 2) It works with versions of SQL Server older than SQL Server 2005.

    True, and I used to code that way before better techniques were made available in 2005/2008. There are LOTS of ways to solve problems, but we are trying to focus on the most system-efficient ways.

    I'm old school, but that doesn't mean I will make long trips by horse instead of by car.

    Embrace the new. 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • 3) It is no slower than the CTE.

    Don't trust the workload estimations from the visual display of execution plans.

    Look at the I/O stats for a clue as to the real workloads. It may not have been perceptibly slower to you for the Adventureworks test, but then Adventureworks out of the box isn't really that big. You will find in most cases using real-world volumes that the ROW_NUMBER() technique outperforms the subquery.

    Please understand, I didn't invent this technique and take no pride of authorship in it. It's not a case of "my" solution compared to "your" solution. ROW_NUMBER() is simply more efficient most of the time.

    That said, the answer to almost every question in SQL is "It depends." If anyone wants to post situations in which other techniques outperform the ROW_NUMBER() technique, I would be happy to hear them and make note of them for future reference.

    Edited: Having been advised to review my comments for something to be ashamed of, and having failed to find in my search, I will have to settle for this with respect to the terms "real world" and "most". Whatever environment you are working in is your "real world" and "most" is equally subjective.

    Ultimately everyone agrees that competing solutions must be tested in the environment in which they are to run.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Dave Ballantyne (9/15/2009)


    This is the output i get from set statistics io on .

    Non Cte

    Table 'SalesOrderDetail'. Scan count 31466, logical reads 103081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    CTE Version

    Table 'SalesOrderDetail'. Scan count 1, logical reads 1241, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    And your point is? The table is the table. Both queries cache the same data, so the above is not particularly relevant to the performance. Try taking actual timings; I always get the IN query as faster than the CTE query.



    --Jonathan

  • Bob Hovious (9/15/2009)


    I do not know why such a long query has been posted.

    ;with cte as (select *,ROW_NUMBER() over(partition by vehicle order by dueHours) as seqID from @sample )

    select * from cte where seqID = 1

    compared to

    SELECT ID, Vehicle, DueHours, DueName

    FROM @sample s

    WHERE DueHours IN

    (SELECT MIN(DueHours)

    FROM @sample

    WHERE Vehicle = s.Vehicle);

    Huh? Yes, the second is shorter, but why do you care?



    --Jonathan

  • Bob Hovious (9/15/2009)


    2) It works with versions of SQL Server older than SQL Server 2005.

    True, and I used to code that way before better techniques were made available in 2005/2008. There are LOTS of ways to solve problems, but we are trying to focus on the most system-efficient ways.

    I'm old school, but that doesn't mean I will make long trips by horse instead of by car.

    Embrace the new. 🙂

    Then I disagree with your definition of "better" in this case. If a more agnostic solution works as well, I will usually choose it.

    If I'm traveling cross country, over fences, I'll take my hunter over your Mercedes.



    --Jonathan

  • And I'll take a helicopter over your hunter.

    Next you get a sparrow missile to shoot down my helicopter.... 😉

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (9/15/2009)


    And I'll take a helicopter over your hunter.

    Next you get a sparrow missile to shoot down my helicopter....

    Does this mean you've found a method that's faster than the IN query in my AdventureWorks example?

    As we've both stated, one should test with one's own particular schema and data. I don't doubt that the CTE will be faster in some circumstances, but that doesn't necessarily make it "better," particularly if it is deployed on SQL Server 2000. 🙂



    --Jonathan

  • Jonathan (9/15/2009)


    Bob Hovious (9/15/2009)


    And I'll take a helicopter over your hunter.

    Next you get a sparrow missile to shoot down my helicopter....

    Does this mean you've found a method that's faster than the IN query in my AdventureWorks example?

    As we've both stated, one should test with one's own particular schema and data. I don't doubt that the CTE will be faster in some circumstances, but that doesn't necessarily make it "better," particularly if it is deployed on SQL Server 2000. 🙂

    Well, since CTE's aren't available in SQL Server 2000, that argument is mute. As the OP requested help in a SQL Server 2005 forum, he should expect a SQL Server 2005 answer.

    As for which is faster, I'll leave that little competition for others right now.

  • I wouldn't have offered the cte/row_number() solution at all if this was the T-SQL 2000 forum... because it wouldn't even compile. This is T-SQL(SS2K5). What we are looking for here are the best solutions for the 2005 environment, not 2000.

    I'm a little busy at work right now. I'll send a more thorough response when I have more time.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Okay.... I knocked this out before taking off for lunch. Now anybody reading can satisfy themselves using Jonathan's measurement of time to execute, since he doesn't seem to think that logical reads count. There is probably a cleaner test harness setup, but I'm hungry and in a hurry.

    To avoid confusing screen display times with runtimes, I'm just tracking the milliseconds to write to a temporary table. On my laptop, the CTE/RowNumber solution ran in an average of 645 milliseconds and the Subquery solution averaged 1519 ms. That means the CTE solution was twice as fast for this test, which was only 120,000 rows.

    EDITED: The original post had the set @timer = getdate() statement in the wrong place. This has been corrected. The CTE solution was not twice as fast (at 120,000 rows on my laptop), but it still ran in only 70% of the time the subquery ran. Read on and revised numbers will be posted.

    select count(*) as SalesOrderDetail from sales.salesOrderDetail

    -- set up temp table to catch results so we aren't concerned with screen display times

    if object_id(N'tempdb..#temp1') is not null drop table #temp1

    if object_id(N'tempdb..#temp2') is not null drop table #temp2

    SELECT SalesOrderID, SalesOrderDetailID+0 as SalesOrderDetailID

    ,CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate

    INTO #temp1 from sales.salesOrderDetail

    WHERE 1 = 0

    SELECT * into #temp2 from #temp1

    -- get ready to rummmmmbleeeee

    declare @timer datetime

    declare @results table (queryType varchar(30), Millisecs int)

    declare @x int

    set @x = 1

    while @x <= 100

    begin

    set @timer = getdate()

    INSERT INTO #temp1 (SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate)

    SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate

    FROM Sales.SalesOrderDetail d

    WHERE LineTotal IN

    (SELECT MIN(LineTotal)

    FROM Sales.SalesOrderDetail

    WHERE SalesOrderID = d.SalesOrderID)

    ORDER BY SalesOrderID, SalesOrderDetailID;

    insert into @results

    select 'Subquery',datediff(ms,@timer,getdate())

    set @timer = getdate();

    ------------------------------------------

    WITH cte AS (SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate,

    RANK() OVER(PARTITION BY SalesOrderID ORDER BY LineTotal) AS SeqID

    FROM Sales.SalesOrderDetail)

    INSERT INTO #temp1 (SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate)

    SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, SpecialOfferID, UnitPrice,

    UnitPriceDiscount, LineTotal, ModifiedDate

    FROM cte WHERE SeqID = 1;

    insert into @results

    select 'CTE/RowNumber',datediff(ms,@timer,getdate())

    set @x = @x+1

    end

    select queryType, avg(millisecs) as avg

    from @results

    group by queryType

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • From previous debates comparing things like derived tables vs. correlated subqueries, I believe the result always comes down to the optimizer.

    Very simple subqueries with outer references are typically converted to joins by the optimizer. So even a basic performance statement such as 'derived tables are better than correlated subqueries' isn't always true in practice. Sometimes the correlated subqueries actually run faster (As much as it annoys me when it happens).

    It was commonly believed that EXISTS with an outer reference would be beaten by a LEFT JOIN with a NULL check. Actual testing proved this to be wrong in many situations, even though the left JOIN seems like it should win.

    Unfortunately, I'm not adept enough at reading execution plans to fully analyze this situation, but I can tell you that the execution plan changed more for the IN query on a table with 3.3 million rows than it did for the CTE. (Nested loops became a merge join in the IN method) I can also tell you that a derived table query with no outer references has the same execution plan on this 3.3M row dataset that the IN/outer reference query does. Just because a query LOOKS like it will be less efficient doesn't always mean it will be.

    I also seem to remember windowed functions like RowNumber starting to drag at a certain number of rows, so the distribution of data / number of rows to number/rank could also come into play here.

    Basically this brings us back to the most common saying in SQL. "It Depends".

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Viewing 15 posts - 16 through 30 (of 49 total)

You must be logged in to reply to this topic. Login to reply