The Reverse Optimization Challenge #1

  • We all spend a lot of time and effort eradicating RBAR and optimizing queries. This thread gives you a chance to do the opposite. We'll start with a very simple query out of the Northwind database, and each poster can change something about it to make it increasingly less efficient. There are only a few rules:

    1. You can only use the 4 Northwind tables (Customers, Products, Orders and Order Details) and the system tables.

    2. You cannot do anything that serves absolutely no purpose or is completely time-wasting... for example, you can't create a loop to generate random numbers until you get a particular number before continuing.

    3. Explain what you added, and why it's less efficient than it was before / why you shouldn't do it in this scenario.

    Beyond these rules, any other T-SQL commands are allowed... but the code should still work after you're done and return the correct result set. This post may be fun and amusing, or it may go completely unread / die out quickly. Guess we'll find out. :hehe: Here is the base query; the two commented-out lines allow for the first chances at bad query design.

    [font="Courier New"]

    SELECT CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity, SUM(OD.UnitPrice) * Quantity TotalPrice

       -- Count of Number of Unique Products Per Order Per Customer

       -- Average Cost of item on the order (Total Price of Order / Total of All products * their quantities)

    FROM Orders O

       INNER JOIN [Order Details] OD ON O.OrderID = OD.OrderID

       INNER JOIN Customers C ON O.CustomerID = C.CustomerID

       INNER JOIN Products P ON OD.ProductID = P.ProductID

    GROUP BY CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity

    For those who want to participate and do not have the Northwind database installed on their systems, I've scripted the table DDL into the attachment below. I didn't include all the data, because it'd probably be easier just to install the DB at that point.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • why not....

    Hint hell:

    SELECT CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity, SUM(OD.UnitPrice) * Quantity TotalPrice

    FROM Orders O with (INDEX = CustomerID)

    INNER LOOP JOIN [Order Details] OD WITH (INDEX = OrderID) ON O.OrderID = OD.OrderID

    INNER MERGE JOIN Customers C WITH (INDEX = Region) ON O.CustomerID = C.CustomerID

    INNER MERGE JOIN Products P WITH (INDEX = CategoryID) ON OD.ProductID = P.ProductID

    GROUP BY CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity

    ORDER BY CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity

    OPTION (HASH GROUP)

    duration doubles

    Why you shouldn't do this: The optimiser usually knows best. Hints should only be applied if there's no other way to get the query to perform, you know exactly what the hint is going to do and exactly why you're applying it, and you're 120% sure you know better than the optimiser.
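
    For anyone playing along at home, the numbers above are easy to check yourself; a minimal sketch (the exact figures will of course differ from box to box):

    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- run the unhinted base query, then the hinted version above,
    -- and compare elapsed time and logical reads in the Messages tab

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;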

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • This sounds like fun.

    I'll see your Hints Hell, and I'll raise you a Windowed Aggregate function with a misapplied distinct:

    select distinct * from

    (

    SELECT CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity, SUM(OD.UnitPrice) over (partition by CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity) * Quantity TotalPrice

    FROM Orders O with (INDEX = CustomerID)

    INNER LOOP JOIN [Order Details] OD WITH (INDEX = OrderID) ON O.OrderID = OD.OrderID

    INNER MERGE JOIN Customers C WITH (INDEX = Region) ON O.CustomerID = C.CustomerID

    INNER MERGE JOIN Products P WITH (INDEX = CategoryID) ON OD.ProductID = P.ProductID

    ) s

    ORDER BY CompanyName, OrderDate, ProductName, UnitPrice, Quantity

    OPTION (HASH GROUP)

    Double the time from Gail's (4x the original), and worktable logical reads are through the roof.

    Why not to do this:

    While windowed aggregate functions appear at first glance to derive the same information as a "regular" aggregate function, they perform the action FOR EACH ROW within the candidate result set. So - unless you need to derive info correlating the detail row to the grouped aggregate (or comparing multiple groupings to each other), windowed aggregates can be a huge drag. Also - in general, the new aggregate functions (PIVOT and the windowed aggregates) all seem to be heavy abusers of worktables, so they will jack up your IO stats.
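
    To make the contrast concrete, here's a stripped-down sketch against just [Order Details] (assumes SQL 2005 or later for the OVER clause): the grouped version collapses to one row per order in a single pass, while the windowed version repeats the total on every detail row (usually via a worktable) and then needs DISTINCT to throw the duplicates away.

    -- Grouped aggregate: one output row per order
    SELECT OD.OrderID, SUM(OD.UnitPrice * OD.Quantity) AS OrderTotal
    FROM [Order Details] OD
    GROUP BY OD.OrderID;

    -- Windowed aggregate: the same total stamped onto every detail row,
    -- typically spooled through a worktable, with DISTINCT cleaning up afterwards
    SELECT DISTINCT OD.OrderID,
           SUM(OD.UnitPrice * OD.Quantity) OVER (PARTITION BY OD.OrderID) AS OrderTotal
    FROM [Order Details] OD;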

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Gail, your index choices and join choices were 'too good'! 🙂 Try this one:

    SELECT CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity, SUM(OD.UnitPrice) over (partition by CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity) * Quantity TotalPrice

    FROM Orders O with (INDEX = EmployeeID)

    INNER LOOP JOIN [Order Details] OD WITH (INDEX = ProductID) ON O.OrderID = OD.OrderID

    INNER LOOP JOIN Customers C WITH (INDEX = Region) ON O.CustomerID = C.CustomerID

    INNER LOOP JOIN Products P WITH (INDEX = CategoryID) ON OD.ProductID = P.ProductID

    On my box, the original duration was 0.2 sec with 37 IOs.

    That version is 1.2 sec (6X) and 53,458 IOs (1,444X).

    Usually it is best to let the optimizer pick both the join types AND the indexing. If you are going to force indexes, pick ones that are useful for either the join or the WHERE clause.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (11/10/2008)


    Gail, your index choices and join choices were 'too good'! 🙂

    I was trying to see how many sorts I could get into a single query. Answer: quite a few... :D

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Added the two extra calculations I only had comments for in my original query, as correlated subqueries... one as a CASEd correlated subquery.

    Temporarily removed the Windowed function and the hash group, as they wouldn't compile on my 2K box.

    [font="Courier New"]SELECT CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity,

       SUM(OD.UnitPrice) S, --over (partition by CompanyName,  OrderDate, ProductName, OD.UnitPrice, Quantity) * Quantity TotalPrice,

       -- Count of Number of Unique Products Per Order Per Customer

       (SELECT COUNT(DISTINCT ProductID) FROM [Order Details] ODI WHERE ODI.OrderID = O.OrderID) DistinctProductCount,

       -- Average Cost of item on the order (Total Price of Order / Total of All products * their quantities)

       AvgCost = CASE WHEN

           ABS((SELECT SUM(Quantity) FROM [Order Details] ODI2 WHERE ODI2.OrderID = O.OrderID)) > 0

               THEN

           (SELECT SUM(UnitPrice * Quantity) / SUM(Quantity) FROM [Order Details] ODI3 WHERE ODI3.OrderID = O.OrderID)

               ELSE

           0.00

               END

              

    FROM Orders O WITH (INDEX = EmployeeID)

            INNER LOOP JOIN [Order Details] OD WITH  (INDEX = ProductID) ON O.OrderID = OD.OrderID

            INNER MERGE JOIN Customers C WITH (INDEX = Region) ON O.CustomerID = C.CustomerID

            INNER MERGE JOIN Products P WITH (INDEX = CategoryID) ON OD.ProductID = P.ProductID

    GROUP BY CompanyName, O.OrderID, OrderDate, ProductName, OD.UnitPrice, Quantity

    ORDER BY CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity

    --OPTION (HASH GROUP)

    Correlated subqueries CAN (I believe the optimizer sometimes converts them to joins / derived tables for you) be run Row by Row, which is very slow. The same query could be written with derived tables that are only read once. Also, the CASEd subquery reads the table twice when it should read it only once; the CASE could easily live in a single subquery that returns 0 when the total Quantity is 0.
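
    For reference, a rough sketch of the derived-table shape I mean (just the two extra columns, not the full query): the [Order Details] aggregates are computed once per OrderID, and the CASE collapses into a single read of the same derived table.

    SELECT O.OrderID,
           ODA.DistinctProductCount,
           -- single pass: return 0 when the order has no quantity instead of re-reading the table
           CASE WHEN ODA.TotalQty > 0 THEN ODA.TotalCost / ODA.TotalQty ELSE 0.00 END AS AvgCost
    FROM Orders O
         INNER JOIN (SELECT OrderID,
                            COUNT(DISTINCT ProductID) AS DistinctProductCount,
                            SUM(Quantity)             AS TotalQty,
                            SUM(UnitPrice * Quantity) AS TotalCost
                     FROM [Order Details]
                     GROUP BY OrderID) ODA ON ODA.OrderID = O.OrderID;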

    Definitely need to see a cursor in this thing somehow before we're done!

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • If you want a 'cursor', simply create a UDF that takes an OrderID and calculates the SUM in the output. I leave it to the inquisitive reader to implement this and provide read/duration analysis. 😎

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Garadin (11/10/2008)


    Correlated subqueries CAN (I believe the optimizer sometimes converts them to joins / derived tables for you) be run Row by Row, which is very slow.

    More correctly, the optimiser converts them most of the time. The only time I've managed to get a correlated subquery to be executed row by row is when there's an inequality in the expression that relates it to the outer query.

    On my 2008 server, this query does not run either of the subqueries row by row. Easy to see: the number of executes on the scans of Order Details is 1 (2 if I don't add a MAXDOP 1 hint, but that's a parallelism thing). If it were truly being executed multiple times, the scan count of Order Details would be higher (the number of rows in the outer result set, which is 2155), or there would be an index spool/table spool operator in place to facilitate the multiple executions.
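
    For the curious, a sketch of the sort of inequality correlation I mean (a running-total-style subquery); this is the pattern that tends to show up with an index/table spool and one execution per outer row:

    -- The inequality (ODI.OrderID <= OD.OrderID) means the inner aggregate can't simply
    -- be collapsed into a join, so it's typically re-evaluated for each outer row
    SELECT OD.OrderID, OD.ProductID,
           (SELECT SUM(ODI.Quantity)
            FROM [Order Details] ODI
            WHERE ODI.ProductID = OD.ProductID
              AND ODI.OrderID <= OD.OrderID) AS RunningQuantity
    FROM [Order Details] OD;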

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • are we allowed to use temp tables for looping? If the loop has a purpose?

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • GilaMonster (11/10/2008)


    Garadin (11/10/2008)


    Correlated subqueries CAN (I believe the optimizer sometimes converts them to joins / derived tables for you) be run Row by Row, which is very slow.

    More correctly, the optimiser converts them most of the time. The only time I've managed to get a correlated subquery to be executed row by row is when there's an inequality in the expression that relates it to the outer query.

    On my 2008 server, this query does not run either of the subqueries row by row. Easy to see: the number of executes on the scans of Order Details is 1 (2 if I don't add a MAXDOP 1 hint, but that's a parallelism thing). If it were truly being executed multiple times, the scan count of Order Details would be higher (the number of rows in the outer result set, which is 2155), or there would be an index spool/table spool operator in place to facilitate the multiple executions.

    Good to know. So long as we're on the topic, two more questions:

    Does the CASE statement as written actually behave any differently than if it were written as a single SELECT?

    If you use a derived table for the aggregate without criteria, e.g. (SELECT OrderID, MAX(UnitPrice) FROM [Order Details] GROUP BY OrderID) joined ON OrderID, would it attempt to resolve a MAX for all orders in the table, or only for the ones returned by your outer query? Should you add the same criteria from your outer query into the derived table?
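
    The shape I have in mind is roughly this (a sketch only, with an arbitrary customer filter purely for illustration):

    -- The derived table has no criteria of its own: logically it aggregates every order
    -- in the table; whether the optimizer pushes the outer restriction down into it
    -- is up to the plan it picks
    SELECT O.OrderID, X.MaxPrice
    FROM Orders O
         INNER JOIN (SELECT OrderID, MAX(UnitPrice) AS MaxPrice
                     FROM [Order Details]
                     GROUP BY OrderID) X ON X.OrderID = O.OrderID
    WHERE O.CustomerID = 'ALFKI';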

    Christopher Stobbs (11/10/2008)


    are we allowed to use temp tables for looping? If the loop has a purpose?

    Absolutely.

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data :: Running Totals :: Tally Table :: Cross Tabs/Pivots :: String Concatenation

  • NOW we are starting to get somewhere! :w00t:

    create function dbo.fn_totalprice (@orderid varchar(max))

    returns varchar(max)

    as

    BEGIN

    declare @result money

    select @result = sum(UnitPrice * Quantity) from [Order Details] with (index = productid) where orderid = @orderid

    return cast(@result as varchar(max))

    end

    GO

    SELECT CompanyName, OrderDate, ProductName, OD.UnitPrice, Quantity, cast(dbo.fn_TotalPrice(o.orderid) as money) as TotalPrice

    FROM Orders O with (INDEX = EmployeeID)

    INNER LOOP JOIN [Order Details] OD WITH (INDEX = ProductID) ON O.OrderID = OD.OrderID

    INNER LOOP JOIN Customers C WITH (INDEX = Region) ON O.CustomerID = C.CustomerID

    INNER LOOP JOIN Products P WITH (INDEX = CategoryID) ON OD.ProductID = P.ProductID

    order by companyname, orderdate, productname, od.unitprice, od.quantity, dbo.fn_totalprice(o.orderid)

    Bad stuff:

    1) Use of order by that is unnecessary

    2) UDF function with ancillary data access

    3) inappropriate datatypes (varchar(max) was a good bit worse than char(7950), which took 221K reads)

    This set has 391K IOs (10567X or FOUR orders of magnitude higher than baseline), 28.5 seconds runtime (142.5X) and an astounding 10775 WRITES (ZERO for original query)! Query cost is up to 11.1 (61X increase). Not previously reported was the CPU usage: 15 for original query and 10359 now (690X).
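
    For contrast with point 2, a set-based sketch (one possible rewrite, not necessarily the fastest) that returns the same TotalPrice without a scalar UDF re-reading [Order Details] for every output row:

    -- TotalPrice computed once per order in a derived table instead of per-row in a UDF
    SELECT C.CompanyName, O.OrderDate, P.ProductName, OD.UnitPrice, OD.Quantity, T.TotalPrice
    FROM Orders O
         INNER JOIN [Order Details] OD ON O.OrderID = OD.OrderID
         INNER JOIN Customers C ON O.CustomerID = C.CustomerID
         INNER JOIN Products P ON OD.ProductID = P.ProductID
         INNER JOIN (SELECT OrderID, SUM(UnitPrice * Quantity) AS TotalPrice
                     FROM [Order Details]
                     GROUP BY OrderID) T ON T.OrderID = O.OrderID;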

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Heh... remove the ANSI joins and use old-fashioned comma-separated tables in the FROM clause with no WHERE clause to do the join. Then add a DISTINCT or a GROUP BY to resolve the duplication.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
