July 1, 2009 at 7:36 am
-- GIVEN 3 TEMPORDERS
#TempOrder
orderId
-------
1 --has 10 orderlines shipped
2 --has 10 orderlines shipped
3 --has 10 orderlines shipped
-- GIVEN 1 BILLION ORDERS IN THE DATABASE AND AT LEAST 10 BILLION ORDERLINES (EVERY ORDER HAS 10 SHIPPED ORDERLINES)
Order
orderId
-------
1 --has 10 orderlines shipped
2 --has 10 orderlines shipped
3 --has 10 orderlines shipped
4 --has 10 orderlines shipped
5 --has 10 orderlines shipped
: --has 10 orderlines shipped
: --has 10 orderlines shipped
1000000000 --has 10 orderlines shipped (1 billionth record)
--RETRIEVE EACH TEMPORDER AND ITS SHIPPED ORDERLINES
orderId orderLineId
------- -----------
1 1
1 2
1 3
: :
2 1
2 2
2 3
: :
3 1
3 2
3 3
: :
--METHOD1: SUBSELECT
SELECT TempOL.orderId, TempOL.orderLineId
FROM #TempOrder O
INNER JOIN
    (SELECT OL.orderId, OL.orderLineId
     FROM OrderLine OL
     WHERE OL.status = 'Shipped') AS TempOL
    ON TempOL.orderId = O.orderId
--METHOD2: TEMPTABLE
INSERT INTO #TempOrderWithShipped (orderId, orderLineId)
SELECT O.orderId, OL.orderLineId
FROM #TempOrder O
INNER JOIN OrderLine OL ON OL.orderId = O.orderId AND OL.status = 'Shipped'
SELECT * FROM #TempOrderWithShipped
So the main question is: which is more efficient?
I'm pretty new to SQL, but my shallow analysis is this:
METHOD1 will read 1 billion x 10 rows and then match the 3 temp orders against them.
METHOD2 will read only 3 x 10 rows,
which makes METHOD2 more efficient.
I also prefer METHOD2 for code readability.
I'm not looking for a coding alternative, mainly because the real code I'm working on is much more complicated than the above. But if you think there's a third way, let me know.
July 1, 2009 at 8:02 am
You're pretty much spot on with your analysis. In the first example, the nested query (or derived table, as it's commonly referred to) runs first, and its results are then filtered in the WHERE clause.
The first example is roughly equivalent to inserting the results of the nested query into a temp table and then running the outer query while joining to that temp table. It's not exactly the same, but it illustrates the point.
This isn't to say that derived tables can't be efficient.
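Spelled out, that rough equivalent might look something like this (just a sketch against the tables above; the #Shipped name is made up):

-- Materialize the derived table's results first...
SELECT OL.orderId, OL.orderLineId
INTO #Shipped
FROM OrderLine OL
WHERE OL.status = 'Shipped';

-- ...then run the outer query joining to that temp table
SELECT S.orderId, S.orderLineId
FROM #TempOrder O
INNER JOIN #Shipped S
    ON S.orderId = O.orderId;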
By the way, if you use the SET STATISTICS IO ON option (just run that statement once within the connection you're working in) and then run the queries, you'll get details on the IO impact of each query, which can often provide very obvious clues about efficiency.
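For example, something like this (then check the Messages tab in SSMS):

SET STATISTICS IO ON;
-- run METHOD1 and METHOD2 here, one after the other;
-- each statement reports scan counts and logical/physical
-- reads per table, so you can compare the two directly
SET STATISTICS IO OFF;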
July 1, 2009 at 8:07 am
Although I hate to say it this way, it depends...
But first, it seems a simple inner join would more than suffice for what you want: just join the header to the details table on the primary keys and specify your filter criteria in the WHERE clause. Much simpler, and no need for a sub-select or temp tables.
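Something like this, roughly (a sketch against the tables above, assuming orderId is the key on both sides):

SELECT T.orderId, OL.orderLineId
FROM #TempOrder T          -- the header rows of interest
INNER JOIN OrderLine OL    -- the detail rows
    ON OL.orderId = T.orderId
WHERE OL.status = 'Shipped';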
Now, on to your question..
Part of the consideration for me has always been the load time and the space needed to populate the temp table; both tend to be expensive, and a sub-select will often win. I have often done it both ways and tested to see which was better. There's also the case of temp table vs. table variable, which is even stickier. I had a case where a process ran much faster with table variables up to a point and then slower afterwards, and slower with temp tables up to that point and then much faster afterwards. That process was re-architected to use BOTH, because past that defined point, even with the cost of copying the data from the table variable into the temp table, it was still 40 times faster.
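The kind of hybrid I mean might be sketched like this (the 10000-row threshold and the #BigSet name are invented for illustration; my real process was more involved):

-- Start with a table variable for the cheap, small-volume phase
DECLARE @Rows TABLE (orderId INT, orderLineId INT);

INSERT INTO @Rows (orderId, orderLineId)
SELECT O.orderId, OL.orderLineId
FROM #TempOrder O
INNER JOIN OrderLine OL
    ON OL.orderId = O.orderId
   AND OL.status = 'Shipped';

-- Past a measured break-even point, copy into a temp table
IF (SELECT COUNT(*) FROM @Rows) > 10000  -- invented threshold
BEGIN
    SELECT orderId, orderLineId
    INTO #BigSet
    FROM @Rows;
    -- ...continue the process against #BigSet instead...
END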
Temp tables tend to be unindexed, so you will take a hit on that, and in previous versions, even if you defined an index, it would never be used due to the way the optimization engine evaluated the batch.
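For reference, a temp table can be given an index after it's created; whether the optimizer actually uses it is the version-dependent part. A sketch:

CREATE TABLE #TempOrderWithShipped (orderId INT NOT NULL, orderLineId INT NOT NULL);

CREATE CLUSTERED INDEX IX_TempShipped_orderId
    ON #TempOrderWithShipped (orderId);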
I'm not sure if I answered your question or just gave you more to think about..
CEWII