Agregate join

  • I have an agregate query and I want to join a table to that query to add more fields. I do not want to join that table within that query itself because it give me multiple rows. How do I do that?

    Thanks.

  • Use a derived table to return the results of the aggregate, then join that to your secondary table:

    SELECT *

    FROM ( -- The results of the aggregation create the derived table

    SELECT RowID, MAX(RowDate)

    FROM TblA

    GROUP BY RowID

    ) AS drv (RowID, RowDate) INNER JOIN

    TblB ON drv.RowID = TblB.RowID AND

    drv.RowDate = TblB.RowDate

  • This is probably a good case for CROSS APPLY

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Jared

    Jared
    CE - Microsoft

  • Thanks JonFox, that is what I was looking for. Thanks again.

  • p-nut (11/21/2011)


    This is probably a good case for CROSS APPLY

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Jared

    Well...possibly. But if the requirement is just to join aggregated data to rows in another table, I don't know that CROSS APPLYing an inline TVF provides any advantage...well, that I know of, anyway. 🙂

  • p-nut (11/21/2011)


    This is probably a good case for CROSS APPLY

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Jared

    From my experience using CROSS APPLY will force the server to do LOOP JOIN which may be not the best option. So I would suggest to use JonFox's example.


    Alex Suprun

  • Alexander Suprun (11/21/2011)


    p-nut (11/21/2011)


    This is probably a good case for CROSS APPLY

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Jared

    From my experience using CROSS APPLY will force the server to do LOOP JOIN which may be not the best option. So I would suggest to use JonFox's example.

    I disagree. The execution plans in this case will probably be the same (depending on indexes and such). I like the CROSS APPLY here because it is cleaner. Of course, we don't know what the data that the OP looks like because they did not post any DDL or sample data.

    Jared

    Jared
    CE - Microsoft

  • p-nut (11/22/2011)


    Alexander Suprun (11/21/2011)


    p-nut (11/21/2011)


    This is probably a good case for CROSS APPLY

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Jared

    From my experience using CROSS APPLY will force the server to do LOOP JOIN which may be not the best option. So I would suggest to use JonFox's example.

    I disagree. The execution plans in this case will probably be the same (depending on indexes and such). I like the CROSS APPLY here because it is cleaner. Of course, we don't know what the data that the OP looks like because they did not post any DDL or sample data.

    Jared

    I don't see how APPLY can be more cleaner when simple JOIN can be used. Why make things more complicated than they really are? Even MS documentation says:

    http://technet.microsoft.com/en-us/library/ms175156.aspx


    The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query.

    Surely it doesn't mean that you cannot use APPLY in other situations but it shows that this operator has been developed especially to invoke the functions. And if I see in someone's code APPLY instead of JOIN I'll try to find reasons for that, but possibly there are no such. I saw a few times when developers were misusing APPLY because it's more flexible and those who like to process data row-by-row definitely like it but sometimes it results in really bad query plans.

    I didn't have time to find a good example for CROSS APPLY, but here is a simple one with OUTER APPLY vs LEFT JOIN:

    USE AdventureWorks

    GO

    SET SHOWPLAN_TEXT ON

    GO

    SELECT OH.SalesOrderID, OH.OrderDate, OD.ProductID

    FROM Sales.SalesOrderHeader OH

    LEFT JOIN

    (

    SELECT SalesOrderID, MAX(ProductID) AS ProductID

    FROM Sales.SalesOrderDetail

    GROUP BY SalesOrderID

    ) AS OD ON OD.SalesOrderID = OH.SalesOrderID

    SELECT OH.SalesOrderID, OH.OrderDate, OD.ProductID

    FROM Sales.SalesOrderHeader OH

    OUTER APPLY

    (

    SELECT MAX(ProductID) AS ProductID

    FROM Sales.SalesOrderDetail SOD

    WHERE SOD.SalesOrderID = OH.SalesOrderID

    ) AS OD

    |--Merge Join(Right Outer Join, MERGE: ([AdventureWorks].[Sales].[SalesOrderDetail].[SalesOrderID])=([OH].[SalesOrderID]), RESIDUAL: ([AdventureWorks].[Sales].[SalesOrderDetail].[SalesOrderID]=[AdventureWorks].[Sales].[SalesOrderHeader].[SalesOrderID] as [OH].[SalesOrderID]))

    |--Compute Scalar(DEFINE: ([Expr1007]=[Expr1005]))

    | |--Stream Aggregate(GROUP BY: ([AdventureWorks].[Sales].[SalesOrderDetail].[SalesOrderID]) DEFINE: ([Expr1005]=MAX([AdventureWorks].[Sales].[SalesOrderDetail].[ProductID])))

    | |--Clustered Index Scan(OBJECT: ([AdventureWorks].[Sales].[SalesOrderDetail].[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]), ORDERED FORWARD)

    |--Clustered Index Scan(OBJECT: ([AdventureWorks].[Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID] AS [OH]), ORDERED FORWARD)

    |--Nested Loops(Left Outer Join, OUTER REFERENCES: ([OH].[SalesOrderID], [Expr1007]) WITH UNORDERED PREFETCH)

    |--Clustered Index Scan(OBJECT: ([AdventureWorks].[Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID] AS [OH]))

    |--Compute Scalar(DEFINE: ([Expr1005]=[Expr1004]))

    |--Stream Aggregate(DEFINE: ([Expr1004]=MAX([AdventureWorks].[Sales].[SalesOrderDetail].[ProductID] as [SOD].[ProductID])))

    |--Clustered Index Seek(OBJECT: ([AdventureWorks].[Sales].[SalesOrderDetail].[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] AS [SOD]), SEEK: ([SOD].[SalesOrderID]=[AdventureWorks].[Sales].[SalesOrderHeader].[SalesOrderID] as [OH].[SalesOrderID]) ORDERED FORWARD)

    As you can see there is nothing criminal in the 2nd query but server decided to do a LOOP JOIN in this case.


    Alex Suprun

  • Alexander Suprun (11/22/2011)


    p-nut (11/22/2011)


    Alexander Suprun (11/21/2011)


    p-nut (11/21/2011)


    This is probably a good case for CROSS APPLY

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Jared

    From my experience using CROSS APPLY will force the server to do LOOP JOIN which may be not the best option. So I would suggest to use JonFox's example.

    I disagree. The execution plans in this case will probably be the same (depending on indexes and such). I like the CROSS APPLY here because it is cleaner. Of course, we don't know what the data that the OP looks like because they did not post any DDL or sample data.

    Jared

    I don't see how APPLY can be more cleaner when simple JOIN can be used. Why make things more complicated than they really are? Even MS documentation says:

    http://technet.microsoft.com/en-us/library/ms175156.aspx


    The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query.

    Surely it doesn't mean that you cannot use APPLY in other situations but it shows that this operator has been developed especially to invoke the functions. And if I see in someone's code APPLY instead of JOIN I'll try to find reasons for that, but possibly there are no such. I saw a few times when developers were misusing APPLY because it's more flexible and those who like to process data row-by-row definitely like it but sometimes it results in really bad query plans.

    I didn't have time to find a good example for CROSS APPLY, but here is a simple one with OUTER APPLY vs LEFT JOIN:

    USE AdventureWorks

    GO

    SET SHOWPLAN_TEXT ON

    GO

    SELECT OH.SalesOrderID, OH.OrderDate, OD.ProductID

    FROM Sales.SalesOrderHeader OH

    LEFT JOIN

    (

    SELECT SalesOrderID, MAX(ProductID) AS ProductID

    FROM Sales.SalesOrderDetail

    GROUP BY SalesOrderID

    ) AS OD ON OD.SalesOrderID = OH.SalesOrderID

    SELECT OH.SalesOrderID, OH.OrderDate, OD.ProductID

    FROM Sales.SalesOrderHeader OH

    OUTER APPLY

    (

    SELECT MAX(ProductID) AS ProductID

    FROM Sales.SalesOrderDetail SOD

    WHERE SOD.SalesOrderID = OH.SalesOrderID

    ) AS OD

    |--Merge Join(Right Outer Join, MERGE: ([AdventureWorks].[Sales].[SalesOrderDetail].[SalesOrderID])=([OH].[SalesOrderID]), RESIDUAL: ([AdventureWorks].[Sales].[SalesOrderDetail].[SalesOrderID]=[AdventureWorks].[Sales].[SalesOrderHeader].[SalesOrderID] as [OH].[SalesOrderID]))

    |--Compute Scalar(DEFINE: ([Expr1007]=[Expr1005]))

    | |--Stream Aggregate(GROUP BY: ([AdventureWorks].[Sales].[SalesOrderDetail].[SalesOrderID]) DEFINE: ([Expr1005]=MAX([AdventureWorks].[Sales].[SalesOrderDetail].[ProductID])))

    | |--Clustered Index Scan(OBJECT: ([AdventureWorks].[Sales].[SalesOrderDetail].[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]), ORDERED FORWARD)

    |--Clustered Index Scan(OBJECT: ([AdventureWorks].[Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID] AS [OH]), ORDERED FORWARD)

    |--Nested Loops(Left Outer Join, OUTER REFERENCES: ([OH].[SalesOrderID], [Expr1007]) WITH UNORDERED PREFETCH)

    |--Clustered Index Scan(OBJECT: ([AdventureWorks].[Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID] AS [OH]))

    |--Compute Scalar(DEFINE: ([Expr1005]=[Expr1004]))

    |--Stream Aggregate(DEFINE: ([Expr1004]=MAX([AdventureWorks].[Sales].[SalesOrderDetail].[ProductID] as [SOD].[ProductID])))

    |--Clustered Index Seek(OBJECT: ([AdventureWorks].[Sales].[SalesOrderDetail].[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] AS [SOD]), SEEK: ([SOD].[SalesOrderID]=[AdventureWorks].[Sales].[SalesOrderHeader].[SalesOrderID] as [OH].[SalesOrderID]) ORDERED FORWARD)

    As you can see there is nothing criminal in the 2nd query but server decided to do a LOOP JOIN in this case.

    I understand what you are saying in a simple example, but we cannot appropriately help the OP without the DDL. All the OP said was that it was an aggregate query. Did not say sum, max, or any other crazy arithmetical operation. Heck, for all we know it could be a running total.

    Also, CROSS APPLY was not strictly designed for functions. You cannot take Microsoft's descriptions word for word and assume purpose. There are many cases outside of functions that are better and many cases where although the results may be the same for a join versus apply where the apply is much less resource intensive and faster. Again, this depends on certain factors in the DDL, which were not given here.

    As far as "cleaner," I hate nested queries in the FROM clause. It makes script very hard to understand. That's why I say cleaner, of course, you cannot sacrifice performance for cleanliness.

    Jared

    Jared
    CE - Microsoft

  • Maybe the simplest suggestion for cleanliness and readability is a cte assuming this is a simple aggregate:

    ;WITH cte (SalesOrderID, ProductID) AS

    (SELECT SalesOrderID, MAX(ProductID) AS ProductID

    FROM Sales.SalesOrderDetail

    GROUP BY SalesOrderID)

    SELECT OH.SalesOrderID, OH.OrderDate, cte.ProductID

    FROM Sales.SalesOrderHeader OH

    INNER JOIN cte

    ON cte.SalesOrderID = OH.SalesOrderID

    Jared

    Jared
    CE - Microsoft

  • p-nut (11/22/2011)


    I understand what you are saying in a simple example, but we cannot appropriately help the OP without the DDL. All the OP said was that it was an aggregate query. Did not say sum, max, or any other crazy arithmetical operation. Heck, for all we know it could be a running total.

    Also, CROSS APPLY was not strictly designed for functions. You cannot take Microsoft's descriptions word for word and assume purpose. There are many cases outside of functions that are better and many cases where although the results may be the same for a join versus apply where the apply is much less resource intensive and faster. Again, this depends on certain factors in the DDL, which were not given here.

    As far as "cleaner," I hate nested queries in the FROM clause. It makes script very hard to understand. That's why I say cleaner, of course, you cannot sacrifice performance for cleanliness.

    Jared

    I think the discussion is already not about the OP question 🙂 I agree we can say or suggest nothing to OP now.

    If by a chance you find an example where APPLY is much less resource intensive and faster than a JOIN I will be very like to see it.

    If you have to do such kind of aggregation then you cannot avoid nested queries by simply using APPLY. It should be temp table, CTE or something like that. And yes I like using CTEs in such cases as well, especially when there are more than 1 level of nesting.

    Alex


    Alex Suprun

  • An example of the equivalency of the statements on a max compare vs. a cross apply. Adjust as needed.

    This is a pared down testbed I use for this example. Think of a price logging table that you want the most recent price put into the table prior to an invoice so you can run historical invoices with their correct sale prices or different adjustments as marketing tries to get products to move as required.

    Source Data:

    IF OBJECT_ID( 'tempdb..#Items') IS NOT NULL

    DROP TABLE #Items

    CREATE TABLE #Items

    (ItemIDINT IDENTITY(1,1) NOT NULL,

    ItemDescVARCHAR(100),

    SubcategoryIDINT,

    PRIMARY KEY NONCLUSTERED ( ItemDesc)

    )

    CREATE CLUSTERED INDEX idx_Items ON #Items ( ItemID)

    INSERT INTO #Items (ItemDesc, SubcategoryID)

    SELECTNEWID() AS ItemDesc, N/500

    FROMtempdb..Tally

    WHEREN <= 25000

    IF OBJECT_ID( 'tempdb..#PriceHistoryTable') IS NOT NULL

    DROP TABLE #PriceHistoryTable

    CREATE TABLE #PriceHistoryTable

    (AuditIDINT IDENTITY(1,1) NOT NULL,

    ItemIDINT NOT NULL,

    ItemPrice DECIMAL(19,4),

    NumData2 INT,

    UpdateDate DATETIME,

    ReasonForChange VARCHAR(200),

    CommentField VARCHAR(2000)

    )

    CREATE CLUSTERED INDEX idx_TestData

    ON #PriceHistoryTable

    ( UpdateDate, ItemID)

    CREATE NONCLUSTERED INDEX idx_TestData2 ON #PriceHistoryTable

    (ItemID, UpdateDate Desc)

    INSERT INTO #PriceHistoryTable (ItemID, ItemPrice, NumData2, UpdateDate, ReasonForChange, CommentField)

    SELECT

    t.N%2500 AS ItemID, -- Adjust the modulo to decide the data density.

    (t.n + 10) * 2 / 3 AS ItemPrice, -- Simply here as a filler

    t.n AS NumData2, -- Some more filler

    DATEADD( yyyy, 90, DATEADD( ms, ABS( CHECKSUM(NEWID()))%1000, DATEADD( s, ABS( CHECKSUM(NEWID()))%86400, ABS( CHECKSUM(NEWID())% (365 * 20)) )) ) AS UpdateDate, -- Give me a random date

    NULL AS ReasonForChange,

    REPLICATE(( '@', 500) AS CommentField

    FROM

    tempdb..Tally AS t,

    tempdb..Tally AS t2

    WHERE

    t.N <= 50000

    AND t2.N <= 20 --Control your testing rowcount here. 2 for 100k row test, 20 for 1mill.

    Testable code. Remember to run more than once to allow for cache'ing and compiles.

    SET STATISTICS IO, TIME ON

    ---------------------------- MIN/MAX Method

    ; WITH cte AS

    ( SELECT

    ItemId,

    Max( UpdateDate) AS MaxDate

    FROM

    #PriceHistoryTable

    GROUP BY

    ItemId

    )

    SELECT

    td.ItemId,

    td.ItemPrice,

    td.NumData2,

    td.UpdateDate

    FROM

    #PriceHistoryTable AS td

    JOIN

    cte

    ON td.ItemId = cte.ItemId

    AND td.Updatedate = cte.MaxDate

    -----------------------Cross Apply Method

    SELECT

    dst.ItemID,

    ca.ItemPrice,

    ca.NumData2,

    ca.UpdateDate

    FROM

    (SELECT DISTINCT ItemID FROM #PriceHistoryTable) AS dst

    CROSS APPLY

    (SELECT TOP 1 ItemID, ItemPrice, NumData2, UpdateDate

    FROM #PriceHistoryTable AS td2

    WHERE td2.ItemID = dst.ItemID

    ORDER BY UpdateDate DESC) AS ca

    SET STATISTICS IO, TIME OFF

    I've attached my 'full test' code in case you're curious. This is for an article I'll eventually do in comparisons of index mechanics while working with application methods for logging table locatability like this. It's far from complete for my final intentions but I figured you might be curious.

    For a run time idea of the time this takes:

    (2500 row(s) affected)

    Table '#PriceHistoryTable__________________________________________________________________________________________________000000000008'. Scan count 2501, logical reads 13249, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 265 ms, elapsed time = 397 ms.

    (2500 row(s) affected)

    Table '#PriceHistoryTable__________________________________________________________________________________________________000000000008'. Scan count 2501, logical reads 19914, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 172 ms, elapsed time = 270 ms.

    Note the Cross Apply is faster here. This has to do with data density and index optimizations, I've been banging on this particular model for a while using it as a tool to dig at the engine.

    If you review the Execution plans, of primary interest is that A) The cross apply lies about the cost in batch relation... even though it has more reads. and B) The cross apply includes a second nested loop construction.

    It is still more performant. In this case. These two techniques almost always beat out the row_number methods, except for particular data densities. Which one to use is under a lot of circumstances pretty much a coin flip. If you want the particulars you'll have to wait for my article. 😉


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • If by a chance you find an example where APPLY is much less resource intensive and faster than a JOIN I will be very like to see it.

    Check this out: http://social.msdn.microsoft.com/Forums/en/transactsql/thread/6e39d29c-11df-45b5-8a7d-aab536e93acc

    and this:

    http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/

    "While most queries which employ CROSS APPLY can be rewritten using an INNER JOIN, CROSS APPLY can yield better execution plan and better performance, since it can limit the set being joined yet before the join occurs."

    However, I do agree with you that in the case of the unattending OP and the information given that a cte is probably better than a CROSS APPLY.

    Jared

    Jared
    CE - Microsoft

  • If you want the particulars you'll have to wait for my article. 😉

    I can't wait!

    Jared

    Jared
    CE - Microsoft

  • Thanks guys, very interesting stuff, I am definitely going to play with these scripts provided.

    If you want the particulars you'll have to wait for my article. 😉

    I'll be waiting as well!


    Alex Suprun

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply