Using Outer Joins

  • Comments posted to this topic are about the item Using Outer Joins

  • Bad style of the query (I know, it is intended here, but you should never write it this way in real code).

    Instead of

    SELECT 
    c.CustomerName, COUNT(oh.OrderID)
    FROM dbo.Customer AS c
    LEFT JOIN dbo.OrderHeader AS oh ON oh.CustomerID = c.CustomerID
    WHERE oh.Orderdate > '2024/04/01'
    GROUP BY c.CustomerName

    you should better use

    SELECT 
    c.CustomerName, oh.cnt
    FROM dbo.Customer AS c
    LEFT JOIN (SELECT oh.CustomerId, COUNT(*) AS cnt
    FROM dbo.OrderHeader AS oh
    WHERE oh.Orderdate > '2024/04/01'
    ) AS oh
    ON oh.CustomerID = c.CustomerID

    which would be faster (since it aggregates first and just the smaller table) and prevents the common mistake to place a join condition (limiting the Orderdate) in the WHERE instead the ON (or in this case the subquery itself, where it belongs).

    Drawback or bonus of the fixed query: the original one would aggregate two Millers or two Smiths together. This MAY be correct in some circumstances, but usually not when it comes to customer names.

    So it usually would have been correct to group by CustomerID and CustomerName in the original query.

    If this is one of the rare cases, where all Millers needs to be aggregated, you could add the GROUP BY in the second query again and use SUM(oh.cnt) as cnt in the SELECT part. It would now be grouping twice, but usually this will be still faster.

    God is real, unless declared integer.

  • Thomas Franz wrote:

    Bad style of the query (I know, it is intended here, but you should never write it this way in real code).

    Instead of

    SELECT 
    c.CustomerName, COUNT(oh.OrderID)
    FROM dbo.Customer AS c
    LEFT JOIN dbo.OrderHeader AS oh ON oh.CustomerID = c.CustomerID
    WHERE oh.Orderdate > '2024/04/01'
    GROUP BY c.CustomerName

    you should better use

    SELECT 
    c.CustomerName, oh.cnt
    FROM dbo.Customer AS c
    LEFT JOIN (SELECT oh.CustomerId, COUNT(*) AS cnt
    FROM dbo.OrderHeader AS oh
    WHERE oh.Orderdate > '2024/04/01'
    ) AS oh
    ON oh.CustomerID = c.CustomerID

    which would be faster (since it aggregates first and just the smaller table) and prevents the common mistake to place a join condition (limiting the Orderdate) in the WHERE instead the ON (or in this case the subquery itself, where it belongs).

    Drawback or bonus of the fixed query: the original one would aggregate two Millers or two Smiths together. This MAY be correct in some circumstances, but usually not when it comes to customer names.

    So it usually would have been correct to group by CustomerID and CustomerName in the original query.

    If this is one of the rare cases, where all Millers needs to be aggregated, you could add the GROUP BY in the second query again and use SUM(oh.cnt) as cnt in the SELECT part. It would now be grouping twice, but usually this will be still faster.

    Sorry - but your 'corrected' query is not actually correct.  You need to add a GROUP BY in the derived table since you are including the CustomerID in the results and want to return the count by customer.

    I am also not sure that construct would perform better than fixing the first query to remove the WHERE and include that in the LEFT JOIN condition.

    SELECT 
    c.CustomerName, COUNT(oh.OrderID)
    FROM dbo.Customer AS c
    LEFT JOIN dbo.OrderHeader AS oh ON oh.CustomerID = c.CustomerID
    AND oh.Orderdate > '2024/04/01'
    GROUP BY c.CustomerName

    This construct may just end up being the same execution plan as using the derived table.  It would depend on how SQL Server generated the plan - since it could move things around as needed.

    With that said - this may be a better approach:

    SELECT 
    c.CustomerName, TotalOrders = coalesce(ord.OrderCount, 0)
    FROM dbo.Customer AS c
    OUTER APPLY (SELECT OrderCount = count(*)
    FROM dbo.OrderHeader AS oh
    WHERE oh.CustomerID = c.CustomerID
    AND oh.Orderdate > '2024/04/01') AS ord;

    Again - it would depend on indexing and how SQL Server generates the execution plan.  This one also takes care of NULL - returning zero for any customers that don't have any orders.

    But - in regards to the actual question - as written the query is showing what happens when you include the outer referenced table in the WHERE clause.  The purpose of the question was to highlight how rows are removed even when you use an outer join and might expect those rows to be in the results.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • I concur with the error found by Jeff Williams.  I also concur with the OUTER APPLY method being faster.  With the correct index, it's actually twice as fast as the LEFT JOIN method on a million rows in the OrderHeader table. I also converted the tables to Temp Tables because I also do drops when testing.

    Here's the code I used to  create and populate the tables.  I'll let you create whatever indexes you think are needed.

    DROP TABLE IF EXISTS #Customer;
    GO
    CREATE TABLE #Customer
    (CustomerID INT, CustomerName VARCHAR(10))
    ;
    GO
    INSERT INTO #Customer WITH (TABLOCK)
    (CustomerID,CustomerName)
    VALUES (1,'Steve')
    ,(2,'Andy')
    ,(3,'Brian')
    ,(4,'Allen')
    ,(5,'Devin')
    ,(6,'Sally')
    ;
    GO
    DROP TABLE IF EXISTS #OrderHeader;
    GO
    CREATE TABLE #OrderHeader
    (OrderID INT IDENTITY(1,1), CustomerID INT, OrderDate DATE)
    ;
    -- INSERT INTO #OrderHeader WITH (TABLOCK)
    -- (OrderID,CustomerID,OrderDate)
    -- VALUES (1,'2024-02-01')
    -- ,(1,'2024-03-01')
    -- ,(3,'2024-04-01')
    -- ,(4,'2024-05-01')
    -- ,(4,'2024-05-01')
    -- ,(3,'2024-06-07')
    -- ,(2,'2024-04-07')
    --;
    --GO
    INSERT INTO #OrderHeader WITH (TABLOCK)
    (CustomerID,OrderDate)
    SELECT CustomerID = ABS(CHECKSUM(NEWID())%6)+1
    ,OrderDate = DATEADD(dd,ABS(CHECKSUM(NEWID())%DATEDIFF(dd,'2024','2025')),'2024')
    FROM JBMTestSSD.dbo.fnTally(1,1000000) --You need to change this line to suite your system.
    ;
    GO
    SELECT TOP 1000 * FROM #OrderHeader

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This is a topic for which I'd very much appreciate a summary to wrap this up. This is new to me. Something in the form of

    - instead of this

    - do this.

    Thanks.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply