November 6, 2018 at 7:42 am
All,
I'm working on building a semantic layer within an ETL database to improve our end-to-end performance. As part of this, I'm generating a layer on top of the extraction layer that produces a number of aggregations over smaller sets (I use that term loosely) of data on a different schema, to test performance improvements (if any) compared to the existing plan.
Without getting too far into the new design, one of my queries generates a sum aggregation as below:
SELECT
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate AS CurrentDate,
SUM(trans.Capital * (CASE WHEN batch.BatchGLDate <= fundDates.theDate AND batch.BatchStatus = 'Posted' THEN 1 ELSE 0 END)) AS CurrentValue
FROM TransAlloc trans
INNER JOIN
Batch batch ON trans.BatchID = batch.BatchID
INNER JOIN
Fund fund ON batch.LEID = fund.FundID
INNER JOIN
FundDates fundDates ON fund.FundID = fundDates.FundID
WHERE batch.batchStatus IN ( 'Posted', 'Held')
AND batch.BatchGLDate <= fundDates.theDate
GROUP BY
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate
Trans table: 475k records, clustered index on TranID
Batch table: 365k records, clustered index on BatchID
Fund table: 1,344 records, clustered index on FundID
(I've also created other covering indexes on these tables.)
FundDates table: 417,984 records - populated with every date since '2017-12-31' for each fund:
;WITH allDates AS
(SELECT CAST('2017-12-31' AS DATETIME) as theDate
UNION ALL
SELECT DATEADD(day, 1, allDates.theDate)
FROM allDates
WHERE DATEADD(day, 1, theDate) <= GETDATE()
)
SELECT fund.FundID, d.theDate
INTO FundDates
FROM Fund fund,
allDates d
OPTION (MAXRECURSION 0)
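For reference, the same calendar can also be built without a recursive CTE (recursion processes one row per iteration). A tally-style sketch, assuming the same Fund table and column names as above:

```sql
-- Build one row per day since 2017-12-31 from a tally of sequential
-- integers, then cross join to Fund; no recursion, no MAXRECURSION hint.
;WITH tally AS
(
    SELECT TOP (DATEDIFF(day, '2017-12-31', GETDATE()) + 1)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_columns a
    CROSS JOIN sys.all_columns b
)
SELECT fund.FundID,
       DATEADD(day, t.n, CAST('2017-12-31' AS DATETIME)) AS theDate
INTO FundDates
FROM Fund fund
CROSS JOIN tally t;
```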
The initial rollout runs extremely slowly because the code has to evaluate the conditions in the CASE expression for every batch date against each date in the FundDates table (as well as whether the batch status = 'Posted').
I'm wondering if anyone could recommend another way of doing this.
Bear in mind, this is just one aggregation, I have several others to be added to the query.
Other options explored:
1. Creating this as a table-valued function with FundID and date as parameters. It returns very quickly but only covers one date and one fund. I haven't yet tried calling it from a dynamic SQL query to test performance, but this option would let us re-run specific dates for specific funds.
2. Create the aggregation as a sub-query and join it to the main query.
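Option 2 would look something like this sketch (column names as in the query above; it assumes 'Held' rows only ever contribute zero, so they can be filtered out up front). The <= join is still there, but it runs over one pre-aggregated row per fund/investor/GL date rather than per transaction:

```sql
-- Option 2 sketch: pre-aggregate posted capital per fund, investor and
-- GL date, then join the daily totals to FundDates and sum everything
-- up to each calendar date.
SELECT fund.SourceName,
       fund.FundID,
       agg.InvestorID,
       fundDates.theDate AS CurrentDate,
       SUM(agg.PostedCapital) AS CurrentValue
FROM Fund fund
INNER JOIN FundDates fundDates
        ON fund.FundID = fundDates.FundID
INNER JOIN
(
    -- one row per fund / investor / GL date instead of per transaction
    SELECT batch.LEID, trans.InvestorID, batch.BatchGLDate,
           SUM(trans.Capital) AS PostedCapital
    FROM TransAlloc trans
    INNER JOIN Batch batch ON trans.BatchID = batch.BatchID
    WHERE batch.BatchStatus = 'Posted'
    GROUP BY batch.LEID, trans.InvestorID, batch.BatchGLDate
) agg ON agg.LEID = fund.FundID
     AND agg.BatchGLDate <= fundDates.theDate
GROUP BY fund.SourceName, fund.FundID, agg.InvestorID, fundDates.theDate;
```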
I can post the table definitions if need be but mostly I'm just looking for a nudge in the right direction in terms of performance benefit with such a query. Any help is appreciated.
Regards
November 6, 2018 at 9:29 am
It's very difficult to troubleshoot performance problems without an execution plan (preferably the actual rather than the estimated one). I did look at your query and saw that you have a triangular join, which contributes to the performance problems. I've tried to rewrite the query to get rid of it.
;
WITH Batch_totals AS
(
SELECT
batch.batchID,
batch.BatchGLDate,
batch.batchStatus,
CASE WHEN batch.batchStatus = 'Held' THEN 0 ELSE COUNT(batch.batchStatus) OVER(PARTITION BY batch.batchID, batch.batchStatus ORDER BY BatchGLDate ROWS UNBOUNDED PRECEDING) END AS CurrentCount
FROM
Batch
WHERE batch.batchStatus IN ( 'Posted', 'Held')
)
SELECT
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate AS CurrentDate,
trans.Capital * batch.CurrentCount AS CurrentValue
FROM TransAlloc trans
INNER JOIN
Fund fund ON trans.LEID = fund.FundID -- assumption: TransAlloc carries the LE key; the APPLY alias below can't be referenced in this ON clause
INNER JOIN
FundDates fundDates ON fund.FundID = fundDates.FundID
CROSS APPLY
(
SELECT TOP (1) CurrentCount
FROM
Batch_totals
WHERE trans.BatchID = batch_totals.BatchID
AND batch_totals.BatchGLDate <= fundDates.theDate
ORDER BY batch_totals.batchStatus DESC, batch_totals.BatchGLDate DESC
) batch
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
November 6, 2018 at 10:25 am
Triangular Join - I knew there was a name for that (I wrote my result set out on a piece of paper and it looked just like this).
Thanks Drew, I've been out of the SQL game a while and didn't know about things like ROWS UNBOUNDED PRECEDING for these kinds of running totals. If I'm reading your code correctly, the CTE is a holder for all batches in scope, with all Held batches set to 0 and all Posted batches set to a running count, ordered by the batch date.
I think I can sub in an ELSE 1 in the CASE statement instead of the PARTITION BY: since BatchSDSID is unique, it won't have more than one status, so I don't need a running total of that here.
However, the CTE makes sense as it gives the specific subset of batches that are in scope for the query (I discovered that my data has future dates that aren't in scope).
The main performance drag is the SORT operation that comes into the execution plan when I SUM the trans.Capital amount. Let me play around with your version a bit, see what performance looks like and I'll update with the execution plan.
Thanks again for the help
November 6, 2018 at 11:49 am
mitzyturbo - Tuesday, November 6, 2018 10:25 AM (quoted in full above)
No, you shouldn't use "ELSE 1". The whole purpose of the windowed function is to replace the SUM. I had to guess at the formula because you didn't provide sample data or expected results.
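To illustrate what the windowed COUNT is doing, here's a standalone sketch with made-up rows (not your data):

```sql
-- Standalone illustration: a running count per partition using
-- COUNT(*) OVER with ROWS UNBOUNDED PRECEDING.
WITH sample_batches (BatchID, BatchGLDate, BatchStatus) AS
(
    SELECT 1, CAST('2018-01-01' AS DATE), 'Posted' UNION ALL
    SELECT 1, CAST('2018-01-02' AS DATE), 'Posted' UNION ALL
    SELECT 2, CAST('2018-01-01' AS DATE), 'Held'
)
SELECT BatchID, BatchGLDate, BatchStatus,
       COUNT(*) OVER (PARTITION BY BatchID, BatchStatus
                      ORDER BY BatchGLDate
                      ROWS UNBOUNDED PRECEDING) AS RunningCount
FROM sample_batches;
-- BatchID 1 gets RunningCount 1 then 2; BatchID 2 gets 1.
```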
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
November 6, 2018 at 7:46 pm
mitzyturbo - Tuesday, November 6, 2018 10:25 AM (quoted in full above)
If anyone has more interest in "Triangular Joins", here's the article where that graphic came from.
http://www.sqlservercentral.com/articles/T-SQL/61539/
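For a quick feel for why the pattern is expensive, here's a minimal sketch (hypothetical one-column rowset): each row joins to every row with a lesser-or-equal value, so the result grows with the square of the row count.

```sql
-- Minimal triangular join: every row pairs with all rows whose value
-- is <= its own, so N rows produce N*(N+1)/2 pairs.
WITH nums (n) AS
(
    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
)
SELECT a.n AS outer_n, b.n AS inner_n
FROM nums a
INNER JOIN nums b ON b.n <= a.n;
-- 1 + 2 + 3 + 4 = 10 rows from only 4 input rows.
```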
--Jeff Moden
Change is inevitable... Change for the better is not.
November 7, 2018 at 8:22 am
I got a big improvement by introducing a new index on BatchGLDate and BatchStatus (disregard the covering index name):
CREATE NONCLUSTERED INDEX [IDX_NI_CA_Batch_CoverIndex2] ON [dbo].[Batch]
(
[BatchGLDate] ASC
,[BatchStatus]
)
INCLUDE ([BatchSDSID]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Right now, it's taking just over 44 seconds to generate just under 60k rows for one fund between '2017-12-31' and '2018-02-01'.
Execution plan attached, the main bottlenecks are the SORTs used in both the CTE and the CROSS APPLY query.
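I'm also planning to try an index keyed to match the window function's PARTITION BY and ORDER BY, so the rows arrive pre-sorted (a sketch only; index name hypothetical, and I'd verify against the plan):

```sql
-- Hypothetical supporting index: key matches the window's PARTITION BY
-- (BatchSDSID) then its ORDER BY (BatchGLDate), with the status column
-- included so the CTE's scan is covered.
CREATE NONCLUSTERED INDEX IX_Batch_SDSID_GLDate
ON dbo.Batch (BatchSDSID ASC, BatchGLDate ASC)
INCLUDE (BatchStatus);
```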
On the back of this, I tried to take some of the burden off the CROSS APPLY query by adding a row number in the CTE and applying a filter for the first record, avoiding the use of a TOP 1:
;
WITH Batch_totals AS
(
SELECT
batch.batchSDSID,
batch.BatchGLDate,
batch.batchStatus,
CASE WHEN batch.batchStatus = 'Held' THEN 0
ELSE COUNT(batch.batchStatus)
OVER(PARTITION BY batch.batchSDSID, batch.batchStatus ORDER BY BatchGLDate ROWS UNBOUNDED PRECEDING)
END AS CurrentCount,
ROW_NUMBER() OVER (PARTITION BY batch.batchSDSID ORDER BY BatchGLDate DESC, batch.batchStatus DESC) AS RowNum
FROM
SDSDW.Batch batch
WHERE batch.batchStatus IN ( 'Posted', 'Held')
AND batch.batchGLDate BETWEEN '2017-12-31' AND '2018-02-01'
)
SELECT
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate AS CurrentDate,
trans.TotalPartnersCapital * batch.CurrentCount AS CurrentValue
FROM SDSDW.TransAlloc trans
INNER JOIN SDSDW.Fund fund
ON trans.LESDSID = fund.FundSDSID
INNER JOIN SDSDW.FundDates fundDates
ON fund.FundSDSID = fundDates.FundSDSID
CROSS APPLY
(
SELECT CurrentCount
FROM
Batch_totals
WHERE trans.BatchSDSID = batch_totals.BatchSDSID
AND batch_totals.BatchGLDate <= fundDates.theDate
AND Rownum = 1
) batch
WHERE trans.LESDSID = 891
AND fundDates.theDate BETWEEN '2017-12-31' AND '2018-02-01'
ORDER BY fundDates.theDate
Huge gain - it only took 1 second to return the 60k rows, and the totals match.
I'll expand this out to see how it handles the different calcs (some will require different criteria), but this has been a huge help, Drew.
Thanks again