November 6, 2018 at 7:42 am
All,
I'm working on building a semantic layer within an ETL database to improve our end-to-end performance. As part of this, I'm generating a layer on top of the extraction layer that produces a number of aggregations over smaller sets (I use that term loosely) of data on a different schema, to test performance improvements (if any) compared to the existing plan.
Without getting too far into the new design, one of my queries generates a sum aggregation as below:
SELECT
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate AS CurrentDate,
SUM(trans.Capital * (CASE WHEN batch.BatchGLDate <= fundDates.theDate AND batch.BatchStatus = 'Posted' THEN 1 ELSE 0 END)) AS CurrentValue
FROM TransAlloc trans
INNER JOIN
Batch batch ON trans.BatchID = batch.BatchID
INNER JOIN
Fund fund ON batch.LEID = fund.FundID
INNER JOIN
FundDates fundDates ON fund.FundID = fundDates.FundID
WHERE batch.batchStatus IN ( 'Posted', 'Held')
AND batch.BatchGLDate <= fundDates.theDate
GROUP BY
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate
Trans table: 475k records, clustered index on TranID
Batch table: 365k records, clustered index on BatchID
Fund table: 1,344 records, clustered index on FundID
(I've also created other covering indexes on these tables.)
FundDates table: 417,984 records - populated with every date since '2017-12-31' for each fund:
;WITH allDates AS
(SELECT CAST('2017-12-31' AS DATETIME) as theDate
UNION ALL
SELECT DATEADD(day, 1, allDates.theDate)
FROM allDates
WHERE DATEADD(day, 1, theDate) <= GETDATE()
)
SELECT fund.FundID, d.theDate
INTO FundDates
FROM Fund fund,
allDates d
OPTION (MAXRECURSION 0)
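For reference, the same calendar can also be built without a recursive CTE (recursion processes one row per iteration). A tally-style sketch, assuming the same Fund table and column names as above:

```sql
-- Build one row per day since 2017-12-31 from a tally of sequential
-- integers, then cross join to Fund; no recursion, no MAXRECURSION hint.
;WITH tally AS
(
    SELECT TOP (DATEDIFF(day, '2017-12-31', GETDATE()) + 1)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_columns a
    CROSS JOIN sys.all_columns b
)
SELECT fund.FundID,
       DATEADD(day, t.n, CAST('2017-12-31' AS DATETIME)) AS theDate
INTO FundDates
FROM Fund fund
CROSS JOIN tally t;
```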
The initial rollout runs extremely slowly because the code has to evaluate the conditions in the CASE expression for every batch date against each date in the FundDates table (as well as whether the batch status = 'Posted').
I'm wondering if anyone could recommend another way of doing this.
Bear in mind, this is just one aggregation, I have several others to be added to the query.
Other options explored:
1. Creating this as a table-valued function with FundID and date as parameters. It returns very quickly but only covers one date and one fund. I haven't yet tried calling it from a dynamic SQL query to test performance, but this option would let us re-run specific dates for specific funds.
2. Create the aggregation as a sub-query and join it to the main query.
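Option 2 would look something like this sketch (column names as in the query above; it assumes 'Held' rows only ever contribute zero, so they can be filtered out up front). The <= join is still there, but it runs over one pre-aggregated row per fund/investor/GL date rather than per transaction:

```sql
-- Option 2 sketch: pre-aggregate posted capital per fund, investor and
-- GL date, then join the daily totals to FundDates and sum everything
-- up to each calendar date.
SELECT fund.SourceName,
       fund.FundID,
       agg.InvestorID,
       fundDates.theDate AS CurrentDate,
       SUM(agg.PostedCapital) AS CurrentValue
FROM Fund fund
INNER JOIN FundDates fundDates
        ON fund.FundID = fundDates.FundID
INNER JOIN
(
    -- one row per fund / investor / GL date instead of per transaction
    SELECT batch.LEID, trans.InvestorID, batch.BatchGLDate,
           SUM(trans.Capital) AS PostedCapital
    FROM TransAlloc trans
    INNER JOIN Batch batch ON trans.BatchID = batch.BatchID
    WHERE batch.BatchStatus = 'Posted'
    GROUP BY batch.LEID, trans.InvestorID, batch.BatchGLDate
) agg ON agg.LEID = fund.FundID
     AND agg.BatchGLDate <= fundDates.theDate
GROUP BY fund.SourceName, fund.FundID, agg.InvestorID, fundDates.theDate;
```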
I can post the table definitions if need be but mostly I'm just looking for a nudge in the right direction in terms of performance benefit with such a query. Any help is appreciated.
Regards
November 6, 2018 at 9:29 am
It's very difficult to troubleshoot performance problems without an execution plan (preferably the actual rather than the estimated one). I did look at your query and saw that you have a triangular join, which contributes to the performance problems. I've tried to rewrite the query to get rid of it.
;
WITH Batch_totals AS
(
SELECT
batch.batchID,
batch.BatchGLDate,
batch.batchStatus,
CASE WHEN batch.batchStatus = 'Held' THEN 0 ELSE COUNT(batch.batchStatus) OVER(PARTITION BY batch.batchID, batch.batchStatus ORDER BY BatchGLDate ROWS UNBOUNDED PRECEDING) END AS CurrentCount
FROM
Batch
WHERE batch.batchStatus IN ( 'Posted', 'Held')
)
SELECT
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate AS CurrentDate,
trans.Capital * batch.CurrentCount AS CurrentValue
FROM TransAlloc trans
INNER JOIN
Fund fund ON trans.LEID = fund.FundID -- assumption: TransAlloc carries the LE key; the APPLY alias below can't be referenced in this ON clause
INNER JOIN
FundDates fundDates ON fund.FundID = fundDates.FundID
CROSS APPLY
(
SELECT TOP (1) CurrentCount
FROM
Batch_totals
WHERE trans.BatchID = batch_totals.BatchID
AND batch_totals.BatchGLDate <= fundDates.theDate
ORDER BY batch_totals.batchStatus DESC, batch_totals.BatchGLDate DESC
) batch
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
November 6, 2018 at 10:25 am
Triangular Join - I knew there was a name for that (I wrote my result set out on a piece of paper and it looked just like this).
Thanks Drew, I've been out of the SQL game a while and didn't know about things like ROWS UNBOUNDED PRECEDING for these kinds of running totals. If I'm reading your code correctly, the CTE is a holder for all batches in scope, with all Held batches set to 0 and all Posted batches set to a running count, ordered by the batch date.
I think I can sub in an ELSE 1 in the CASE statement instead of the PARTITION BY: since BatchSDSID is unique, it won't have more than one status, so I don't need a running total of that here.
However, the CTE makes sense as it gives the specific subset of batches that are in scope for the query (I discovered that my data has future dates that aren't in scope).
The main performance drag is the SORT operation that comes into the execution plan when I SUM the trans.Capital amount. Let me play around with your version a bit, see what performance looks like and I'll update with the execution plan.
Thanks again for the help
November 6, 2018 at 11:49 am
mitzyturbo - Tuesday, November 6, 2018 10:25 AM (quoted in full above)
No, you shouldn't use "ELSE 1". The whole purpose of the windowed function is to replace the SUM. I had to guess at the formula because you didn't provide sample data or expected results.
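To illustrate what the windowed COUNT is doing, here's a standalone sketch with made-up rows (not your data):

```sql
-- Standalone illustration: a running count per partition using
-- COUNT(*) OVER with ROWS UNBOUNDED PRECEDING.
WITH sample_batches (BatchID, BatchGLDate, BatchStatus) AS
(
    SELECT 1, CAST('2018-01-01' AS DATE), 'Posted' UNION ALL
    SELECT 1, CAST('2018-01-02' AS DATE), 'Posted' UNION ALL
    SELECT 2, CAST('2018-01-01' AS DATE), 'Held'
)
SELECT BatchID, BatchGLDate, BatchStatus,
       COUNT(*) OVER (PARTITION BY BatchID, BatchStatus
                      ORDER BY BatchGLDate
                      ROWS UNBOUNDED PRECEDING) AS RunningCount
FROM sample_batches;
-- BatchID 1 gets RunningCount 1 then 2; BatchID 2 gets 1.
```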
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
November 6, 2018 at 7:46 pm
mitzyturbo - Tuesday, November 6, 2018 10:25 AM (quoted in full above)
If anyone has more interest in "Triangular Joins", here's the article where that graphic came from.
http://www.sqlservercentral.com/articles/T-SQL/61539/
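For a quick feel for why the pattern is expensive, here's a minimal sketch (hypothetical one-column rowset): each row joins to every row with a lesser-or-equal value, so the result grows with the square of the row count.

```sql
-- Minimal triangular join: every row pairs with all rows whose value
-- is <= its own, so N rows produce N*(N+1)/2 pairs.
WITH nums (n) AS
(
    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
)
SELECT a.n AS outer_n, b.n AS inner_n
FROM nums a
INNER JOIN nums b ON b.n <= a.n;
-- 1 + 2 + 3 + 4 = 10 rows from only 4 input rows.
```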
--Jeff Moden
Change is inevitable... Change for the better is not.
November 7, 2018 at 8:22 am
I got a big improvement by introducing a new index on BatchGLDate and BatchStatus (disregard the covering index name):
CREATE NONCLUSTERED INDEX [IDX_NI_CA_Batch_CoverIndex2] ON [dbo].[Batch]
(
[BatchGLDate] ASC
,[BatchStatus]
)
INCLUDE ([BatchSDSID]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Right now, it's taking just over 44 seconds to generate just under 60k rows for one fund between '2017-12-31' and '2018-02-01'.
Execution plan attached, the main bottlenecks are the SORTs used in both the CTE and the CROSS APPLY query.
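I'm also planning to try an index keyed to match the window function's PARTITION BY and ORDER BY, so the rows arrive pre-sorted (a sketch only; index name hypothetical, and I'd verify against the plan):

```sql
-- Hypothetical supporting index: key matches the window's PARTITION BY
-- (BatchSDSID) then its ORDER BY (BatchGLDate), with the status column
-- included so the CTE's scan is covered.
CREATE NONCLUSTERED INDEX IX_Batch_SDSID_GLDate
ON dbo.Batch (BatchSDSID ASC, BatchGLDate ASC)
INCLUDE (BatchStatus);
```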
On the back of this, I tried to take some of the burden off the CROSS APPLY query by adding a row number in the CTE and applying a filter for the first record, avoiding the use of a TOP 1:
;
WITH Batch_totals AS
(
SELECT
batch.batchSDSID,
batch.BatchGLDate,
batch.batchStatus,
CASE WHEN batch.batchStatus = 'Held' THEN 0
ELSE COUNT(batch.batchStatus)
OVER(PARTITION BY batch.batchSDSID, batch.batchStatus ORDER BY BatchGLDate ROWS UNBOUNDED PRECEDING)
END AS CurrentCount,
ROW_NUMBER() OVER (PARTITION BY batch.batchSDSID ORDER BY BatchGLDate DESC, batch.batchStatus DESC) AS RowNum
FROM
SDSDW.Batch batch
WHERE batch.batchStatus IN ( 'Posted', 'Held')
AND batch.batchGLDate BETWEEN '2017-12-31' AND '2018-02-01'
)
SELECT
fund.SourceName,
fund.FundID,
trans.InvestorID,
fundDates.theDate AS CurrentDate,
trans.TotalPartnersCapital * batch.CurrentCount AS CurrentValue
FROM SDSDW.TransAlloc trans
INNER JOIN SDSDW.Fund fund
ON trans.LESDSID = fund.FundSDSID
INNER JOIN SDSDW.FundDates fundDates
ON fund.FundSDSID = fundDates.FundSDSID
CROSS APPLY
(
SELECT CurrentCount
FROM
Batch_totals
WHERE trans.BatchSDSID = batch_totals.BatchSDSID
AND batch_totals.BatchGLDate <= fundDates.theDate
AND Rownum = 1
) batch
WHERE trans.LESDSID = 891
AND fundDates.theDate BETWEEN '2017-12-31' AND '2018-02-01'
ORDER BY fundDates.theDate
Huge gain - it only took 1 second to return the 60k rows, and the totals match.
I'll expand this out to see how it handles the different calcs (some will require different criteria), but this has been a huge help, Drew.
Thanks again