May 23, 2013 at 10:00 am
My system has a number of history tables that would give the date that a value has changed.
I found a very close analogue to this in Adventure works looking up products and seeing what the most recent value was as of a certain date.
I can think of two ways to do this:
declare @ReportingDate DATE = '9/7/2004'
-- Method 1
SELECT
Name
,ReportDate = @ReportingDate
,MostRecentOrderDetailId = (SELECT TOP 1 SalesOrderDetailId FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = Product.ProductID AND SalesOrderDetail.ModifiedDate <= @ReportingDate ORDER BY ModifiedDate DESC)
FROM Production.Product Product
-- Method 2
SELECT
Name
,ReportDate = @ReportingDate
,MostRecent.MostRecentOrderDetailId
FROM Production.Product Product
OUTER APPLY (SELECT TOP 1 MostRecentOrderDetailId = SalesOrderDetailId FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = Product.ProductID AND SalesOrderDetail.ModifiedDate <= @ReportingDate ORDER BY ModifiedDate DESC) MostRecent
Both produce a near identical query plan. Method 2 would be a clear winner of the two if I wanted to retrieve more than a single value from the detail table (say price and quantity).
Is there a better way to do this overall without refactoring how the data is stored?
Best performance I can think of would be to have a daily reporting table that precalculates the requested values cross applied with a dates table to give the value for every date. Then it would be a simple join rather than a subquery/cross apply.
Even using a reporting table (DailySalesOrderDetail), these queries need to be optimized because they'll be needed to recalculate the the reporting whenever the SalesOrderDetail table changes.
IF OBJECT_ID('tempdb..#DailyOrderDetail') IS NOT NULL DROP TABLE #DailyOrderDetail
SELECT DateFull, Name, MostRecentOrderDetailId
INTO #DailyOrderDetail
FROM edimain.dbo.DateLookup
CROSS APPLY (SELECT
Name
,MostRecentOrderDetailId = (SELECT TOP 1 SalesOrderDetailId FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = Product.ProductID AND SalesOrderDetail.ModifiedDate <= DateFull ORDER BY ModifiedDate DESC)
FROM Production.Product Product) MostRecent
WHERE DateFull BETWEEN '1/1/2002' and '1/1/2005'
CREATE NONCLUSTERED INDEX IX_DailyOrderDetail_DateFull
ON #DailyOrderDetail ([DateFull])
INCLUDE ([Name],[MostRecentOrderDetailId])
declare @ReportingDate DATE = '9/7/2004'
-- Method 3
SELECT
d.Name
,ReportDate = @ReportingDate
,MostRecentOrderDetailId = d.MostRecentOrderDetailId
FROM Production.Product Product
INNER JOIN #DailyOrderDetail D on D.DateFull = @ReportingDate AND D.Name = Product.Name
Given the reporting table already existing, Method 3 (just a join to a denormalized reporting table) doesn't even register as a single percentage point in comparison to Method 1 and Method 2.
So if I could get Method 1 or Method 2 optimized further, that would be an interesting exercise. My application uses it quite a bit an it would be useful for optimizing a reporting table to use Method 3 as well.
Is there a term for what sort of query this is?
May 23, 2013 at 10:12 am
tl;dr
1) I think pre-aggregated data will always smoke real-time queries 🙂
2) since you are on SQL 2012, have a look at LAST_VALUE. It might be a better option than your TOP 1 subquery. NOTE: the default window clause (RANGE BETEEN...) is HORRIBLY inefficient currently. Be sure to use ROWS BETWEEN...
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
May 23, 2013 at 1:28 pm
The First_Value/Last_Value sounds really promising, but I'm not quite getting it yet.
-- Method 1.5a
SELECT DISTINCT
Product.Name
,ReportDate = @ReportingDate
,MostRecentOrderDetailId = (SELECT LAST_VALUE(SalesOrderDetailId) OVER (PARTITION BY Product.ProductId order by SalesOrderDetail.ModifiedDate)
FROM sales.SalesOrderDetail SalesOrderDetail WHERE SalesOrderDetail.ProductID = product.ProductID
AND SalesOrderDetail.ModifiedDate <= @ReportingDate)
FROM Production.Product Product
-- Method 1.5b
SELECT DISTINCT
Product.Name
,ReportDate = @ReportingDate
,LAST_VALUE(SalesOrderDetailId) OVER (PARTITION BY Product.ProductId order by SalesOrderDetail.ModifiedDate)
FROM Production.Product Product
LEFT JOIN sales.SalesOrderDetail SalesOrderDetail ON SalesOrderDetail.ProductID = product.ProductID
1.5a returns the fun error:
Msg 512, Level 16, State 1, Line 11
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
When I try to add grouping to either, the window function I get this error:
Column 'sales.SalesOrderDetail.ModifiedDate' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
For every product I'm trying to get a single value (the Last_Value before a date), but I seem to be doing rather poorly at using Last_Value and getting it to group or return just a single value.
Any advice on what I'm overlooking here?
May 24, 2013 at 1:37 am
This is a variation of the 'Top N per Group' problem, and there are many ways to express it in T-SQL. Whichever you choose, be sure to make the ORDER BY clause deterministic and provide a useful index. A deterministic ORDER BY simply means there should be no ties. In the AdventureWorks example, TOP (1) ... ORDER BY ModifiedDate DESC is not deterministic per product because multiple order detail rows can have the same ModifiedDate. A useful index for that query is:
CREATE INDEX nc1
ON Sales.SalesOrderDetail
(
ProductID,
ModifiedDate DESC,
SalesOrderDetailID DESC
);
Writing the query to be deterministic:
SELECT
p.Name,
ReportDate = @ReportingDate,
MostRecent.SalesOrderDetailID
FROM Production.Product AS p
OUTER APPLY
(
SELECT TOP (1)
sod.SalesOrderDetailID
FROM Sales.SalesOrderDetail AS sod
WHERE
sod.ProductID = p.ProductID
AND sod.ModifiedDate <= @ReportingDate
ORDER BY
sod.ModifiedDate DESC,
sod.SalesOrderDetailID DESC
) AS MostRecent;
The query plan no longer features an index spool or sort, seeking straight to the required row per product:
May 24, 2013 at 2:34 am
How about this solutions...
--- Create a InLine Function with ReportingDate Parameter --
CREATE FUNCTION fnMostRecentOrderDetail(@ReportingDate datetime) RETURNS TABLE
AS
RETURN
(
SELECT ProductId, SalesOrderDetailId, ModifiedDate,
ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY ModifiedDate DESC) RowNo
FROM sales.SalesOrderDetail
WHERE ModifiedDate < @ReportingDate
)
GO
and then...
-- Join the function with ReportingDate Parameter --
declare @ReportingDate DATE = '9/7/2007'
SELECT
Name
,ReportDate = @ReportingDate
,MostRecent.ModifiedDate
FROM Production.Product Product
LEFT OUTER JOIN fnMostRecentOrderDetail(@ReportingDate) MostRecent ON Product.ProductID = MostRecent.ProductId and RowNo = 1
May 24, 2013 at 5:14 am
Add index suggested in Paul's reply, to improve performance of your query.
May 24, 2013 at 8:57 am
Paul White (5/24/2013)
This is a variation of the 'Top N per Group' problem, and there are many ways to express it in T-SQL. Whichever you choose, be sure to make the ORDER BY clause deterministic and provide a useful index...
Thank you for the correction! yes, on the query in my system that looks into history is deterministic, my AdventureWorks example wasn't as well thought out as it could be 🙂
More importantly giving a good example of what the right index should get you in a query plan
May 24, 2013 at 9:12 am
Bhaskar.Shetty (5/24/2013)
How about this solutions...>>SQL Removed<<
Hmm, thank you. I hadn't quite thought of that as a third option to the Sub query and cross apply. In fact if I'd made a ITF, I think my mind would have drifted towards a cross apply
I like how it's clean and I'm interested to see how it the RowNumber compares to the Top 1 performance wise. I'll post a followup comparison 🙂
May 30, 2013 at 1:26 pm
Bhaskar.Shetty (5/24/2013)
How about this solutions...
--- Create a InLine Function with ReportingDate Parameter --
CREATE FUNCTION fnMostRecentOrderDetail(@ReportingDate datetime) RETURNS TABLE
AS
RETURN
(
SELECT ProductId, SalesOrderDetailId, ModifiedDate,
ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY ModifiedDate DESC) RowNo
FROM sales.SalesOrderDetail
WHERE ModifiedDate < @ReportingDate
)
GO
and then...
-- Join the function with ReportingDate Parameter --
declare @ReportingDate DATE = '9/7/2007'
SELECT
Name
,ReportDate = @ReportingDate
,MostRecent.ModifiedDate
FROM Production.Product Product
LEFT OUTER JOIN fnMostRecentOrderDetail(@ReportingDate) MostRecent ON Product.ProductID = MostRecent.ProductId and RowNo = 1
I have a solution very similar now.
Currently it's a inline table function that takes a product id and date, then returns the one reference back to the record.
It uses the Row_Number() ORDER BY... where RowNumber = 1, but inside of the function, so it always just returns a single row.
Currently it's not in the AdventureWorks db, it's in my actual db. But I'm interested in this sort of thing, so I'll probobly post up here the three different methods I'm seeing and the differences. AdventureWorks might be too small to really differentiate the performance, but we'll see.
We have the Top 1 Desc (starting), Return Row_Number 1 for a very specific item (my new) and Return Row_Numbers and join on 1 to limit the results (actually returns multiple) that you gave me.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply