July 10, 2019 at 9:30 am
Hi Guys.
Is there a way to select the distinct ref number where the date is the newest date and acc is equal to LEX. So the results would be:
SELECT *
INTO #SampleData2
FROM (
VALUES ('BA', '12345', '2019-01-01', 'LEX'),
('BA', '12345','2019-01-02', 'LEX'),
('BA', '12345','2019-01-03', 'CASH'),
('BA', '12345','2019-01-04', 'CASH'),
('BA', '12345','2019-01-05', ''),
('BA', '12346', '2019-01-01', 'LEX'),
('BA', '12346','2019-01-02', 'LEX'),
('BA', '12346','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12347','2019-01-02', 'LEX'),
('BA', '12347','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12346','2019-01-19', ''),
('BA', '12346','2019-01-20', ''),
('BA', '12347','2019-01-16', 'LEX'),
('BA', '12347','2019-01-19', 'CASH')
) AS d (Name, ref, DATE, ACC);
SELECT * from #SampleData2
July 10, 2019 at 10:21 am
Yes, the common way is using a CTE and ROW_NUMBER
or using ROW_NUMBER
in the ORDER BY
and TOP 1 WITH TIES
. There's plenty of examples on the internet on how to do this (and I'm sure a Google would have given you the answer), but:
WITH CTE AS(
SELECT SD2.Name,
SD2.ref,
SD2.DATE,
SD2.ACC,
ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX')
SELECT C.[Name],
C.ref,
C.[DATE],
C.ACC
FROM CTE C
WHERE C.RN = 1;
SELECT TOP 1 WITH TIES
SD2.[Name],
SD2.ref,
SD2.[DATE],
SD2.ACC
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX'
ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC);
Thom~
Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk
July 10, 2019 at 10:38 am
Perfect thank you very much
July 10, 2019 at 12:12 pm
Here's a method that doesn't use ROW_NUMBER()
;WITH CTE AS
(
SELECT DISTINCT SD2.Name,
SD2.ref
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX'
)
SELECT x.*
FROM CTE
CROSS APPLY(SELECT TOP(1) *
FROM #SampleData2 x
WHERE x.Name = CTE.Name
AND x.ref = CTE.ref
AND x.ACC = 'LEX'
ORDER BY x.DATE DESC) x
July 10, 2019 at 2:20 pm
Here's a method that doesn't use ROW_NUMBER()
It also requires two scans of the table and is approximately twice as expensive as a result. It will always perform worse than the ROW_NUMBER() approach.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
July 10, 2019 at 2:26 pm
Jonathan AC Roberts wrote:Here's a method that doesn't use ROW_NUMBER()
It also requires two scans of the table and is approximately twice as expensive as a result. It will always perform worse than the ROW_NUMBER() approach. Drew
I'm sure I've had this argument before. The method I suggested is actually faster than the ROW_NUMBER() provided there is an index on ACC, Name, ref and also that data has on average more than about 4 rows for each ACC, Name, ref. The query can then do a seek, which is faster than the large sort needed for ROW_NUMBER().
So no, it won't always perform worse as it sometimes performs better.
July 10, 2019 at 4:00 pm
Just out of curiosity, I created semi-valid test of these three sets of code. I took the code and inserted it into a regular table, and used the GO 111 to insert the same rows. Not a valid distribution of data for an actual test, but it's good enough. It ended up with 12768 rows.
Both queries came up with a recommended index, so I added them.
CREATE CLUSTERED INDEX [IX99_SampleData2]
ON [dbo].[SampleData2] ([ACC])
GO
USE [Junk]
GO
CREATE NONCLUSTERED INDEX [IX98_SampleData2]
ON [dbo].[SampleData2] ([Name],[Ref],[DATE])
GO
Drew's first example returns in 12 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SampleData2'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Drew's second example returns in 15 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SampleData2'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Both do a index seek of the clustered index, and a sort. The second one does an additional sort.
Jonathan's code took 16 ms.
Table 'SampleData2'. Scan count 4, logical reads 116, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The execution plan shows two index scans of the non-clustered index, which I expected with the use of the DISTINCT keyword.
Which code won?
From a purely "style" point of view, the use of DISTINCT is more often than not a huge red flag when I have to tune code. I just spent a month re-writing code that fills a set of summary tables for a reporting database that used DISTINCT everywhere. All of the tables contained bad data that was caused by the blind use of distinct, I prefer the row_number method. I may be a bit prejudiced, but...
Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/
July 10, 2019 at 4:20 pm
Hi Michael, The indexes recommended by SQL Server are not always the best, the index you would need to try is:
CREATE NONCLUSTERED INDEX [IX98_SampleData3]
ON [dbo].[SampleData2] (ACC,[Name],[Ref],[DATE]) INCLUDE (allOtherColumns)
If you paste in the code you used to generate the test data I'll take a look.
July 10, 2019 at 4:55 pm
Hi Michael, The indexes recommended by SQL Server are not always the best, the index you would need to try is:
CREATE NONCLUSTERED INDEX [IX98_SampleData3]
ON [dbo].[SampleData2] (ACC,[Name],[Ref],[DATE]) INCLUDE (allOtherColumns)If you paste in the code you used to generate the test data I'll take a look.
Agreed. You'll also want to try the following:
CREATE NONCLUSTERED INDEX [IX98_SampleData3]
ON [dbo].[SampleData2] (ACC,[Ref],[DATE] DESC) INCLUDE (allOtherColumns)
Having the date sorted descending in the index means that the sort isn't required in the CTE for the ROW_NUMBER().
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
July 10, 2019 at 5:10 pm
I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.
set statistics io, time off
set nocount on
go
DROP TABLE #SampleData2
GO
SELECT *
INTO #SampleData2
FROM (
VALUES ('BA', '12345', '2019-01-01', 'LEX'),
('BA', '12345','2019-01-02', 'LEX'),
('BA', '12345','2019-01-03', 'CASH'),
('BA', '12345','2019-01-04', 'CASH'),
('BA', '12345','2019-01-05', ''),
('BA', '12346', '2019-01-01', 'LEX'),
('BA', '12346','2019-01-02', 'LEX'),
('BA', '12346','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12347','2019-01-02', 'LEX'),
('BA', '12347','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12346','2019-01-19', ''),
('BA', '12346','2019-01-20', ''),
('BA', '12347','2019-01-16', 'LEX'),
('BA', '12347','2019-01-19', 'CASH')
) AS d (Name, ref, DATE, ACC);
GO
insert INTO #SampleData2
SELECT *
FROM (
VALUES ('BA', '12345', '2019-01-01', 'LEX'),
('BA', '12345','2019-01-02', 'LEX'),
('BA', '12345','2019-01-03', 'CASH'),
('BA', '12345','2019-01-04', 'CASH'),
('BA', '12345','2019-01-05', ''),
('BA', '12346', '2019-01-01', 'LEX'),
('BA', '12346','2019-01-02', 'LEX'),
('BA', '12346','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12347','2019-01-02', 'LEX'),
('BA', '12347','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12346','2019-01-19', ''),
('BA', '12346','2019-01-20', ''),
('BA', '12347','2019-01-16', 'LEX'),
('BA', '12347','2019-01-19', 'CASH')
) AS d (Name, ref, DATE, ACC);
GO 10000
--SELECT * from #SampleData2
CREATE NONCLUSTERED INDEX [IX98_SampleData3] ON #SampleData2 (ACC,[Name],[Ref],[DATE])
PRINT '
CTE TOP(1) With Index'
set statistics time on
;WITH CTE AS
(
SELECT DISTINCT SD2.Name,
SD2.ref
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX'
)
SELECT x.*
FROM CTE
CROSS APPLY(SELECT TOP(1) x.Name,x.ref,x.DATE,x.ACC
FROM #SampleData2 x
WHERE x.Name = CTE.Name
AND x.ref = CTE.ref
AND x.ACC = 'LEX'
ORDER BY x.DATE DESC) x
set statistics io, time off
GO
DROP INDEX [IX98_SampleData3] ON #SampleData2
GO
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])
GO
CREATE NONCLUSTERED INDEX [IX98_SampleData2] ON #SampleData2([Name],[Ref],[DATE])
GO
PRINT '
ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN'
set statistics time on
;WITH CTE AS(
SELECT SD2.Name,
SD2.ref,
SD2.DATE,
SD2.ACC,
ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX')
SELECT C.[Name],
C.ref,
C.[DATE],
C.ACC
FROM CTE C
WHERE C.RN = 1;
PRINT '
ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC)'
SELECT TOP 1 WITH TIES
SD2.[Name],
SD2.ref,
SD2.[DATE],
SD2.ACC
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX'
ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC);
set statistics io, time off
DROP INDEX [IX99_SampleData2] ON #SampleData2
GO
DROP INDEX [IX98_SampleData2] ON #SampleData2
GO
-- DROP INDEX [IX98_SampleData3] ON #SampleData2
July 10, 2019 at 7:49 pm
I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.
I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries.
Replace this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])
with this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
July 10, 2019 at 8:26 pm
Jonathan AC Roberts wrote:I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.
I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries. Replace this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])with this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)Drew
The indexes I put in were those suggested by Michael which SQL Server had recommended.
The change to the index you suggested made the CTE with ROW_NUMBER query about the same as the CTE with TOP(1) , the TOP(1) WITH TIES is about twice as slow.
I didn't realise that DESC would make any difference with the performance of the index, I thought it would just reduce fragmentation of the index on inserts if the rows are inserted in order.
July 10, 2019 at 10:12 pm
drew.allen wrote:Jonathan AC Roberts wrote:I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.
I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries. Replace this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])with this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)Drew
The indexes I put in were those suggested by Michael which SQL Server had recommended. The change to the index you suggested made the CTE with ROW_NUMBER query about the same as the CTE with TOP(1) , the TOP(1) WITH TIES is about twice as slow. I didn't realise that DESC would make any difference with the performance of the index, I thought it would just reduce fragmentation of the index on inserts if the rows are inserted in order.
I think it actually increases fragmentation, because records are typically added in increasing date order rather than decreasing date order. It improves efficiency, because the index is organized in the same order as the OVER() clause for the ROW_NUMBER(), so you don't have to sort the results to get the correct ROW_NUMBER().
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
July 11, 2019 at 1:07 am
I think it actually increases fragmentation, because records are typically added in increasing date order rather than decreasing date order. It improves efficiency, because the index is organised in the same order as the OVER() clause for the ROW_NUMBER(), so you don't have to sort the results to get the correct ROW_NUMBER(). Drew
Yes I did mean that, I thought the only reason you would put DESC is to reduce fragmentation if the rows were inserted in descending order, I didn't realise it would have such an effect on the performance of queries. I've tried it before and it made very little difference on the queries I tried, but for this query it made a big difference.
July 11, 2019 at 11:20 am
This question got asked in two places (btw, please don't do that). I answered on the other one with this:
Here is one article and here is another. Between the two, we outline a bunch of ways to solve the versioned data approach. The core of it is, you have to sort the data and the top values.
The two articles include full-blown comparisons of performance and explanations as to why each different method works (or doesn't).
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply