Newest record based on column criteria

Question

Newest record based on column criteria

craig.jenkins

SSCommitted

Points: 1643
More actions
July 10, 2019 at 9:30 am

#3661486

Hi Guys.
Is there a way to select the distinct ref number where the date is the newest date and acc is equal to LEX. So the results would be:

SELECT *
INTO #SampleData2
FROM (
VALUES ('BA', '12345', '2019-01-01', 'LEX'),
('BA', '12345','2019-01-02', 'LEX'),
('BA', '12345','2019-01-03', 'CASH'),
('BA', '12345','2019-01-04', 'CASH'),
('BA', '12345','2019-01-05', ''),
('BA', '12346', '2019-01-01', 'LEX'),
('BA', '12346','2019-01-02', 'LEX'),
('BA', '12346','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12347','2019-01-02', 'LEX'),
('BA', '12347','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12346','2019-01-19', ''),
('BA', '12346','2019-01-20', ''),
('BA', '12347','2019-01-16', 'LEX'),
('BA', '12347','2019-01-19', 'CASH')
) AS d (Name, ref, DATE, ACC);
SELECT * from #SampleData2

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

Thom A SSC Guru Points: 99129 More actions · Answer 1

Yes, the common way is using a CTE and ROW_NUMBER or using ROW_NUMBER in the ORDER BY and TOP 1 WITH TIES. There's plenty of examples on the internet on how to do this (and I'm sure a Google would have given you the answer), but:

WITH CTE AS(
    SELECT SD2.Name,
           SD2.ref,
           SD2.DATE,
           SD2.ACC,
           ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX')
SELECT C.[Name],
       C.ref,
       C.[DATE],
       C.ACC
FROM CTE C
WHERE C.RN = 1;

SELECT TOP 1 WITH TIES
       SD2.[Name],
       SD2.ref,
       SD2.[DATE],
       SD2.ACC
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX'
ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC);

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

craig.jenkins SSCommitted Points: 1643 More actions · Answer 2

craig.jenkins

SSCommitted

Points: 1643

July 10, 2019 at 10:38 am

#3661505

Perfect thank you very much

Jonathan AC Roberts SSCoach Points: 18352 More actions · Answer 3

Here's a method that doesn't use ROW_NUMBER()

;WITH CTE AS
(
    SELECT DISTINCT SD2.Name,
           SD2.ref
      FROM #SampleData2 SD2
     WHERE SD2.ACC = 'LEX'
)
SELECT x.*
  FROM CTE
 CROSS APPLY(SELECT TOP(1) *
               FROM #SampleData2 x
              WHERE x.Name = CTE.Name
                AND x.ref = CTE.ref
                AND x.ACC = 'LEX'
              ORDER BY x.DATE DESC) x

drew.allen SSC Guru Points: 76985 More actions · Answer 4

Jonathan AC Roberts wrote:

Here's a method that doesn't use ROW_NUMBER()

It also requires two scans of the table and is approximately twice as expensive as a result. It will always perform worse than the ROW_NUMBER() approach.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Jonathan AC Roberts SSCoach Points: 18352 More actions · Answer 5

drew.allen wrote:

Jonathan AC Roberts wrote:
Here's a method that doesn't use ROW_NUMBER()
It also requires two scans of the table and is approximately twice as expensive as a result. It will always perform worse than the ROW_NUMBER() approach. Drew

I'm sure I've had this argument before. The method I suggested is actually faster than the ROW_NUMBER() provided there is an index on ACC, Name, ref and also that data has on average more than about 4 rows for each ACC, Name, ref. The query can then do a seek, which is faster than the large sort needed for ROW_NUMBER().

So no, it won't always perform worse as it sometimes performs better.

Michael L John One Orange Chip Points: 27054 More actions · Answer 6

Just out of curiosity, I created semi-valid test of these three sets of code. I took the code and inserted it into a regular table, and used the GO 111 to insert the same rows. Not a valid distribution of data for an actual test, but it's good enough. It ended up with 12768 rows.

Both queries came up with a recommended index, so I added them.

CREATE CLUSTERED INDEX [IX99_SampleData2]
ON [dbo].[SampleData2] ([ACC])
GO

USE [Junk]
GO
CREATE NONCLUSTERED INDEX [IX98_SampleData2]
ON [dbo].[SampleData2] ([Name],[Ref],[DATE])
GO

Drew's first example returns in 12 ms.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SampleData2'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Drew's second example returns in 15 ms.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SampleData2'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Both do a index seek of the clustered index, and a sort. The second one does an additional sort.

Jonathan's code took 16 ms.

Table 'SampleData2'. Scan count 4, logical reads 116, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The execution plan shows two index scans of the non-clustered index, which I expected with the use of the DISTINCT keyword.

Which code won?

From a purely "style" point of view, the use of DISTINCT is more often than not a huge red flag when I have to tune code. I just spent a month re-writing code that fills a set of summary tables for a reporting database that used DISTINCT everywhere. All of the tables contained bad data that was caused by the blind use of distinct, I prefer the row_number method. I may be a bit prejudiced, but...

Michael L John
If you assassinate a DBA, would you pull a trigger?
To properly post on a forum:
http://www.sqlservercentral.com/articles/61537/

Jonathan AC Roberts SSCoach Points: 18352 More actions · Answer 7

Hi Michael, The indexes recommended by SQL Server are not always the best, the index you would need to try is:

CREATE NONCLUSTERED INDEX [IX98_SampleData3]
ON [dbo].[SampleData2] (ACC,[Name],[Ref],[DATE]) INCLUDE (allOtherColumns)

If you paste in the code you used to generate the test data I'll take a look.

drew.allen SSC Guru Points: 76985 More actions · Answer 8

Jonathan AC Roberts wrote:

Hi Michael, The indexes recommended by SQL Server are not always the best, the index you would need to try is:
CREATE NONCLUSTERED INDEX [IX98_SampleData3]
ON [dbo].[SampleData2] (ACC,[Name],[Ref],[DATE]) INCLUDE (allOtherColumns)
If you paste in the code you used to generate the test data I'll take a look.

Agreed. You'll also want to try the following:

CREATE NONCLUSTERED INDEX [IX98_SampleData3]
ON [dbo].[SampleData2] (ACC,[Ref],[DATE] DESC) INCLUDE (allOtherColumns)

Having the date sorted descending in the index means that the sort isn't required in the CTE for the ROW_NUMBER().

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Jonathan AC Roberts SSCoach Points: 18352 More actions · Answer 9

I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.

set statistics io, time off
set nocount on
go
DROP TABLE #SampleData2
GO
SELECT *
INTO #SampleData2
FROM (
VALUES ('BA', '12345', '2019-01-01', 'LEX'),
('BA', '12345','2019-01-02', 'LEX'),
('BA', '12345','2019-01-03', 'CASH'),
('BA', '12345','2019-01-04', 'CASH'),
('BA', '12345','2019-01-05', ''),
('BA', '12346', '2019-01-01', 'LEX'),
('BA', '12346','2019-01-02', 'LEX'),
('BA', '12346','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12347','2019-01-02', 'LEX'),
('BA', '12347','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12346','2019-01-19', ''),
('BA', '12346','2019-01-20', ''),
('BA', '12347','2019-01-16', 'LEX'),
('BA', '12347','2019-01-19', 'CASH')
) AS d (Name, ref, DATE, ACC);
GO
insert INTO #SampleData2
SELECT *
FROM (
VALUES ('BA', '12345', '2019-01-01', 'LEX'),
('BA', '12345','2019-01-02', 'LEX'),
('BA', '12345','2019-01-03', 'CASH'),
('BA', '12345','2019-01-04', 'CASH'),
('BA', '12345','2019-01-05', ''),
('BA', '12346', '2019-01-01', 'LEX'),
('BA', '12346','2019-01-02', 'LEX'),
('BA', '12346','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12347','2019-01-02', 'LEX'),
('BA', '12347','2019-01-03', 'CASH'),
('BA', '12346','2019-01-16', 'CASH'),
('BA', '12346','2019-01-19', ''),
('BA', '12346','2019-01-20', ''),
('BA', '12347','2019-01-16', 'LEX'),
('BA', '12347','2019-01-19', 'CASH')
) AS d (Name, ref, DATE, ACC);
GO 10000
--SELECT * from #SampleData2
CREATE NONCLUSTERED INDEX [IX98_SampleData3] ON #SampleData2 (ACC,[Name],[Ref],[DATE]) 

PRINT '

CTE TOP(1) With Index'

set statistics time on

;WITH CTE AS
(
    SELECT DISTINCT SD2.Name,
           SD2.ref
      FROM #SampleData2 SD2
     WHERE SD2.ACC = 'LEX'
)
SELECT x.*
  FROM CTE
 CROSS APPLY(SELECT TOP(1) x.Name,x.ref,x.DATE,x.ACC
               FROM #SampleData2 x
              WHERE x.Name = CTE.Name
                AND x.ref = CTE.ref
                AND x.ACC = 'LEX'
              ORDER BY x.DATE DESC) x
set statistics io, time off
GO
DROP INDEX [IX98_SampleData3] ON #SampleData2

GO
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])
GO
CREATE NONCLUSTERED INDEX [IX98_SampleData2] ON #SampleData2([Name],[Ref],[DATE])
GO

PRINT '

ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN'

set statistics time on
;WITH CTE AS(
    SELECT SD2.Name,
           SD2.ref,
           SD2.DATE,
           SD2.ACC,
           ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX')
SELECT C.[Name],
       C.ref,
       C.[DATE],
       C.ACC
FROM CTE C
WHERE C.RN = 1;

PRINT '

ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC)'

SELECT TOP 1 WITH TIES
       SD2.[Name],
       SD2.ref,
       SD2.[DATE],
       SD2.ACC
FROM #SampleData2 SD2
WHERE SD2.ACC = 'LEX'
ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC);
set statistics io, time off

DROP INDEX [IX99_SampleData2] ON #SampleData2
GO
DROP INDEX [IX98_SampleData2] ON #SampleData2
GO
-- DROP INDEX [IX98_SampleData3] ON #SampleData2

drew.allen SSC Guru Points: 76985 More actions · Answer 10

Jonathan AC Roberts wrote:

I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.

I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries.

Replace this:

CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])

with this:

CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Jonathan AC Roberts SSCoach Points: 18352 More actions · Answer 11

drew.allen wrote:

Jonathan AC Roberts wrote:
I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.
I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries. Replace this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])
with this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)
Drew

The indexes I put in were those suggested by Michael which SQL Server had recommended.

The change to the index you suggested made the CTE with ROW_NUMBER query about the same as the CTE with TOP(1) , the TOP(1) WITH TIES is about twice as slow.

I didn't realise that DESC would make any difference with the performance of the index, I thought it would just reduce fragmentation of the index on inserts if the rows are inserted in order.

drew.allen SSC Guru Points: 76985 More actions · Answer 12

Jonathan AC Roberts wrote:

drew.allen wrote:
Jonathan AC Roberts wrote:
I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.
I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries. Replace this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])
with this:
CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)
Drew
The indexes I put in were those suggested by Michael which SQL Server had recommended. The change to the index you suggested made the CTE with ROW_NUMBER query about the same as the CTE with TOP(1) , the TOP(1) WITH TIES is about twice as slow. I didn't realise that DESC would make any difference with the performance of the index, I thought it would just reduce fragmentation of the index on inserts if the rows are inserted in order.

I think it actually increases fragmentation, because records are typically added in increasing date order rather than decreasing date order. It improves efficiency, because the index is organized in the same order as the OVER() clause for the ROW_NUMBER(), so you don't have to sort the results to get the correct ROW_NUMBER().

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

Jonathan AC Roberts SSCoach Points: 18352 More actions · Answer 13

drew.allen wrote:

I think it actually increases fragmentation, because records are typically added in increasing date order rather than decreasing date order. It improves efficiency, because the index is organised in the same order as the OVER() clause for the ROW_NUMBER(), so you don't have to sort the results to get the correct ROW_NUMBER(). Drew

Yes I did mean that, I thought the only reason you would put DESC is to reduce fragmentation if the rows were inserted in descending order, I didn't realise it would have such an effect on the performance of queries. I've tried it before and it made very little difference on the queries I tried, but for this query it made a big difference.

Grant Fritchey SSC Guru Points: 398882 More actions · Answer 14

This question got asked in two places (btw, please don't do that). I answered on the other one with this:

Here is one article and here is another. Between the two, we outline a bunch of ways to solve the versioned data approach. The core of it is, you have to sort the data and the top values.

The two articles include full-blown comparisons of performance and explanations as to why each different method works (or doesn't).

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning