Newest record based on column criteria

  • Hi Guys.

    Is there a way to select the distinct ref number where the date is the newest date and acc is equal to LEX.  So the results would be:

     

     

    SELECT *

    INTO #SampleData2

    FROM (

    VALUES ('BA', '12345', '2019-01-01', 'LEX'),

    ('BA', '12345','2019-01-02', 'LEX'),

    ('BA', '12345','2019-01-03', 'CASH'),

    ('BA', '12345','2019-01-04', 'CASH'),

    ('BA', '12345','2019-01-05', ''),

    ('BA', '12346', '2019-01-01', 'LEX'),

    ('BA', '12346','2019-01-02', 'LEX'),

    ('BA', '12346','2019-01-03', 'CASH'),

    ('BA', '12346','2019-01-16', 'CASH'),

    ('BA', '12347','2019-01-02', 'LEX'),

    ('BA', '12347','2019-01-03', 'CASH'),

    ('BA', '12346','2019-01-16', 'CASH'),

    ('BA', '12346','2019-01-19', ''),

    ('BA', '12346','2019-01-20', ''),

    ('BA', '12347','2019-01-16', 'LEX'),

    ('BA', '12347','2019-01-19', 'CASH')

    ) AS d (Name, ref, DATE, ACC);

    SELECT * from #SampleData2

  • Yes, the common way is using a CTE and ROW_NUMBER or using ROW_NUMBER in the ORDER BY and TOP 1 WITH TIES. There's plenty of examples on the internet on how to do this (and I'm sure a Google would have given you the answer), but:

    WITH CTE AS(
    SELECT SD2.Name,
    SD2.ref,
    SD2.DATE,
    SD2.ACC,
    ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX')
    SELECT C.[Name],
    C.ref,
    C.[DATE],
    C.ACC
    FROM CTE C
    WHERE C.RN = 1;

    SELECT TOP 1 WITH TIES
    SD2.[Name],
    SD2.ref,
    SD2.[DATE],
    SD2.ACC
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX'
    ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC);

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Perfect thank you very much

  • Here's a method that doesn't use ROW_NUMBER()

    ;WITH CTE AS
    (
    SELECT DISTINCT SD2.Name,
    SD2.ref
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX'
    )
    SELECT x.*
    FROM CTE
    CROSS APPLY(SELECT TOP(1) *
    FROM #SampleData2 x
    WHERE x.Name = CTE.Name
    AND x.ref = CTE.ref
    AND x.ACC = 'LEX'
    ORDER BY x.DATE DESC) x
  • Jonathan AC Roberts wrote:

    Here's a method that doesn't use ROW_NUMBER()

    It also requires two scans of the table and is approximately twice as expensive as a result.  It will always perform worse than the ROW_NUMBER() approach.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen wrote:

    Jonathan AC Roberts wrote:

    Here's a method that doesn't use ROW_NUMBER()

    It also requires two scans of the table and is approximately twice as expensive as a result.  It will always perform worse than the ROW_NUMBER() approach. Drew

    I'm sure I've had this argument before. The method I suggested is actually faster than the ROW_NUMBER() provided there is an index on ACC, Name, ref and also that data has on average more than about 4 rows for each ACC, Name, ref. The query can then do a seek, which is faster than the large sort needed for ROW_NUMBER().

    So no, it won't always perform worse as it sometimes performs better.

  • Just out of curiosity, I created semi-valid test of these three sets of code.  I took the code and inserted it into a regular table, and used the GO 111 to insert the same rows.  Not a valid distribution of data for an actual test, but it's good enough.  It ended up with 12768 rows.

    Both queries came up with a recommended index, so I added them.

    CREATE CLUSTERED INDEX [IX99_SampleData2]
    ON [dbo].[SampleData2] ([ACC])
    GO

    USE [Junk]
    GO
    CREATE NONCLUSTERED INDEX [IX98_SampleData2]
    ON [dbo].[SampleData2] ([Name],[Ref],[DATE])
    GO

    Drew's first example returns in 12 ms.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'SampleData2'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Drew's second example returns in 15 ms.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'SampleData2'. Scan count 1, logical reads 29, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Both do a index seek of the clustered index, and a sort.  The second one does an additional sort.

    Jonathan's code took 16 ms.

    Table 'SampleData2'. Scan count 4, logical reads 116, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    The execution plan shows two index scans of the non-clustered index, which I expected with the use of the DISTINCT keyword.

    Which code won?

    From a purely "style" point of view, the use of DISTINCT is more often than not a huge red flag when I have to tune code.  I just spent a month re-writing code that fills a set of summary tables for a reporting database that used DISTINCT everywhere.  All of the tables contained bad data that was caused by the blind use of distinct, I prefer the row_number method.  I may be a bit prejudiced, but...

     

     

     

     

     

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Hi Michael, The indexes recommended by SQL Server are not always the best, the index you would need to try is:

    CREATE NONCLUSTERED INDEX [IX98_SampleData3]
    ON [dbo].[SampleData2] (ACC,[Name],[Ref],[DATE]) INCLUDE (allOtherColumns)

    If you paste in the code you used to generate the test data I'll take a look.

  • Jonathan AC Roberts wrote:

    Hi Michael, The indexes recommended by SQL Server are not always the best, the index you would need to try is:

    CREATE NONCLUSTERED INDEX [IX98_SampleData3]
    ON [dbo].[SampleData2] (ACC,[Name],[Ref],[DATE]) INCLUDE (allOtherColumns)

    If you paste in the code you used to generate the test data I'll take a look.

    Agreed.  You'll also want to try the following:

    CREATE NONCLUSTERED INDEX [IX98_SampleData3]
    ON [dbo].[SampleData2] (ACC,[Ref],[DATE] DESC) INCLUDE (allOtherColumns)

    Having the date sorted descending in the index means that the sort isn't required in the CTE for the ROW_NUMBER().

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.

    set statistics io, time off
    set nocount on
    go
    DROP TABLE #SampleData2
    GO
    SELECT *
    INTO #SampleData2
    FROM (
    VALUES ('BA', '12345', '2019-01-01', 'LEX'),
    ('BA', '12345','2019-01-02', 'LEX'),
    ('BA', '12345','2019-01-03', 'CASH'),
    ('BA', '12345','2019-01-04', 'CASH'),
    ('BA', '12345','2019-01-05', ''),
    ('BA', '12346', '2019-01-01', 'LEX'),
    ('BA', '12346','2019-01-02', 'LEX'),
    ('BA', '12346','2019-01-03', 'CASH'),
    ('BA', '12346','2019-01-16', 'CASH'),
    ('BA', '12347','2019-01-02', 'LEX'),
    ('BA', '12347','2019-01-03', 'CASH'),
    ('BA', '12346','2019-01-16', 'CASH'),
    ('BA', '12346','2019-01-19', ''),
    ('BA', '12346','2019-01-20', ''),
    ('BA', '12347','2019-01-16', 'LEX'),
    ('BA', '12347','2019-01-19', 'CASH')
    ) AS d (Name, ref, DATE, ACC);
    GO
    insert INTO #SampleData2
    SELECT *
    FROM (
    VALUES ('BA', '12345', '2019-01-01', 'LEX'),
    ('BA', '12345','2019-01-02', 'LEX'),
    ('BA', '12345','2019-01-03', 'CASH'),
    ('BA', '12345','2019-01-04', 'CASH'),
    ('BA', '12345','2019-01-05', ''),
    ('BA', '12346', '2019-01-01', 'LEX'),
    ('BA', '12346','2019-01-02', 'LEX'),
    ('BA', '12346','2019-01-03', 'CASH'),
    ('BA', '12346','2019-01-16', 'CASH'),
    ('BA', '12347','2019-01-02', 'LEX'),
    ('BA', '12347','2019-01-03', 'CASH'),
    ('BA', '12346','2019-01-16', 'CASH'),
    ('BA', '12346','2019-01-19', ''),
    ('BA', '12346','2019-01-20', ''),
    ('BA', '12347','2019-01-16', 'LEX'),
    ('BA', '12347','2019-01-19', 'CASH')
    ) AS d (Name, ref, DATE, ACC);
    GO 10000
    --SELECT * from #SampleData2
    CREATE NONCLUSTERED INDEX [IX98_SampleData3] ON #SampleData2 (ACC,[Name],[Ref],[DATE])

    PRINT '

    CTE TOP(1) With Index'

    set statistics time on

    ;WITH CTE AS
    (
    SELECT DISTINCT SD2.Name,
    SD2.ref
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX'
    )
    SELECT x.*
    FROM CTE
    CROSS APPLY(SELECT TOP(1) x.Name,x.ref,x.DATE,x.ACC
    FROM #SampleData2 x
    WHERE x.Name = CTE.Name
    AND x.ref = CTE.ref
    AND x.ACC = 'LEX'
    ORDER BY x.DATE DESC) x
    set statistics io, time off
    GO
    DROP INDEX [IX98_SampleData3] ON #SampleData2

    GO
    CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])
    GO
    CREATE NONCLUSTERED INDEX [IX98_SampleData2] ON #SampleData2([Name],[Ref],[DATE])
    GO

    PRINT '

    ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN'

    set statistics time on
    ;WITH CTE AS(
    SELECT SD2.Name,
    SD2.ref,
    SD2.DATE,
    SD2.ACC,
    ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC) AS RN
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX')
    SELECT C.[Name],
    C.ref,
    C.[DATE],
    C.ACC
    FROM CTE C
    WHERE C.RN = 1;

    PRINT '

    ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC)'

    SELECT TOP 1 WITH TIES
    SD2.[Name],
    SD2.ref,
    SD2.[DATE],
    SD2.ACC
    FROM #SampleData2 SD2
    WHERE SD2.ACC = 'LEX'
    ORDER BY ROW_NUMBER() OVER (PARTITION BY Ref ORDER BY [DATE] DESC);
    set statistics io, time off

    DROP INDEX [IX99_SampleData2] ON #SampleData2
    GO
    DROP INDEX [IX98_SampleData2] ON #SampleData2
    GO
    -- DROP INDEX [IX98_SampleData3] ON #SampleData2
  • Jonathan AC Roberts wrote:

    I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.

    I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries.

    Replace this:

    CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])

    with this:

    CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen wrote:

    Jonathan AC Roberts wrote:

    I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.

    I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries. Replace this:

    CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])

    with this:

    CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)

    Drew

    The indexes I put in were those suggested by Michael which SQL Server had  recommended.

    The change to the index you suggested made the CTE with ROW_NUMBER query about the same as the CTE with TOP(1) , the  TOP(1) WITH TIES is about twice as slow.

    I didn't realise that DESC would make any difference with the performance of the index, I thought it would just reduce fragmentation of the index on inserts if the rows are inserted in order.

  • Jonathan AC Roberts wrote:

    drew.allen wrote:

    Jonathan AC Roberts wrote:

    I've created some test code, not realistic, but the CTE with SELECT TOP(1) is over ten times faster, though has more logical reads.

    I also notice that you did not define your index using DATE DESC, which vastly improves the performance of both of the ROW_NUMBER() queries. Replace this:

    CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC])

    with this:

    CREATE CLUSTERED INDEX [IX99_SampleData2] ON #SampleData2 ([ACC], REF, DATE DESC)

    Drew

    The indexes I put in were those suggested by Michael which SQL Server had  recommended. The change to the index you suggested made the CTE with ROW_NUMBER query about the same as the CTE with TOP(1) , the  TOP(1) WITH TIES is about twice as slow. I didn't realise that DESC would make any difference with the performance of the index, I thought it would just reduce fragmentation of the index on inserts if the rows are inserted in order.

    I think it actually increases fragmentation, because records are typically added in increasing date order rather than decreasing date order.  It improves efficiency, because the index is organized in the same order as the OVER() clause for the ROW_NUMBER(), so you don't have to sort the results to get the correct ROW_NUMBER().

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen wrote:

    I think it actually increases fragmentation, because records are typically added in increasing date order rather than decreasing date order.  It improves efficiency, because the index is organised in the same order as the OVER() clause for the ROW_NUMBER(), so you don't have to sort the results to get the correct ROW_NUMBER(). Drew

    Yes I did mean that, I thought the only reason you would put DESC is to reduce fragmentation if the rows were inserted in descending order, I didn't realise it would have such an effect on the performance of queries. I've tried it before and it made very little difference on the queries I tried, but for this query it made a big difference.

  • This question got asked in two places (btw, please don't do that). I answered on the other one with this:

    Here is one article and here is another. Between the two, we outline a bunch of ways to solve the versioned data approach. The core of it is, you have to sort the data and the top values.

    The two articles include full-blown comparisons of performance and explanations as to why each different method works (or doesn't).

     

     

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply