January 11, 2008 at 7:20 am
My approach to selecting only the latest (most recent) entry per employee, where multiple entries might exist (and might have duplicate dates), has been:
/*
(Note, the following is non-executable pseudo code)
Sorry, I don't have AdventureWorks installed.
*/
WITH SEQ AS
(
    SELECT ROW_NUMBER()
             OVER (PARTITION BY SSN
                   ORDER BY [Date] DESC) AS RECENT
          ,[SSN]
          ,[Date]
    FROM [EmplData]
)
SELECT Emp.*
FROM [EmplData] AS Emp
INNER JOIN SEQ
        ON Emp.SSN = SEQ.SSN
       AND Emp.[Date] = SEQ.[Date]
WHERE SEQ.RECENT = 1
/*
I would just as soon avoid the join, and just do:
*/
SELECT ROW_NUMBER()
         OVER (PARTITION BY SSN
               ORDER BY [Date] DESC) AS RECENT
      ,[SSN]
      ,[Date]
FROM [EmplData] AS Emp
WHERE RECENT = 1
--BUT, as would be expected:
Msg 207, Level 16, State 1, Line 7
Invalid column name 'RECENT'.
/*
Since a column alias usually can't be referenced in the WHERE clause,
I attempted repeating the expression instead:
*/
SELECT *
FROM [EmplData] AS Emp
WHERE ROW_NUMBER()
        OVER (PARTITION BY SSN
              ORDER BY [Date] DESC) = 1
--BUT, alas:
Msg 4108, Level 15, State 1, Line 1
Windowed functions can only appear in the SELECT or ORDER BY clauses.
Does any expert have a clean/efficient approach to this requirement?
January 11, 2008 at 7:32 am
Why are you resorting to a RANKing function at all?
Do you have cases where two records are tied on both SSN and Date?
Typically this would be a join to a derived table, where the derived table gives you the MAX() in each set:
SELECT emp.*
FROM Employee As emp
INNER JOIN
(
SELECT SSN, MAX(Date) As MostRecentDate
FROM Employee
GROUP BY SSN
) dt
ON (dt.SSN = emp.SSN AND
dt.MostRecentDate = emp.Date)
January 11, 2008 at 7:32 am
What about:
SELECT x.RECENT
, x.[SSN]
, x.[Date]
FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY SSN ORDER BY [Date] DESC ) AS RECENT
, [SSN]
, [Date]
FROM [EmplData] AS Emp
) AS x
WHERE x.RECENT = 1
Andras
January 11, 2008 at 7:42 am
I agree with Andras. Also, if you simply add the extra fields from Employee that you actually need in the final query to the CTE definition, you shouldn't need a join back to Employee. That should make things quite a bit more efficient.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 11, 2008 at 7:50 am
PW, thanks.
This is one of those cases where I am unwilling to rely on the fact that I "know" that (today) there should be no records that duplicate both SSN and date. I also "know" that assumptions based on today's knowledge are likely to become false tomorrow, and to result in spurious duplicates from the join.
Just being a skeptic, I guess.
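(If duplicates like that ever do show up, I suppose a tiebreaker in the ORDER BY would still force exactly one row per SSN -- a sketch, assuming a hypothetical unique EmplID column:)
-- Sketch only: EmplID is a hypothetical unique column used as a tiebreaker,
-- so exactly one row per SSN gets RECENT = 1 even when two rows share
-- the same SSN and [Date].
WITH SEQ AS
(
    SELECT ROW_NUMBER()
             OVER (PARTITION BY SSN
                   ORDER BY [Date] DESC, EmplID DESC) AS RECENT
          ,[SSN]
          ,[Date]
    FROM [EmplData]
)
SELECT *
FROM SEQ
WHERE RECENT = 1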
Andras, thanks also! Your approach strikes me as the "professional" way to do it. I'm going to compare the query plans and see if the engine can take advantage of any differences between the two queries.
(And Matt, I really don't use the *, but without your reminder I would probably return more columns than I really needed. Thanks!)
I'm sure glad you "smart guys" hang around this very valuable site!
January 11, 2008 at 8:04 am
Actually Jim -
I was just trying to point out that you could avoid the "self join" by adding the display fields you want into the CTE declaration. That way you don't need to declare the CTE and then link back into Employee to get more data.
So - for example, if you were looking to also pull the name and address stuff for the latest SSN, you could simply do:
WITH SEQ (RECENT, SSN, [name], [address], [Date]) AS
(
    SELECT ROW_NUMBER()
             OVER (PARTITION BY SSN
                   ORDER BY [Date] DESC) AS RECENT
          ,[SSN]
          ,[name]
          ,[address]
          ,[Date]
    FROM [EmplData]
)
SELECT *
FROM SEQ
WHERE RECENT = 1
No loop back into the original employee table.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 11, 2008 at 8:10 am
Thanks Matt! And that also avoids the confusion created by the sub-query (although it makes it T-SQL specific). Yet a third approach to analyze and time.
January 11, 2008 at 8:17 am
Jim Russell (1/11/2008)
Thanks Matt! And that also avoids the confusion created by the sub-query (although it makes it T-SQL specific). Yet a third approach to analyze and time.
And it is very, very likely that the execution plans for the above CTE and non-CTE versions will be exactly the same (it seems to be the case with a toy database).
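(If you want to check this on your own tables, one way is to capture the text plans and compare them -- a sketch; SET SHOWPLAN_TEXT has to be the only statement in its batch:)
SET SHOWPLAN_TEXT ON
GO
-- Run the derived-table version (and then the CTE version) and
-- compare the plan text line by line; neither query is executed
-- while SHOWPLAN_TEXT is on.
SELECT x.RECENT, x.[SSN], x.[Date]
FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY SSN ORDER BY [Date] DESC ) AS RECENT
            , [SSN]
            , [Date]
       FROM [EmplData]
     ) AS x
WHERE x.RECENT = 1
GO
SET SHOWPLAN_TEXT OFF
GO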
Regards,
Andras
January 11, 2008 at 9:04 am
Andras -- you are right again.
The real surprise, which at first made me worry I was doing something dumb, was that the version with only the two fields of interest in the SEQ CTE plus the join was the fastest, in both the estimated and actual execution plans:
Q1 - 49% --Q1 Using sub-query which returns all fields (*) from target
Q2 - 2% --Q2 Just SSN and date in SEQ, with inner join against all fields (*) of target
Q3 - 49% --Q3 All fields (*) from target in SEQ, SELECT * FROM SEQ
(Q3 did not take advantage of Matt's suggestion to limit the columns, just to keep an even playing field. I'm sure his was good advice, although in this case I needed to duplicate the target table completely, just eliminating the extraneous records.)
The Q2 difference (which is where I started) was flattering, until I realized that one of two things was going on:
1. I was still doing something dumb (always a strong possibility)
2. The query was able to take advantage of the indexes I had on the "SSN" and "Date" columns. (Which I neglected to mention -- sketched below.)
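(For reference, the kind of index that lines up with this query -- a hypothetical sketch, not my actual definitions:)
-- Hypothetical sketch: the index name and exact shape are assumptions.
-- Keying on (SSN, [Date] DESC) matches PARTITION BY SSN ORDER BY [Date] DESC,
-- so the engine can pick off the top row per SSN without a separate sort.
CREATE NONCLUSTERED INDEX IX_EmplData_SSN_Date
    ON [EmplData] (SSN, [Date] DESC)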
My timing strategy was to combine all three forms, separated by "GO"s, in a single batch. But, if anyone is willing to spend more time on this (even if just to prove option 1 above), I would be glad to email them the actual statement I used (this time with actual table and column names) and the resulting estimated and actual execution plans.
Again, everybody's help has been invaluable.
January 11, 2008 at 9:01 pm
You CANNOT rely on the "% of Batch" listed in estimated or actual execution plans to determine which query is best... it can "lie" because it can't always see everything...
To demonstrate, let's create a rather large test table...
--drop table jbmtest
--===== Create and populate a 1,000,000 row test table.
-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1,
Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.JBMTest
ADD CONSTRAINT PK_JBMTest_RowNum PRIMARY KEY CLUSTERED (RowNum)
Now, some test code...
SELECT TOP 10 *
FROM dbo.JBMTest
ORDER BY SomeInt
SET ROWCOUNT 10
SELECT *
FROM dbo.JBMTest
ORDER BY SomeInt
SET ROWCOUNT 0
Notice that both the estimated and the actual execution plans are identical for both queries and that both say they represent 50% of the batch... run each one separately and see which one is the fastest... the first query runs in about 3 seconds on my machine... the second takes 8 seconds...
Like I said, the execution plan "lies" insofar as % of Batch and CANNOT be used to determine the most effective query.
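If you want numbers you can trust, turn the statistics on and run each query by itself... a sketch, one way to do it:
-- Run each candidate in its own batch with timing on, then compare the
-- "SQL Server Execution Times" and the logical reads in the Messages tab.
SET STATISTICS TIME ON
SET STATISTICS IO ON

SELECT TOP 10 *
FROM dbo.JBMTest
ORDER BY SomeInt

SET STATISTICS TIME OFF
SET STATISTICS IO OFF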
--Jeff Moden
Change is inevitable... Change for the better is not.
January 14, 2008 at 4:38 am
Damn -- I thought I had just learned a quicker way to check query performance. Just can't trust anything any more.
Thanks!