January 11, 2008 at 7:20 am
My approach to selecting only the latest (most recent) entry per employee, where multiple entries might exist (and might have duplicate dates), has been:
/*
(Note, the following is non-executable pseudo code)
Sorry, I don't have AdventureWorks installed.
*/
WITH SEQ AS
(
    SELECT ROW_NUMBER()
             OVER (PARTITION BY SSN
                   ORDER BY [Date] DESC) AS RECENT
          ,[SSN]
          ,[Date]
    FROM [EmplData]
)
SELECT Emp.*
FROM [EmplData] AS Emp
INNER JOIN SEQ
        ON Emp.SSN = SEQ.SSN
       AND Emp.[Date] = SEQ.[Date]
WHERE SEQ.RECENT = 1
/*
I would just as soon avoid the join, and just do:
*/
SELECT ROW_NUMBER()
         OVER (PARTITION BY SSN
               ORDER BY [Date] DESC) AS RECENT
      ,[SSN]
      ,[Date]
FROM [EmplData] AS Emp
WHERE RECENT = 1
--BUT, as would be expected:
Msg 207, Level 16, State 1, Line 7
Invalid column name 'RECENT'.
/*
Since a column alias usually can't be referenced in the WHERE clause,
I attempted repeating the expression instead:
*/
SELECT *
FROM [EmplData] AS Emp
WHERE ROW_NUMBER()
        OVER (PARTITION BY SSN
              ORDER BY [Date] DESC) = 1
--BUT, alas:
Msg 4108, Level 15, State 1, Line 1
Windowed functions can only appear in the SELECT or ORDER BY clauses.
Does any expert have a clean/efficient approach to this requirement?
January 11, 2008 at 7:32 am
Why are you resorting to a RANKing function at all?
Do you have cases where two records are tied on both SSN and Date?
Typically this would be a join to a derived table, where the derived table gives you the MAX() in each set:
SELECT emp.*
FROM Employee As emp
INNER JOIN
(
SELECT SSN, MAX(Date) As MostRecentDate
FROM Employee
GROUP BY SSN
) dt
ON (dt.SSN = emp.SSN AND
dt.MostRecentDate = emp.Date)
January 11, 2008 at 7:32 am
What about:
SELECT x.RECENT
, x.[SSN]
, x.[Date]
FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY SSN ORDER BY [Date] DESC ) AS RECENT
, [SSN]
, [Date]
FROM [EmplData] AS Emp
) AS x
WHERE x.RECENT = 1
Andras
January 11, 2008 at 7:42 am
I agree with Andras. Also, if you simply add the extra fields from Employee that you actually need in the final query to the CTE definition, you shouldn't need a join back to Employee. That should make things quite a bit more efficient.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 11, 2008 at 7:50 am
PW, thanks.
This is one of those cases where I am unwilling to rely on the fact that I "know" that (today) there should be no records that duplicate both SSN and date. I also "know" that assumptions based on today's knowledge are likely to become false tomorrow, and to result in spurious duplicates from the join.
Just being a skeptic, I guess.
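(If duplicates like that ever do show up, I suppose a tiebreaker in the ORDER BY would still force exactly one row per SSN -- a sketch, assuming a hypothetical unique EmplID column:)
-- Sketch only: EmplID is a hypothetical unique column used as a tiebreaker,
-- so exactly one row per SSN gets RECENT = 1 even when two rows share
-- the same SSN and [Date].
WITH SEQ AS
(
    SELECT ROW_NUMBER()
             OVER (PARTITION BY SSN
                   ORDER BY [Date] DESC, EmplID DESC) AS RECENT
          ,[SSN]
          ,[Date]
    FROM [EmplData]
)
SELECT *
FROM SEQ
WHERE RECENT = 1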
Andras, thanks also! Your approach strikes me as the "professional" way to do it. I'm going to compare the query plans and see if the engine can take advantage of any differences between the two queries.
(And Matt, I really don't use the *, but without your reminder I would probably return more columns than I really needed. Thanks!)
I'm sure glad you "smart guys" hang around this very valuable site!
January 11, 2008 at 8:04 am
Actually Jim -
I was just trying to point out that you could avoid the "self join" by adding the display fields you want into the CTE declaration. That way you don't need to declare the CTE and then link back into Employee to get more data.
So - for example, if you were looking to also pull the name and address stuff for the latest SSN, you could simply do:
WITH SEQ (RECENT, SSN, [name], [address], [Date]) AS
(
    SELECT ROW_NUMBER()
             OVER (PARTITION BY SSN
                   ORDER BY [Date] DESC) AS RECENT
          ,[SSN]
          ,[name]
          ,[address]
          ,[Date]
    FROM [EmplData]
)
SELECT *
FROM SEQ
WHERE RECENT = 1
No loop back into the original employee table.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
January 11, 2008 at 8:10 am
Thanks Matt! And that also avoids the confusion created by the sub-query (although it makes it T-SQL specific). Yet a third approach to analyze and time.
January 11, 2008 at 8:17 am
Jim Russell (1/11/2008)
Thanks Matt! And that also avoids the confusion created by the sub-query (although it makes it T-SQL specific). Yet a third approach to analyze and time.
And it is very, very likely that the execution plans for the above CTE and non-CTE versions will be exactly the same (it seems to be the case with a toy database).
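(If you want to check this on your own tables, one way is to capture the text plans and compare them -- a sketch; SET SHOWPLAN_TEXT has to be the only statement in its batch:)
SET SHOWPLAN_TEXT ON
GO
-- Run the derived-table version (and then the CTE version) and
-- compare the plan text line by line; neither query is executed
-- while SHOWPLAN_TEXT is on.
SELECT x.RECENT, x.[SSN], x.[Date]
FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY SSN ORDER BY [Date] DESC ) AS RECENT
            , [SSN]
            , [Date]
       FROM [EmplData]
     ) AS x
WHERE x.RECENT = 1
GO
SET SHOWPLAN_TEXT OFF
GO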
Regards,
Andras
January 11, 2008 at 9:04 am
Andras -- you are right again.
The real surprise, which at first made me worry I was doing something dumb, was that the version with only the two fields of interest in the SEQ CTE plus the join was the fastest, in both the estimated and actual execution plans:
Q1 - 49% --Q1 Using sub-query which returns all fields (*) from target
Q2 - 2% --Q2 Just SSN and date in SEQ, with inner join against all fields (*) of target
Q3 - 49% --Q3 All fields (*) from target in SEQ, SELECT * FROM SEQ
(Q3 did not take advantage of Matt's suggestion to limit the columns, just to keep an even playing field. I'm sure his was good advice, although in this case I needed to duplicate the target table completely, just eliminating the extraneous records.)
The Q2 difference (which is where I started) was flattering, until I realized that one of two things was going on:
1. I was still doing something dumb (always a strong possibility)
2. The query was able to take advantage of the indexes I had on the "SSN" and "Date" columns. (Which I neglected to mention -- sketched below.)
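(For reference, the kind of index that lines up with this query -- a hypothetical sketch, not my actual definitions:)
-- Hypothetical sketch: the index name and exact shape are assumptions.
-- Keying on (SSN, [Date] DESC) matches PARTITION BY SSN ORDER BY [Date] DESC,
-- so the engine can pick off the top row per SSN without a separate sort.
CREATE NONCLUSTERED INDEX IX_EmplData_SSN_Date
    ON [EmplData] (SSN, [Date] DESC)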
My timing strategy was to combine all three forms, separated by "GO"s, in a single batch. But, if anyone is willing to spend more time on this (even if just to prove option 1 above), I would be glad to email them the actual statement I used (this time with actual table and column names) and the resulting estimated and actual execution plans.
Again, everybody's help has been invaluable.
January 11, 2008 at 9:01 pm
You CANNOT rely on the "% of Batch" listed in estimated or actual execution plans to determine which query is best... it can "lie" because it can't always see everything...
To demonstrate, let's create a rather large test table...
--drop table jbmtest
--===== Create and populate a 1,000,000 row test table.
-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1,
Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.JBMTest
ADD CONSTRAINT PK_JBMTest_RowNum PRIMARY KEY CLUSTERED (RowNum)
Now, some test code...
SELECT TOP 10 *
FROM dbo.JBMTest
ORDER BY SomeInt
SET ROWCOUNT 10
SELECT *
FROM dbo.JBMTest
ORDER BY SomeInt
SET ROWCOUNT 0
Notice that both the estimated and the actual execution plans are identical for both queries and that both say they represent 50% of the batch... run each one separately and see which one is the fastest... the first query runs in about 3 seconds on my machine... the second takes 8 seconds...
Like I said, the execution plan "lies" insofar as % of Batch and CANNOT be used to determine the most effective query.
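If you want numbers you can trust, turn the statistics on and run each query by itself... a sketch, one way to do it:
-- Run each candidate in its own batch with timing on, then compare the
-- "SQL Server Execution Times" and the logical reads in the Messages tab.
SET STATISTICS TIME ON
SET STATISTICS IO ON

SELECT TOP 10 *
FROM dbo.JBMTest
ORDER BY SomeInt

SET STATISTICS TIME OFF
SET STATISTICS IO OFF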
--Jeff Moden
Change is inevitable... Change for the better is not.
January 14, 2008 at 4:38 am
Damn -- I thought I had just learned a quicker way to check query performance. Just can't trust anything any more.
Thanks!