April 6, 2006 at 4:50 pm
How do you measure the performance of two queries? In the abstract, how do you determine which works better or is the better solution? In case you missed it, that's the poll question, and if you want to answer specifically, keep reading.
I got this from someone recently as a quiz for prospective DBAs.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[quiz1]([name] [varchar](50) NOT NULL, [start_dt] [char](8) NOT NULL,
[end_dt] [char](8) NOT NULL, [rownum] [int] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_quiz1] PRIMARY KEY CLUSTERED
( [name] ASC, [start_dt] ASC, [end_dt] ASC )
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
go
set nocount on
insert into quiz1(name, start_dt, end_dt) values('Brian', '19961001', '19981230')
insert into quiz1(name, start_dt, end_dt) values('Brian', '19981230', '19990605')
insert into quiz1(name, start_dt, end_dt) values('Brian', '19990605', '19991002')
insert into quiz1(name, start_dt, end_dt) values('Brian', '20000201', '20000301')
insert into quiz1(name, start_dt, end_dt) values('Brian', '20000501', '99991231')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19910106', '19910731')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19910731', '19940201')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19940201', '19941021')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19941021', '19961031')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19990501', '20000331')
insert into quiz1(name, start_dt, end_dt) values('Charles', '20000331', '99991231')
insert into quiz1(name, start_dt, end_dt) values('John', '19980103', '19980727')
insert into quiz1(name, start_dt, end_dt) values('John', '19980727', '20000103')
insert into quiz1(name, start_dt, end_dt) values('John', '20000103', '20000601')
insert into quiz1(name, start_dt, end_dt) values('John', '20000601', '99991231')
set nocount off
go
It represents some employment periods. The employees left and came back and the question revolves around finding the time periods during which the person was unemployed. The images below show the areas you are trying to find.
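For reference, these are the gaps the queries should return, worked out by hand from the sample data above (John's periods are contiguous, so he has none):

```sql
-- Expected unemployment gaps in the sample data:
-- name    | gap start | gap end
-- Brian   | 19991002  | 20000201
-- Brian   | 20000301  | 20000501
-- Charles | 19961031  | 19990501
```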
Now I know that most of you can solve this. There are a variety of ways, but I was interested in two solutions. One was submitted by a prospective DBA and one was mentioned by the company as being "better." Here are the two solutions, not saying which is which.
-- Solution 1
select a.name 'Emp_Name', a.end_dt 'StartDate', b.start_dt 'EndDate'
from quiz1 a,
quiz1 b
where b.rownum = a.rownum +1
and b.name = a.name
and a.end_dt != b.start_dt
go
-- Solution 2
select name, end_r as start_date, start_l as end_date
from (select name,
max(case cnt when 1 then start_dt end) as start_l,
max(case cnt when 1 then end_dt end) as end_l,
max(case cnt when 2 then start_dt end) as start_r,
max(case cnt when 2 then end_dt end) as end_r
from (select name, start_dt, end_dt, rownum AS rn
from quiz1
) x cross join (select 1 as cnt union all select 2) y
where y.cnt <= 2
group by name, case cnt when 1 then rn else rn+1 end
) a
where end_r <> start_l
So which one do you think is better? How do you define better? It's an interesting question to me because I think the answer is ... and I hate this ... it depends.
Steve Jones
April 7, 2006 at 1:59 am
OK, how's this?
Note: Table creation modified to prove query does not require the rownum to be sequential or a column in the table.
CREATE TABLE quiz1 (name varchar(50) NULL, start_dt datetime NULL, end_dt datetime NULL) ON [PRIMARY]
GO
SET NOCOUNT ON
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19961001', '19981230')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19981230', '19990605')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '20000501', '99991231')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19910731', '19940201')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19940201', '19941021')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19990501', '20000331')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '20000331', '99991231')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '19980727', '20000103')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '20000601', '99991231')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19990605', '19991002')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19941021', '19961031')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '19980103', '19980727')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '20000201', '20000301')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19910106', '19910731')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '20000103', '20000601')
SET NOCOUNT OFF
GO
PRINT 'Solution 3'
SELECT a.name AS Emp_Name, a.end_dt AS StartDate, b.start_dt AS EndDate
-- LEFT JOIN + COUNT(q2.start_dt) gives the first period per name rownum 0;
-- with an INNER JOIN the first period is dropped entirely and a gap between
-- the first two periods would be missed.
FROM (SELECT COUNT(q2.start_dt) AS rownum, q1.name, q1.start_dt, q1.end_dt FROM quiz1 AS q1
LEFT JOIN quiz1 AS q2 ON q1.name = q2.name
AND q1.start_dt > q2.start_dt
GROUP BY q1.name, q1.start_dt, q1.end_dt) AS a
INNER JOIN (SELECT COUNT(q2.start_dt) AS rownum, q1.name, q1.start_dt, q1.end_dt FROM quiz1 AS q1
LEFT JOIN quiz1 AS q2 ON q1.name = q2.name
AND q1.start_dt > q2.start_dt
GROUP BY q1.name, q1.start_dt, q1.end_dt) AS b
ON a.rownum+1 = b.rownum
AND a.name = b.name
AND a.end_dt != b.start_dt
GO
Andy
April 7, 2006 at 2:25 am
Ok,
My logic was that if there is a gap in employment, the end date of the previous employment period must be less than the start date of the next one. I ignored rownum because, although it is an identity column, you can't rely on the records having been inserted in date order in all cases.
Maybe I am being a bit simplistic and inefficient, though, as I am using three copies of quiz1; maybe I could eliminate one by cunning use of GROUP BY and HAVING.
Anyhow, I got the following...
select q1.name,q1.end_dt 'From', q2.start_dt 'To'
from quiz1 q1
join quiz1 q2
on (q1.name=q2.name
and q1.end_dt < q2.start_dt
and q2.start_dt = (select min(start_dt) from quiz1 where name=q1.name and start_dt>= q1.end_dt))
order by q1.name,q1.end_dt
David
PS: Is that a new bike Steve?
If it ain't broke, don't fix it...
April 7, 2006 at 4:26 am
Hard to answer the question in the article... I don't like either of the two solution variants. Generally, I would prefer Solution 1 because of its simplicity, but in my opinion it is incorrect because it only works when entries are made in a certain order. In reality, you can never rely on that.
I'm not sure whether I've missed something, but it seems to me that the answer is as simple as
SELECT a.[name], a.end_dt as unempl_start, MIN(b.start_dt) as unempl_end
FROM quiz1 a
JOIN quiz1 b ON b.[name]=a.[name] AND b.start_dt >= a.end_dt
GROUP BY a.[name], a.end_dt
HAVING MIN(b.start_dt) <> a.end_dt
The only assumption I have made is that two employment periods are never concurrent (i.e., there cannot be two start dates with no end date between them for a given employee). If that's not true, the query will give wrong results.
EDIT: Only now have I looked at David's post, and it seems we used the same logic; it just happened that I implemented the GROUP BY and HAVING David mentions in his post.
April 7, 2006 at 5:12 am
In terms of the question posed, I'd have to say I prefer the first one. I'm a big fan of keeping things as simple as possible, provided that you cover all possible data scenarios. In this case (as Vladan mentions), this means the solution needs to be modified to group the data so that the employment periods you're trying to match are always sequential (i.e., grouping primarily on name and sub-grouping on start date).
Given that the first solution doesn't cover the scenario of non-sequential data, it could be argued that the second solution is "better": they both perform adequately on the data described, but the second one will keep working even if the data arrives in a different order (a scenario we could regard as likely to happen). But I'd far rather see the simplicity of the first one, as shown in Vladan's solution.
(As an aside, note that if we assume an employee cannot have concurrent periods of employment, we need only use one of the start or end dates, and they will result in the same grouping. I have chosen the start date to cover the possibility that the employee is currently employed, and thus his most recent period has a start but no end, which could result in incorrect order if we grouped only on end date.)
-----------------
C8H10N4O2
April 7, 2006 at 5:23 am
Actually, looking at the data again, it seems to me that it is a bit unrealistic, in that consecutive periods of employment start and end on the same day. I would expect one period to end the day before the next period started, e.g. "Feb 21 - Mar 17", "Mar 18 - May 25".
In this (IMO, more realistic) scenario, we'd have to adjust all the queries to check instead for cases where b.start_dt was more than one day after a.end_dt.
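For instance, the gap test in Solution 1 could be adjusted along these lines (just a sketch; it relies on T-SQL implicitly converting the char(8) 'yyyymmdd' values to datetime for DATEDIFF):

```sql
-- Treat it as a gap only when the next period starts more than
-- one day after the previous period ends.
select a.name 'Emp_Name', a.end_dt 'StartDate', b.start_dt 'EndDate'
from quiz1 a
join quiz1 b
  on b.name = a.name
 and b.rownum = a.rownum + 1
where datediff(day, a.end_dt, b.start_dt) > 1
```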
I realise it's just a hypothetical exercise for the purposes of a test, but if I were grading the responses, that observation would certainly earn bonus marks...
-----------------
C8H10N4O2
April 7, 2006 at 5:47 am
Thanks Michael for catching the grouping; I forgot to clear my GROUP BY after I removed the columns I had there for verification purposes. It really is enough to group on one date column, either start or end. Query edited. I used the end date for grouping, otherwise I'd have to rewrite the query... in the supplied data, the end date is always '99991231' for rows that show current employment (that was an assumption too, but I forgot to mention it in my post) - but I have to agree that the start date would probably be more reliable.
As to the end and the new start being on the same date, well, that depends on how the periods are defined. I've seen it both ways in reality. There is even a possibility that the new employment could start on the same day OR the next day, and both would be considered uninterrupted employment... maybe even if the old period ends on a Friday and the new one starts on the following Monday! However, in this respect I think the definition was precise enough for test purposes.
April 7, 2006 at 5:51 am
Given the observation that the first solution relies on the ordering of the data, the second solution is the better one as it is more complete.
But let's suppose that you don't have the luxury of modifying the underlying tables to arrive at a more elegant solution. Perhaps, for some obscure reason lost to the mists of time, the row ordering is required because other parts of the database or business solution depend on it. In that case, both solutions are correct, but that still leaves us with the original question.
From a maintenance (or should I say a debugging at 2:30 am) perspective, I'd prefer solution one. Surgery is going to be easier in this case because the query is simpler.
If the solutions are agreed to be accurate given the limitations of the original DDL, and the performance is roughly equivalent, then Occam's Razor applies and the first solution is better.
------------
Buy the ticket, take the ride. -- Hunter S. Thompson
April 7, 2006 at 6:48 am
When there's a trade-off between simplicity and obvious function on one hand, and performance with perhaps obscure functionality on the other, the choice can be determined by eventual use.
If it is a rarely performed function, simplicity and obviousness will be much more helpful, especially if others need to be able to use the code as well.
If it's in daily use, on big data sets then the opposite may be true.
...
-- FORTRAN manual for Xerox Computers --
April 7, 2006 at 7:04 am
Very interesting queries have been posted here.
Here is another one:
select * from
(
select n1.name,n1.end_dt start_unemp
, (select min(n2.start_dt) from quiz1 n2 where N1.NAME=n2.name and n2.start_dt>n1.start_dt) End_unemp
from quiz1 n1
) kk
where start_unemp<>end_unemp
April 7, 2006 at 8:14 am
This has been said already, and alternate examples given, but the first thing that jumped out at me is that both given solutions rely on the rownum column, and therefore rely on the data being ordered by name, start_dt. I would say that neither solution is any good, then, and here's why. I understand that this is a hypothetical example, but I'll even play along. Given the same hypothetical example, do the initial inserts. Now, pretend that Brian quits. You'll have to update his end_dt, as follows:
update dbo.quiz1
set [end_dt] = '20060407'
where [name] = 'Brian'
and [end_dt] = '99991231'
Now, suppose he comes back in a week:
insert into quiz1(name, start_dt, end_dt) values('Brian', '20060414', '99991231')
Neither of the given solutions will see this week-long unemployment period. Therefore, I can't recommend either of them. However, any of the solutions posted here that don't rely on rownum will work.
If you're forcing me to choose one, though, I'd pick the first solution. They're both flawed in that they rely on rownum, but one is very easy to read and one is not. The only thing I would change is how the join is defined. Microsoft recommends that the join conditions be defined in the from clause, not in the where clause.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/acdata/ac_8_qd_09_66b7.asp
I learned SQL the non-recommended way, not even 5 years ago, but quickly changed how I defined my joins once I saw the other way because it's so much easier to tell how tables are joined. For joining 2 tables it's not bad, but when you join 15 tables, it's a lot more obvious which way is easier to read. If you still define your join conditions in the where clause, I challenge you to try to learn the recommended way. It doesn't really take long to figure it out and once you do, you'll never go back. I promise...
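As a sketch, here is Solution 1 with the join conditions moved into the FROM clause; the logic is unchanged, only the syntax differs:

```sql
-- Same query as Solution 1, written with explicit join syntax.
select a.name 'Emp_Name', a.end_dt 'StartDate', b.start_dt 'EndDate'
from quiz1 a
inner join quiz1 b
    on b.rownum = a.rownum + 1
   and b.name = a.name
where a.end_dt != b.start_dt
```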
April 7, 2006 at 8:15 am
Contrary to several posts above, both solutions in the original post rely on the ordering of the input data.
Given that, I think Solution 1 is better just because it's simpler, more readable, and does not contain extraneous code. Efficiency is a non-issue with this many rows. I am curious to know which one the company thought was better.
Solution 2 has extraneous code complicating it: "(select name, start_dt, end_dt, rownum AS rn from quiz1)" can be replaced with just "quiz1", with the "rn" references changed to "rownum", and "where y.cnt <= 2" can be removed; it serves no purpose. Solution 2 also uses poor column names for its derived tables, which makes it hard to figure out how it works or what the author was trying to do.
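With those simplifications applied, Solution 2 reduces to something like this (a sketch, keeping only the two columns the outer query actually uses):

```sql
-- Each row contributes twice (cnt = 1 and cnt = 2); grouping on
-- rownum for cnt = 1 and rownum + 1 for cnt = 2 pairs each period
-- with its predecessor, so end_r <> start_l flags a gap.
select name, end_r as start_date, start_l as end_date
from (select name,
             max(case cnt when 1 then start_dt end) as start_l,
             max(case cnt when 2 then end_dt end) as end_r
      from quiz1
      cross join (select 1 as cnt union all select 2) y
      group by name, case cnt when 1 then rownum else rownum + 1 end
     ) a
where end_r <> start_l
```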
Only solution that I personally like here is one from Vladan; it is simple, understandable, and does not rely on ordering of the input data.
April 7, 2006 at 8:26 am
Solution 1 is visually more elegant.
April 7, 2006 at 8:55 am
Talking only within the given scenario, i.e., without considering a better solution or table structure... Solution 1 is the best one. Why?
By using SET SHOWPLAN_ALL ON we can get statistics, numbers to compare:
Totals and normalized averages (1.00 = best; the higher value is the better):

Solution | Estimate Rows | Estimate IO  | Estimate CPU | Rows | IO   | CPU
1        | 18.34         | 0.043907001  | 0.000237855  | 1.00 | 1.00 | 1.00  <--- the better
2        | 189.586666    | 0.048839762  | 0.000784538  | 0.10 | 0.90 | 0.30  <--- the worse

Solution 1 is ~10.33733184 times better on Estimate Rows, ~1.112345662 times better on Estimate IO, and ~3.298396596 times better on Estimate CPU.
April 7, 2006 at 10:17 am
The real 'elegance' in this thread is Edmundo's post.
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."