April 6, 2006 at 4:50 pm
How do you measure the performance of two queries? In the abstract, how do you determine which works better or is the better solution? In case you missed it, that's the poll question, and if you want to answer specifically, keep reading.
I got this from someone recently as a quiz for prospective DBAs.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[quiz1]([name] [varchar](50) NOT NULL, [start_dt] [char](8) NOT NULL,
[end_dt] [char](8) NOT NULL, [rownum] [int] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_quiz1] PRIMARY KEY CLUSTERED
( [name] ASC, [start_dt] ASC, [end_dt] ASC )
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
go
set nocount on
insert into quiz1(name, start_dt, end_dt) values('Brian', '19961001', '19981230')
insert into quiz1(name, start_dt, end_dt) values('Brian', '19981230', '19990605')
insert into quiz1(name, start_dt, end_dt) values('Brian', '19990605', '19991002')
insert into quiz1(name, start_dt, end_dt) values('Brian', '20000201', '20000301')
insert into quiz1(name, start_dt, end_dt) values('Brian', '20000501', '99991231')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19910106', '19910731')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19910731', '19940201')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19940201', '19941021')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19941021', '19961031')
insert into quiz1(name, start_dt, end_dt) values('Charles', '19990501', '20000331')
insert into quiz1(name, start_dt, end_dt) values('Charles', '20000331', '99991231')
insert into quiz1(name, start_dt, end_dt) values('John', '19980103', '19980727')
insert into quiz1(name, start_dt, end_dt) values('John', '19980727', '20000103')
insert into quiz1(name, start_dt, end_dt) values('John', '20000103', '20000601')
insert into quiz1(name, start_dt, end_dt) values('John', '20000601', '99991231')
set nocount off
go
It represents some employment periods. The employees left and came back and the question revolves around finding the time periods during which the person was unemployed. The images below show the areas you are trying to find.
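For reference, these are the gaps the queries should return, worked out by hand from the sample data above (John's periods are contiguous, so he has none):

```sql
-- Expected unemployment gaps in the sample data:
-- name    | gap start | gap end
-- Brian   | 19991002  | 20000201
-- Brian   | 20000301  | 20000501
-- Charles | 19961031  | 19990501
```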
Now I know that most of you can solve this. There are a variety of ways, but I was interested in two solutions. One was submitted by a prospective DBA and one was mentioned by the company as being "better." Here are the two solutions, not saying which is which.
-- Solution 1
select a.name 'Emp_Name', a.end_dt 'StartDate', b.start_dt 'EndDate'
from quiz1 a,
quiz1 b
where b.rownum = a.rownum +1
and b.name = a.name
and a.end_dt != b.start_dt
go
-- Solution 2
select name, end_r as start_date, start_l as end_date
from (select name,
max(case cnt when 1 then start_dt end) as start_l,
max(case cnt when 1 then end_dt end) as end_l,
max(case cnt when 2 then start_dt end) as start_r,
max(case cnt when 2 then end_dt end) as end_r
from (select name, start_dt, end_dt, rownum AS rn
from quiz1
) x cross join (select 1 as cnt union all select 2) y
where y.cnt <= 2
group by name, case cnt when 1 then rn else rn+1 end
) a
where end_r <> start_l
So which one do you think is better? How do you define better? It's an interesting question to me because I think the answer is ... and I hate this ... it depends.
Steve Jones
April 7, 2006 at 1:59 am
OK, how's this?
Note: Table creation modified to prove query does not require the rownum to be sequential or a column in the table.
CREATE TABLE quiz1 (name varchar(50) NULL, start_dt datetime NULL, end_dt datetime NULL) ON [PRIMARY]
GO
SET NOCOUNT ON
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19961001', '19981230')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19981230', '19990605')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '20000501', '99991231')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19910731', '19940201')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19940201', '19941021')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19990501', '20000331')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '20000331', '99991231')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '19980727', '20000103')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '20000601', '99991231')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19990605', '19991002')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19941021', '19961031')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '19980103', '19980727')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '20000201', '20000301')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19910106', '19910731')
INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '20000103', '20000601')
SET NOCOUNT OFF
GO
PRINT 'Solution 3'
SELECT a.name AS Emp_Name, a.end_dt AS StartDate, b.start_dt AS EndDate
-- LEFT JOIN + COUNT(q2.start_dt) gives the first period per name rownum 0;
-- with an INNER JOIN the first period is dropped entirely and a gap between
-- the first two periods would be missed.
FROM (SELECT COUNT(q2.start_dt) AS rownum, q1.name, q1.start_dt, q1.end_dt FROM quiz1 AS q1
LEFT JOIN quiz1 AS q2 ON q1.name = q2.name
AND q1.start_dt > q2.start_dt
GROUP BY q1.name, q1.start_dt, q1.end_dt) AS a
INNER JOIN (SELECT COUNT(q2.start_dt) AS rownum, q1.name, q1.start_dt, q1.end_dt FROM quiz1 AS q1
LEFT JOIN quiz1 AS q2 ON q1.name = q2.name
AND q1.start_dt > q2.start_dt
GROUP BY q1.name, q1.start_dt, q1.end_dt) AS b
ON a.rownum+1 = b.rownum
AND a.name = b.name
AND a.end_dt != b.start_dt
GO
Andy
April 7, 2006 at 2:25 am
Ok,
My logic was that if there is a gap in employment, the end date of the previous employment period must be less than the start date of the next one. I ignored rownum because, although it is an identity column, you can't rely on the records having been inserted in date order in all cases.
Maybe I am being a bit simplistic and inefficient, though, as I am using three copies of quiz1; maybe I could eliminate one by cunning use of GROUP BY and HAVING.
Anyhow, I got the following...
select q1.name,q1.end_dt 'From', q2.start_dt 'To'
from quiz1 q1
join quiz1 q2
on (q1.name=q2.name
and q1.end_dt < q2.start_dt
and q2.start_dt = (select min(start_dt) from quiz1 where name=q1.name and start_dt>= q1.end_dt))
order by q1.name,q1.end_dt
David
PS: Is that a new bike Steve?
If it ain't broke, don't fix it...
April 7, 2006 at 4:26 am
Hard to answer the question in the article... I don't like either of the two solution variants. Generally, I would prefer Solution 1 because of its simplicity, but in my opinion it is incorrect because it only works when entries are made in a certain order. In reality, you can never rely on that.
I'm not sure whether I've missed something, but it seems to me that the answer is as simple as
SELECT a.[name], a.end_dt as unempl_start, MIN(b.start_dt) as unempl_end
FROM quiz1 a
JOIN quiz1 b ON b.[name]=a.[name] AND b.start_dt >= a.end_dt
GROUP BY a.[name], a.end_dt
HAVING MIN(b.start_dt) <> a.end_dt
The only assumption I have made is that two employment periods are never concurrent (i.e., there cannot be two start dates with no end date between them for a given employee). If that's not true, the query will give wrong results.
EDIT: Only now have I looked at David's post, and it seems we used the same logic; it just happened that I implemented the GROUP BY and HAVING David mentions in his post.
April 7, 2006 at 5:12 am
In terms of the question posed, I'd have to say I prefer the first one. I'm a big fan of keeping things as simple as possible, provided that you cover all possible data scenarios. In this case (as Vladan mentions), this means the solution needs to be modified to group the data so that the employment periods you're trying to match are always sequential (i.e., grouping primarily on name and sub-grouping on start date).
Given that the first solution doesn't cover the scenario of non-sequential data, it could be argued that the second solution is "better": they both perform adequately on the data described, but the second one will keep working even if the data arrives in a different order (a scenario we could regard as likely to happen). But I'd far rather see the simplicity of the first one, as shown in Vladan's solution.
(As an aside, note that if we assume an employee cannot have concurrent periods of employment, we need only use one of the start or end dates, and they will result in the same grouping. I have chosen the start date to cover the possibility that the employee is currently employed, and thus his most recent period has a start but no end, which could result in incorrect order if we grouped only on end date.)
-----------------
C8H10N4O2
April 7, 2006 at 5:23 am
Actually, looking at the data again, it seems to me that it is a bit unrealistic, in that consecutive periods of employment start and end on the same day. I would expect one period to end the day before the next period started, e.g. "Feb 21 - Mar 17", "Mar 18 - May 25".
In this (IMO, more realistic) scenario, we'd have to adjust all the queries to check instead for cases where b.start_dt was more than one day after a.end_dt.
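For instance, the gap test in Solution 1 could be adjusted along these lines (just a sketch; it relies on T-SQL implicitly converting the char(8) 'yyyymmdd' values to datetime for DATEDIFF):

```sql
-- Treat it as a gap only when the next period starts more than
-- one day after the previous period ends.
select a.name 'Emp_Name', a.end_dt 'StartDate', b.start_dt 'EndDate'
from quiz1 a
join quiz1 b
  on b.name = a.name
 and b.rownum = a.rownum + 1
where datediff(day, a.end_dt, b.start_dt) > 1
```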
I realise it's just a hypothetical exercise for the purposes of a test, but if I were grading the responses, that observation would certainly earn bonus marks...
-----------------
C8H10N4O2
April 7, 2006 at 5:47 am
Thanks Michael for catching the grouping; I forgot to clear my GROUP BY after I removed the columns I had there for verification purposes. It really is enough to group on one date column, either start or end. Query edited. I used the end date for grouping, otherwise I'd have to rewrite the query... in the supplied data, the end date is always '99991231' for rows that show current employment (that was an assumption too, but I forgot to mention it in my post) - but I have to agree that the start date would probably be more reliable.
As to the end and the new start being on the same date, well, that depends on how the periods are defined. I've seen it both ways in reality. There is even a possibility that the new employment could start on the same day OR the next day, and both would be considered uninterrupted employment... maybe even if the old period ends on a Friday and the new one starts on the following Monday! However, in this respect I think the definition was precise enough for test purposes.
April 7, 2006 at 5:51 am
Given the observation that the first solution relies on the ordering of the data, the second solution is the better one as it is more complete.
But let's suppose that you don't have the luxury of modifying the underlying tables to arrive at a more elegant solution. Perhaps, for some obscure reason lost to the mists of time, the row ordering is required because other parts of the database or business solution depend on it. In that case, both solutions are correct, but that still leaves us with the original question.
From a maintenance (or should I say a debugging at 2:30 am) perspective, I'd prefer solution one. Surgery is going to be easier in this case because the query is simpler.
If the solutions are agreed to be accurate given the limitations of the original DDL, and the performance is roughly equivalent, then Occam's Razor applies and the first solution is better.
------------
Buy the ticket, take the ride. -- Hunter S. Thompson
April 7, 2006 at 6:48 am
When there's a trade-off between simplicity and obvious function on one hand, and performance with perhaps obscure functionality on the other, the choice can be determined by eventual use.
If it is a rarely performed function, simplicity and obviousness will be much more helpful, especially if others need to be able to use the code as well.
If it's in daily use, on big data sets then the opposite may be true.
...
-- FORTRAN manual for Xerox Computers --
April 7, 2006 at 7:04 am
Very interesting queries have been posted here.
Here is another one:
select * from
(
select n1.name,n1.end_dt start_unemp
, (select min(n2.start_dt) from quiz1 n2 where N1.NAME=n2.name and n2.start_dt>n1.start_dt) End_unemp
from quiz1 n1
) kk
where start_unemp<>end_unemp
April 7, 2006 at 8:14 am
This has been said already, and alternate examples given, but the first thing that jumped out at me is that both given solutions rely on the rownum column, and therefore rely on the data being ordered by name, start_dt. I would say that neither solution is any good, then, and here's why. I understand that this is a hypothetical example, but I'll even play along. Given the same hypothetical example, do the initial inserts. Now, pretend that Brian quits. You'll have to update his end_dt, as follows:
update dbo.quiz1
set [end_dt] = '20060407'
where [name] = 'Brian'
and [end_dt] = '99991231'
Now, suppose he comes back in a week:
insert into quiz1(name, start_dt, end_dt) values('Brian', '20060414', '99991231')
Neither of the given solutions will see this week-long unemployment period. Therefore, I can't recommend either of them. However, any of the solutions posted here that don't rely on rownum will work.
If you're forcing me to choose one, though, I'd pick the first solution. They're both flawed in that they rely on rownum, but one is very easy to read and one is not. The only thing I would change is how the join is defined. Microsoft recommends that the join conditions be defined in the from clause, not in the where clause.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/acdata/ac_8_qd_09_66b7.asp
I learned SQL the non-recommended way, not even 5 years ago, but quickly changed how I defined my joins once I saw the other way because it's so much easier to tell how tables are joined. For joining 2 tables it's not bad, but when you join 15 tables, it's a lot more obvious which way is easier to read. If you still define your join conditions in the where clause, I challenge you to try to learn the recommended way. It doesn't really take long to figure it out and once you do, you'll never go back. I promise...
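As a sketch, here is Solution 1 with the join conditions moved into the FROM clause; the logic is unchanged, only the syntax differs:

```sql
-- Same query as Solution 1, written with explicit join syntax.
select a.name 'Emp_Name', a.end_dt 'StartDate', b.start_dt 'EndDate'
from quiz1 a
inner join quiz1 b
    on b.rownum = a.rownum + 1
   and b.name = a.name
where a.end_dt != b.start_dt
```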
April 7, 2006 at 8:15 am
Contrary to several posts above, both solutions in the original post rely on the ordering of the input data.
Given that, I think Solution 1 is better just because it's simpler, more readable, and does not contain extraneous code. Efficiency is a non-issue with this many rows. I am curious to know which one the company thought was better.
Solution 2 has extraneous code complicating it: "(select name, start_dt, end_dt, rownum AS rn from quiz1)" can be replaced with just "quiz1", with the "rn" references changed to "rownum", and "where y.cnt <= 2" can be removed; it serves no purpose. Solution 2 also uses poor column names for its derived tables, which makes it hard to figure out how it works or what the author was trying to do.
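With those simplifications applied, Solution 2 reduces to something like this (a sketch, keeping only the two columns the outer query actually uses):

```sql
-- Each row contributes twice (cnt = 1 and cnt = 2); grouping on
-- rownum for cnt = 1 and rownum + 1 for cnt = 2 pairs each period
-- with its predecessor, so end_r <> start_l flags a gap.
select name, end_r as start_date, start_l as end_date
from (select name,
             max(case cnt when 1 then start_dt end) as start_l,
             max(case cnt when 2 then end_dt end) as end_r
      from quiz1
      cross join (select 1 as cnt union all select 2) y
      group by name, case cnt when 1 then rownum else rownum + 1 end
     ) a
where end_r <> start_l
```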
Only solution that I personally like here is one from Vladan; it is simple, understandable, and does not rely on ordering of the input data.
April 7, 2006 at 8:26 am
Solution 1 is visually more elegant.
April 7, 2006 at 8:55 am
Talking only within the given scenario, i.e., without considering a better solution or table structure... Solution 1 is the best one. Why?
By using SET SHOWPLAN_ALL ON we can get statistics, numbers to compare:
Totals and normalized averages (1.00 = best; the higher value is the better):

Solution | Estimate Rows | Estimate IO  | Estimate CPU | Rows | IO   | CPU
1        | 18.34         | 0.043907001  | 0.000237855  | 1.00 | 1.00 | 1.00  <--- the better
2        | 189.586666    | 0.048839762  | 0.000784538  | 0.10 | 0.90 | 0.30  <--- the worse

Solution 1 is ~10.33733184 times better on Estimate Rows, ~1.112345662 times better on Estimate IO, and ~3.298396596 times better on Estimate CPU.
April 7, 2006 at 10:17 am
The real 'elegance' in this thread is Edmundo's post.
Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."