Measuring Performance

  • How do you measure the performance of two queries? In an abstract way, how do you determine what works better or is the better solution? In case you missed it, that's the poll question and if you want to specifically answer, keep reading.

    I got this from someone recently as a quiz for prospective DBAs.


    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO
    CREATE TABLE [dbo].[quiz1]([name] [varchar](50) NOT NULL, [start_dt] [char](8) NOT NULL,
    [end_dt] [char](8) NOT NULL, [rownum] [int] IDENTITY(1,1) NOT NULL,
    CONSTRAINT [PK_quiz1] PRIMARY KEY CLUSTERED
    ( [name] ASC, [start_dt] ASC, [end_dt] ASC )
    ) ON [PRIMARY]
    GO
    SET ANSI_PADDING OFF
    go
    set nocount on
    insert into quiz1(name, start_dt, end_dt) values('Brian', '19961001', '19981230')
    insert into quiz1(name, start_dt, end_dt) values('Brian', '19981230', '19990605')
    insert into quiz1(name, start_dt, end_dt) values('Brian', '19990605', '19991002')
    insert into quiz1(name, start_dt, end_dt) values('Brian', '20000201', '20000301')
    insert into quiz1(name, start_dt, end_dt) values('Brian', '20000501', '99991231')
    insert into quiz1(name, start_dt, end_dt) values('Charles', '19910106', '19910731')
    insert into quiz1(name, start_dt, end_dt) values('Charles', '19910731', '19940201')
    insert into quiz1(name, start_dt, end_dt) values('Charles', '19940201', '19941021')
    insert into quiz1(name, start_dt, end_dt) values('Charles', '19941021', '19961031')
    insert into quiz1(name, start_dt, end_dt) values('Charles', '19990501', '20000331')
    insert into quiz1(name, start_dt, end_dt) values('Charles', '20000331', '99991231')
    insert into quiz1(name, start_dt, end_dt) values('John', '19980103', '19980727')
    insert into quiz1(name, start_dt, end_dt) values('John', '19980727', '20000103')
    insert into quiz1(name, start_dt, end_dt) values('John', '20000103', '20000601')
    insert into quiz1(name, start_dt, end_dt) values('John', '20000601', '99991231')
    set nocount off
    go

    It represents some employment periods. The employees left and came back and the question revolves around finding the time periods during which the person was unemployed. The images below show the areas you are trying to find.

    Now I know that most of you can solve this. There are a variety of ways, but I was interested in two solutions. One was submitted by a prospective DBA and one was mentioned by the company as being "better." Here are the two solutions, not saying which is which.


    -- Solution 1
    select a.name 'Emp_Name', a.end_dt 'StartDate', b.start_dt 'EndDate'
    from quiz1 a,
    quiz1 b
    where b.rownum = a.rownum +1
    and b.name = a.name
    and a.end_dt != b.start_dt
    go
    -- Solution 2
    select name, end_r as start_date, start_l as end_date
    from (select name,
    max(case cnt when 1 then start_dt end) as start_l,
    max(case cnt when 1 then end_dt end) as end_l,
    max(case cnt when 2 then start_dt end) as start_r,
    max(case cnt when 2 then end_dt end) as end_r
    from (select name, start_dt, end_dt, rownum AS rn
    from quiz1
    ) x cross join (select 1 as cnt union all select 2) y
    where y.cnt <= 2
    group by name, case cnt when 1 then rn else rn+1 end
    ) a
    where end_r <> start_l

    So which one do you think is better? How do you define better? It's an interesting question to me because I think the answer is ... and I hate this ... it depends.

    Steve Jones

  • OK, hows this?

    Note: Table creation modified to prove query does not require the rownum to be sequential or a column in the table.

    CREATE TABLE quiz1 (name varchar(50) NULL, start_dt datetime NULL, end_dt datetime NULL) ON 'PRIMARY'

    GO

    SET NOCOUNT ON

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19961001', '19981230')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19981230', '19990605')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '20000501', '99991231')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19910731', '19940201')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19940201', '19941021')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19990501', '20000331')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '20000331', '99991231')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '19980727', '20000103')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '20000601', '99991231')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '19990605', '19991002')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19941021', '19961031')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '19980103', '19980727')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Brian', '20000201', '20000301')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('Charles', '19910106', '19910731')

    INSERT INTO quiz1(name, start_dt, end_dt) VALUES('John', '20000103', '20000601')

    SET NOCOUNT OFF

    GO

    PRINT 'Solution 3'

    SELECT a.name AS Emp_Name, a.end_dt AS StartDate, b.start_dt AS EndDate

    FROM (SELECT COUNT(*) AS rownum, q1.name, q1.start_dt, q1.end_dt FROM quiz1 AS q1

       INNER JOIN quiz1 AS q2 ON q1.name = q2.name

        AND q1.start_dt > q2.start_dt

      GROUP BY q1.name, q1.start_dt, q1.end_dt) AS a

     INNER JOIN (SELECT COUNT(*) AS rownum, q1.name, q1.start_dt, q1.end_dt FROM quiz1 AS q1

       INNER JOIN quiz1 AS q2 ON q1.name = q2.name

        AND q1.start_dt > q2.start_dt

      GROUP BY q1.name, q1.start_dt, q1.end_dt) AS b

     ON a.rownum+1 = b.rownum

     AND a.name = b.name

     AND a.end_dt != b.start_dt

    GO

    Andy

  • Ok,

    My logic was that if there was a gap in employment, then the end date of the previous employment period must be less than the start of the next employment period. I ignored rownumber because although it is an identity column, you can't rely on the fact that the records were inserted in date order for all cases.

    Maybe I am being a bit simplistic and inefficient though, as I am using three copies of quiz1, maybe I could eliminate one by cunning use of GROUP BY and HAVING 

    anyhow I got the following...

    select q1.name,q1.end_dt 'From', q2.start_dt 'To'

    from quiz1 q1

    join quiz1 q2

    on (q1.name=q2.name

    and q1.end_dt < q2.start_dt

    and q2.start_dt = (select min(start_dt) from quiz1 where name=q1.name and start_dt>= q1.end_dt))

    order by q1.name,q1.end_dt

    David

    PS: Is that a new bike Steve?

    If it ain't broke, don't fix it...

  • Hard to answer the question in article... I don't like either of the two variants of solution. Generally, I would prefer Solution1 because of its simplicity, but in my opinion it is incorrect because it only works when entries are made in certain order. In reality, you can never rely on that.

    I'm not sure whether I haven't missed something, but it seems to me that the answer is as simple as

    SELECT a.[name], a.end_dt as unempl_start, MIN(b.start_dt) as unempl_end

    FROM quiz1 a

    JOIN quiz1 b ON b.[name]=a.[name] AND b.start_dt >= a.end_dt

    GROUP BY a.[name], a.end_dt

    HAVING MIN(b.start_dt) <> a.end_dt

    The only assumptions I have made is, that two employment periods are never concurrent (i.e. there can not be two start dates with no end date between them for the given employee). If that's not true, the query will give wrong results.

    EDIT : Only now I've looked at David's post and it seems we have used the same logic, it just happened that I managed to implement the GROUP BY and HAVING David is mentioning in his post.

  • In terms of the question posed, I'd have to say I prefer the first one.  I'm a big fan of keeping things as simple as possible, provided that you cover all possible datatable scenarios. In this case, (as Vladan mentions) this means that the solution needs to be modified to group the data so that the employment periods that you're trying to match are always sequential (ie., grouping primarily on name, and sub-grouping on start time).

    Given that the first solution doesn't cover the scenario of non-sequential data, it could be argued that the second solution is "better" - they both perform adequately on the datatable described, but the second one will continue to perform even if the data comes in a different order (a scenario that we could regard as likely to happen).  But I'd far rather see the simplicity of the first one, as described in Vladan's solution.

    (As an aside, note that if we assume that an employee cannot have concurrent periods of employment, we need only use one of start or end time, and they will result in the same grouping.  I have chosen start time to cover the possibility that the employee is currently employed, and thus his most recent period has a start but no end, which could result in incorrect order if we only group on end time).

    -----------------

    C8H10N4O2

  • Actually, looking at the datatable again, it seems to me that this data is a bit unrealistic, in that consecutive periods of employment start and end on the same day.  I would expect that one period would end the day before the next period started, eg "Feb 21 - Mar 17", "Mar 18 - May 25".

    In this (IMO, more realistic) scenario, we'd have to adjust all the queries to check instead for cases where b.start_dt was more than one day after a.end_dt.

    I realise it's just a hypothetical exercise for the purposes of a test, but If I were grading the responses, that observation would certainly get bonus marks...

    -----------------

    C8H10N4O2

  • Thanks Michael for catching the grouping, I forgot to clear my GROUP BY after I removed columns I had there for verification purposes. It really is enough to group on one date column, either start or date. Query edited. I used end date for grouping, otherwise I'd have to rewrite the query... in the supplied data, there is always end date '99991231' for those rows that show current employment (that was an assumption, too, but I forgot to mention it in my post) - but I have to agree that start date would probably be more reliable.

    As to the end and new start on the same date, well, that depends on how the periods are defined. I've seen it both ways in reality. There is even a possibility that the new employment could start on the same day OR the next day, and both would be considered uniterrupted employment... maybe even if the old period ends on Friday and new starts on Monday! However, in this respect I think that the definition was precise enough for test purposes.

  • Given the observation that the first solution relies on the ordering of the data, the second solution is the better one as it is more complete.

    But lets suppose that you don't have the luxury of modifying the underlying tables to arrive at a more elegant solution. Perhaps, for some obscure reason lost to the mists of time, the row ordering is required because other parts of the database or business solution depend on it. In that case, both solutions are correct, but that still leaves us with the original question.

    From a maintenance (or should I say a debugging at 2:30 am) perspective, I'd prefer solution one. Surgery is going to be easier in this case because the query is simpler.

    If the solutions are agreed to be accurate given the limitations of the original DDL, and the performance is roughly equivalent, then Occam's Razor applies and the first solution is better.

    ------------
    Buy the ticket, take the ride. -- Hunter S. Thompson

  • When there's a tradoff between simplicity and obvious function and performance perhaps with obscure functionality the choice can be determined by eventual use.

    If it is a rarely performed function, simplicity and and obviousness will be much more helpful especially if others need to be able to use the code as well.

    If it's in daily use, on big data sets then the opposite may be true.

    ...

    -- FORTRAN manual for Xerox Computers --

  • Very interesting queries were posted here

    Here is another one

     

    select * from

    (

    select n1.name,n1.end_dt start_unemp

     , (select min(n2.start_dt) from quiz1 n2 where N1.NAME=n2.name and n2.start_dt>n1.start_dt)  End_unemp

    from quiz1 n1

    ) kk

    where start_unemp<>end_unemp

  • This has been said already, and alternate examples given, but the first thing that jumped out at me is that both given solutions rely on the rownum column, and therefore rely on the data being ordered by name, start_dt.  I would say that neither solution is any good, then, and here's why.  I understand that this is a hypothetical example, but I'll even play along.  Given the same hypothetical example, do the initial inserts.  Now, pretend that Briand quits.  You'll have to update his end_dt, as follows:

    update dbo.quiz1

    set [end_dt] = '20060407'

    where [name] = 'Brian'

    and [end_dt] = '99991231'

    Now, suppose he comes back in a week:

    insert into quiz1(name, start_dt, end_dt) values('Brian', '20060414', '99991231')

    Neither of the given solutions will see this week-long unemployment period.  Therefore, I can't recommend either of them.  However, any of the solutions posted here that don't rely on rownum will work.

    If you're forcing me to choose one, though, I'd pick the first solution.  They're both flawed in that they rely on rownum, but one is very easy to read and one is not.  The only thing I would change is how the join is defined.  Microsoft recommends that the join conditions be defined in the from clause, not in the where clause.

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/acdata/ac_8_qd_09_66b7.asp

    I learned SQL the non-recommended way, not even 5 years ago, but quickly changed how I defined my joins once I saw the other way because it's so much easier to tell how tables are joined.  For joining 2 tables it's not bad, but when you join 15 tables, it's a lot more obvious which way is easier to read.  If you still define your join conditions in the where clause, I challenge you to try to learn the recommended way.  It doesn't really take long to figure it out and once you do, you'll never go back.  I promise...

  • Contrary to several above posts, both solutions in original post rely on ordering of the input data. 

    Given that, I think the Solution 1 is better just because it's simpler, more readable, and does not contain extraneous code.  Efficiency is a non-issue with this many rows.  I am curious to know which the company thought was better. 

    Solution 2 has extraneous code complicating it:  "(select name, start_dt, end_dt, rownum AS rn from quiz1)" can be replaced with "quiz1", replacing "rn" references with "rownun".  And "where  y.cnt <= 2" can be removed, serves no purpose.  Solution 2 also uses lousy column names for derived tables, not easy to figure out how this works or what author was trying to do.

    Only solution that I personally like here is one from Vladan; it is simple, understandable, and does not rely on ordering of the input data.

  • Solution 1 is visually more elegant.

  • Talking only within the given scenario ie without thinking about another better solution or table structure... Solution 1 is the best one... why?

    By using SET SHOWPLAN_ALL ON we can get statistics, numbers to compare:

    TotalAverage (the max value is the better)
    SolutionEstimate RowsEstimate IOEstimate CPUEstimate RowsEstimate IOEstimate CPU
    118.340.0439070010.0002378551.001.001.00 <--- This is the better
    2189.5866660.0488397620.0007845380.100.900.30 <--- This is the worst
    Estimate Rows

    It's ~

    10.33733184 times better
    Estimate IO

    It's ~

    1.112345662 times better
    Estimate CPU

    It's ~

    3.298396596 times better
  • The real 'elegance' in this thread is Edmundo's post.

    RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply