improve the query performance

  • dlam 18073 (1/20/2011)


    DateCall >=@callFrom and DateCall < @callTo <<< does this part of code already set the range of dateCall

    without

    not exists

    (

    select 1

    FROM tblCallLog

    WHERE phNu=r1.phNu

    AND DateCall > @callTo

    )

    also it select all of the max date call with in the range over all of the max date call

    for example:

    test data

    402111, 2010-01-10 15:15:00.000, n, 32226001, b smith

    402222, 2010-03-21 15:16:00.000, n, 32226001, b smith

    402333, 2010-10-19 15:17:00.000, n, 32226001, b smith

    402444, 2010-11-21 16:01:00.000, n, 32226001, b smith

    with this query

    select *

    from tblCallLog r

    Where DateCall =

    (

    select max(DateCall)

    from tblCall r1

    where r1.phNu=r.phNu

    and logStatus in (@a, @b-2, @b2,@c,@d,@d2,@d3,@n,@n2,@n3,@n4,@r,@r2,@s,@s2,@u,@w,@n5)

    and DateCall >=@callFrom and DateCall < @callTo

    group by phNu

    )

    call date range

    DateCall >='2010-10-01' and DateCall < '2010-11-30 23:59'

    the outcome is >> 402444, 2010-11-21 16:01:00.000, n, 32226001, b smith

    but is there any different between puting the where clause in the sub-query and out side the sub-query

    I am not sure I fully understand what you are trying to tell me here, but I can tell you that in your example you did not include any data which the clause you removed was designed to handle since the latest call in your example also fell within the date range you set -- what does the query return if you add a call for that same number but with the call taking place in the month of December 2010? Without the "not exists" clause, it should continue to return the same record it is returning now, but as I understand your stated requirements, it should not return anything. If you add back in the "not exists" clause that should change it so it won't return anything.

    As for the question of putting the where clause in the subquery or outside of the subquery, I am not sure how to answer as the code I gave you had two subqueries and three where clauses so I don't know which where clause you are talking about relocating nor to where you are contemplating relocating it. Your general question, however, seems to be due to a lack of understanding of the function of the "not exists" subquery. Its purpose is to eliminate any numbers which have calls which were made after the upper bound of your range. Without it, all of the calls reported will be within the range you have set, but all of the calls reported will not necessarily be the latest calls for the respective number. You should be able to prove that to yourself by adding in the additional data as I indicated above. In your original query, you accomplished this goal by first finding the latest call for every number and then looking only at those to see which ones were in range -- but you wanted to improve the performance. I tried to improve the performance by not looking at every single number to find the latest call but, instead, finding the latest of the calls within the stated range and then eliminating any of them for which the related number has calls later than the stated range. To do the elimination, you need the "not exists" clause.

    - Les

  • I understand your query now, I know its abit late to say that.

    Related question about the code below

    select *

    from tblCallLog r

    Where DateCall =

    (

    select max(DateCall)

    from tblCall r1

    where r1.phNu=r.phNu

    and logStatus in (@a, @b-2, @b2,@c,@d,@d2,@d3,@n,@n2,@n3,@n4,@r,

    @R2,@s,@s2,@u,@w,@n5)

    and DateCall >=@callFrom and DateCall < @callTo

    and not exists

    (

    select 1

    FROM tblCallLog

    WHERE phNu=r1.phNu

    AND DateCall > @callTo

    )

    group by phNu

    )

    my query below

    select *

    from tblCallLog r

    where DateCall=

    (select max(DateCall)

    from tblCallLog

    where phNu=r.phNu

    group by phNu) and logStatus in(@a, @b-2, @b2,@c,@d,@d2,@d3,@n,@n2,@n3,@n4,

    @r,@r2,@s,@s2,@u,@w,@n5)

    and DateCall >= @callFrom and DateCall < @callTo

    order by DateCall

    why those where conditions put in the subquery is more efficient,

    both query also checks the same table more than 1 time,

    yes, I asked the similar question on my previous post, but still not get it

  • dlam 18073 (1/23/2011)


    I understand your query now, I know its abit late to say that.

    Related question about the code below

    select *

    from tblCallLog r

    Where DateCall =

    (

    select max(DateCall)

    from tblCall r1

    where r1.phNu=r.phNu

    and logStatus in (@a, @b-2, @b2,@c,@d,@d2,@d3,@n,@n2,@n3,@n4,@r,

    @R2,@s,@s2,@u,@w,@n5)

    and DateCall >=@callFrom and DateCall < @callTo

    and not exists

    (

    select 1

    FROM tblCallLog

    WHERE phNu=r1.phNu

    AND DateCall > @callTo

    )

    group by phNu

    )

    my query below

    select *

    from tblCallLog r

    where DateCall=

    (select max(DateCall)

    from tblCallLog

    where phNu=r.phNu

    group by phNu) and logStatus in(@a, @b-2, @b2,@c,@d,@d2,@d3,@n,@n2,@n3,@n4,

    @r,@r2,@s,@s2,@u,@w,@n5)

    and DateCall >= @callFrom and DateCall < @callTo

    order by DateCall

    why those where conditions put in the subquery is more efficient,

    both query also checks the same table more than 1 time,

    yes, I asked the similar question on my previous post, but still not get it

    First of all, I can't say definitively that my query is any more efficient than yours -- even if I had your table definitions, index definitions, test data, etc., the query optimizer on my server might compile the queries differently than yours does, though if we had everything the same, similar software releases, etc., they likely would be pretty similar. But from an intuitive view of what tasks need to be done in the different cases, I would expect it to be somewhat more efficient. The reason for that is that, your query explicitly indicates that the latest call for all numbers should be determined, and then that list of calls should be filtered to eliminate any which don't have the correct status or for which the call is not within the range of dates specified. Let's say that the database has 2,000,000,000 calls and 5,000,000 phone numbers with calls -- that means that there are 5,000,000 latest calls to look at, but most likely, very few of those calls meet your other criteria. For the moment, let's ignore the call status since it is pretty much the same in both queries, and just say that:

    • all of the 2,000,000,000 calls have the correct status;
    • 300 numbers have 1000 calls in the time period specified;
    • but only 50 of them are the latest call in that time period (which would mean that 250 of the numbers had their latest call after the time period);

    That means that, if your query is done exactly as written, SQL Server is going to look at 2,000,000,000 calls, determine the 5,000,000 latest calls, and then pick the 50 of those which are within the range.

    In my query, if done pretty much as written, it would look at 2,000,000,000 calls, find the 1000 which are within the range, then for each of the 300 numbers check to see if they have a single call later than the time period (it doesn't have to be the latest call for the number -- just one call which is later than the time period), eliminating 250 numbers, and then for the remaining 50, find the latest call for each number from within 1000 calls in the time period. (Actually, it doesn't even have to look at 1000 -- it only has to look at calls for the 50 numbers involved which given the numbers suggests that we would be looking at fewer than 200 calls)

    Even with the right index, finding the latest call for a number from within a group will likely involve a few comparisons per number -- doing so for 50 numbers and looking over a limited range which includes 1000 calls is apt to be considerably quicker than doing so for 5,000,000 numbers over all times possible involving 2,000,000,000 calls. With the right index, finding the calls within a range should be relatively quick, as will determining if there are any calls outside of the range. So, pretty much the most expensive thing here is finding the latest call but you start off doing that for every number in the database even though you aren't interested in the vast majority of them.

    The big unknown here is: "how does the query optimizer see the problem?" It might be possible that the query optimizer has sufficient logic to look at your query and break down the requirements in the same way I did and conclude that it could do the job by approaching it the same way I did in writing my query, in which case, the execution plans would likely be the same -- but I seriously doubt that it could actually do that. So, my guess is that my approach will produce a more efficient execution plan (presuming, of course, that you also have reasonable indexes defined on the tables so that it doesn't wind up in either case just looping through all 2,000,000,000 calls over and over). Only you can verify whether that is true, however -- it's your database.

    - Les

  • The thing I didnt do at the beginning, was test the query in a very large database, was large enough I thought, then the process timeout in the application. This time, I will put it into a large enough database to test every query that I got.

    Thank so much to you to answer the question in detail, and give me many ideas in how to see the problem.:-)

  • dlam 18073 (1/24/2011)


    The thing I didnt do at the beginning, was test the query in a very large database, was large enough I thought, then the process timeout in the application. This time, I will put it into a large enough database to test every query that I got.

    Thank so much to you to answer the question in detail, and give me many ideas in how to see the problem.:-)

    You're welcome. I hope it proves helpful.

    - Les

Viewing 5 posts - 16 through 19 (of 19 total)

You must be logged in to reply to this topic. Login to reply