Distinct going slower than *

  • Derek Dongray (3/5/2009)

    Yes, I noticed just after I'd spent an hour perusing your plan and trying to work out what the view did. πŸ˜€

    First off, my sincerest apologies on that ..... should have modified that first post as soon as I posted the new code. Wasn't expecting a new face to join the fray πŸ˜€

    I've only glanced at your other plans. My first reaction is always to look for the high percentages. In all three cases thats the Clustered Index Scan on IREP_T_WF_INFO searching on the field WF_STOP for the date filter.

    If this is going to be a heavily used query, an index on that field might help. Of course, if there are a lot of inserts/deletes on that table you might badly affect performance elsewhere.

    That really is the problem and it's hard for me to judge the overall impact.

    The data in this database is summarized information from a very large application database we have. The sole purpose of the database in question is to provide users the ability to run reports while not affecting the performance of the application itself. The data is synced every hour so inserts/updates are happen in large chunks and are rather frequent. Still, the purpose of this data is to report on it and the WF_STOP column is a main search criteria.

    I'll add the index to that column and see what affect that has on both the searches and a sample of the mass insert/update and see if it's acceptable.

    Thank you very much for the suggestion ..... I spent so much time scrutinizing the JOINs that I overlooked the WHERE clause itself. Plus, the first select runs so quickly that it never popped into my head to question it.

  • Derek Dongray (3/5/2009)


    Actually, I vaguely recall hearing that before.

    These days I just avoid UDFs wherever possible, 'cos I know there's going to come back and haunt me later! πŸ˜€

    They're cool, so long as you put them right at the top of the food chain!

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • WF_STOP index is not helping either .....

    CREATE NONCLUSTERED INDEX [IX_IREP_T_WF_INFO_WF_STOP] ON [dbo].[IREP_T_WF_INFO]

    (

    [WF_STOP] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    GO

    CREATE NONCLUSTERED INDEX [IX_IREP_T_WF_INFO_WF_START] ON [dbo].[IREP_T_WF_INFO]

    (

    [WF_START] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    still 22 seconds on the group by.

  • GilaMonster (3/5/2009)


    Derek Dongray (3/5/2009)


    Also I'd be interested to see the source of the User-defined function FFG_FX_JUST_DATE since, although it doesn't add the the estimated cost

    Scalar functions never do. It's one of the problems with them. No matter what the function does, the call to it will appear as 0% cost in an execution plan. The cost estimate in an execution plan cannot be trusted for a user-defined function.

    Now, that's interesting Gail, any official MS evidence to back this up? Links will do.

    Thanks,

    Jan

    --------------------------------------------------------------------------
    A little knowledge is a dangerous thing (Alexander Pope)
    In order for us to help you as efficiently as possible, please read this before posting (courtesy of Jeff Moden)[/url]

  • Jan Van der Eecken (3/5/2009)


    Now, that's interesting Gail, any official MS evidence to back this up?

    No, just lots of experience reading exec plans. Tables accessed within a scalar function doesn't appear in the stats IO output either. Makes functions an absolute nightmare to tune, especially if one doesn't have direct access to the server.

    I do intend to write a blog post on it, and I do plan on logging it on Connect. Will either of those do?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Sure it will, appreciate it if you would. Have been looking into exec plans with CROSS APPLY from a physical table to a UDF returing a table today as well. One gets very weird plans. depending on the inputs.

    Any articles/blogs regarding UDF's and plans I am sure would be highly appreciated by the community.

    PS: How's the JoBurg User Group going?

    --------------------------------------------------------------------------
    A little knowledge is a dangerous thing (Alexander Pope)
    In order for us to help you as efficiently as possible, please read this before posting (courtesy of Jeff Moden)[/url]

  • Putts (3/5/2009)


    WF_STOP index is not helping either .....

    I has a suspicion it might not help as it was common to all queries. Did you check if the plan changed at all?

    The other quick suggestion is that it might help adding included leaf fields, the idea being to get the engine to read the information from the index rather than scanning the main tables, thus reducing I/O, but that might hit your update/insert jobs.

    CREATE NONCLUSTERED INDEX [IX_IREP_T_WF_INFO_WF_STOP] ON [dbo].[IREP_T_WF_INFO]

    (

    [WF_STOP] ASC

    )

    INCLUDE (

    [ID],

    [DOCTYPE_CD],

    [STATE_CODE],

    [WF_STATUS]

    )

    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,

    SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF,

    DROP_EXISTING = OFF, ONLINE = OFF,

    ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

    ON [PRIMARY]

    {Edit: Wrapped code so it fits on screen}

    I'll see if I can find time for another look later today.

    Derek

  • What is really eating at me (and why I originally decided this was a question for bigger SQL brains than myself to answer) was that I just don't understand how the original query runs so quickly but all attempts to limit the number or results increases the time 1000%.

    Most of the hints that people have offered have been basic "performance tuning" but I think the answer must lie outside that realm.

    I understand that the DISTINCT and GROUP By will take more time to process but, even with that being the case, when you're getting such fewer results back you still see a net gain in time from "execute query .... get results".

    I just can't wrap my head around what change I could make that is going to only help Group and Distinct certain columns. It' just so frustrating having to look my boss and testers in the eyes and say I have no idea why I can't get it to go any quicker πŸ™

  • I have been following this thread since it started. It is now 54 posts long and I am getting a little confused and I really don't feel like starting at the beginning and reading all the posts again to try an clear things up, so I have a question for the OP.

    Would you please summarize the problem again for us and you are free to base things off what you may have learned from this thread as well. I just need to get a handle again on what you are try to accomplish. Please include some examples of the code you are working with as well.

  • Putts (3/5/2009)


    Okay.

    Was able to get acceptable times on the original query I was working with but now I'm working on a different one and am running into the same problem.

    Following y'all's advice I have modified my DISTINCT to be a GROUP BY but that is not saving any time for the results to come back. Please note: the time I am worried about is the time it will take to process and return the results. I know it breaks the hearts of many DBAs out here but I am not concerned with the time it takes the database to process the actual request but how long it takes to get the data set.

    Just a point of argument, of course the total time is important, absolutely it is. But let's face it, you're seeing exactly why the DBA's get all frothy about the processing time on their end. It really can completely overwhelm other concerns.

    1. SELECT DGRP_ID,DGRP_TITLE FROM IREP_V_POSITION_STANDARDS WHERE WF_STOP BETWEEN '2/1/2009' AND '2/28/2009'

    This is showing a missing index:

    USE [RPT_SQL_CL]

    GO

    CREATE NONCLUSTERED INDEX [ ]

    ON [dbo].[IREP_T_WF_INFO] ([WF_STOP])

    INCLUDE ([ID],[WF_STATUS],[DOCTYPE_CD],[STATE_CODE])

    GO

    You might want to try that out. It's scanning every single index. That usually means you're indexes are bad or the query is bringing back too much data. Since you're returning 25,333 rows, i'd say it's the latter in this case.

    2. SELECT DGRP_ID,DGRP_TITLE FROM IREP_V_POSITION_STANDARDS WHERE WF_STOP BETWEEN '2/1/2009' AND '2/28/2009' GROUP BY DGRP_ID,DGRP_TITLE ORDER BY DGRP_ID,DGRP_TITLE

    This one also has a missing index

    USE [RPT_SQL_CL]

    GO

    CREATE NONCLUSTERED INDEX [ ]

    ON [dbo].[IREP_T_WF_INFO] ([WF_STOP])

    INCLUDE ([ID],[WF_STATUS],[DOCTYPE_CD],[STATE_CODE])

    GO

    It's probably the same on the last query. I think the issue is a bit more fundamental than an index though. I think the structure might not be properly supporting the data that you're trying to return. One possibility is to set up a pre-aggregate table that has the data restructured the way you need it for these queries. Another is to examine the fundamentals on the other structures to determine they're storing data appropriately in order for you to get it back out.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Lynn, if you read my post at the top of page 4 that will totally bring you up to speed with where I'm at right. Same issues as it's always been just a different query.

    Select runs quickly but a Select Distinct and Select Group By both run very slowly.

    Grant Fritchey (3/6/2009)

    It's probably the same on the last query. I think the issue is a bit more fundamental than an index though. I think the structure might not be properly supporting the data that you're trying to return. One possibility is to set up a pre-aggregate table that has the data restructured the way you need it for these queries. Another is to examine the fundamentals on the other structures to determine they're storing data appropriately in order for you to get it back out.

    Grant, I did create that missing index but, as you assumed, it did not help the performance.

    I would agree with your sentiment that there must be something awry with the structure of my tables or the definition of my view but I still can't figure out why it only affects Distinct and Group bys. I think if someone can explain that to me then I might have a starting place of where to work.

    I mean, if the structure is truly mucked up someplace then why does the Select run so quickly?

  • Putts (3/6/2009)


    Grant, I did create that missing index but, as you assumed, it did not help the performance.

    I would agree with your sentiment that there must be something awry with the structure of my tables or the definition of my view but I still can't figure out why it only affects Distinct and Group bys. I think if someone can explain that to me then I might have a starting place of where to work.

    I mean, if the structure is truly mucked up someplace then why does the Select run so quickly?

    You created the index with the include columns and it didn't help? Did it change the execution plan at all?

    I'm sorry it didn't work, but I'm not surprised.

    The thing about DISTINCT and GROUP BY is that they have to scan all of the data within the range provided by the WHERE clause. In the case of your view, it looks like the WHERE clause that you're providing doesn't filter on the tables where the aggregate values are coming from, so it's going through the entire table to arrive at the list for aggregation and then the filter is applied elsewhere. So it appears that you're not getting much, if any filtering and so the result set that has to be aggregated is very large. You need to reduce that set first, then the aggregation, if still needed, will run much faster.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • For the sake of prosperity ..... I have just modified how I'm doing this.

    Once you have all the indexes/statistics in place that you can possibly add, there's little more you can do in this situation to increase speed. If you're doing a group by or Distinct on tons of rows, the performance of returning the recordset is going to dip. As far as I can tell, there's no getting around this.

    If you need to up the performance then you will have to re-work your query. What I did for my query was I do an IN statement like this....

    SELECT DGRP.DGRP_ID,DGRP.DGRP_TITLE

    FROM

    IREP_T_DOCTYPE_GROUP DGRP

    WHERE DGRP_ID IN

    (

    SELECT DGRP_ID

    FROM

    (SELECT WSA.WS_INSTANCE_ID,WF_ID,WSA_ASSIGNED_NM,SD_STEP_ID,WSA_START_TS FROM [AS_SQL_CL].dbo.WF_STEP_ACTIVITY WSA WITH (NOLOCK) INNER JOIN [AS_SQL_CL].dbo.WF_ITEM_REL WIR WITH (NOLOCK) ON WSA.WS_INSTANCE_ID = WIR.WS_INSTANCE_ID WHERE WSA_STATUS_CD = 'S' AND WSA_STEP_IND = 'S' AND WR_STATUS_CD = 'A' GROUP BY WSA.WS_INSTANCE_ID,WF_ID,WSA_ASSIGNED_NM,SD_STEP_ID,WSA_START_TS) Steps

    INNER JOIN

    IREP_T_INBOXES Inboxes ON Steps.WSA_ASSIGNED_NM = Inboxes.WSA_ASSIGNED_NM

    INNER JOIN

    IREP_V_EMPLOYEES Users ON Users.INBOX_ID = Inboxes.ID

    LEFT OUTER JOIN

    IREP_T_WF_INFO Info ON Steps.WS_INSTANCE_ID = Info.WS_INSTANCE_ID

    LEFT OUTER JOIN

    IREP_T_DOCTYPE_GROUP_REL DGRL ON Info.DOCTYPE_CD = DGRL.DCTP_ID

    )

    GROUP BY DGRP.DGRP_ID,DGRP.DGRP_TITLE

    ORDER BY DGRP.DGRP_TITLE

    The end result is not any different than just lumping that first table into the large join and doing a GROUP BY or DISTINCT on the whole thing ..... but there's a huge difference on how it's put together. The "one large join" approach is going to return the tons of rows and then try to limit them down. The method of using the IN method just initially grabs the small(er) table you're trying to select from and, for each item in the table, begins a check against the join to see if it's in there and, once it finds the value once, it returns true and stops the seek against the join.

    I was reluctant to even try this initially because I was creating the query on the fly and didn't want to build exceptions into my code for this. My suggestion for anyone doing similar coding ....... force your code to use a stored proc. Then encapsulate your queries into the SProc. Then, if you need to do custom work like this on just a handful of queries, you can do that in their individual SProcs without having to touch your code.

    Hope that helps anyone that was in my predicament.

Viewing 13 posts - 46 through 57 (of 57 total)

You must be logged in to reply to this topic. Login to reply