Performance Problem when using partitioned view against actual table

  • Cory, my first recommendation is to avoid it if possible.

    The only reason we use it is that we have individual tables that can have well over 200 million records added per day, and the partitions make that a bit more bearable when we do maintenance work, ad hoc queries, backups, etc.

    I'm working under the assumption that you are familiar with partitioned views, the required constraints, etc. If not, let me know and I'll fill you in. Regular views don't give you the same advantages for what we're doing, so PVs are a requirement.

    The partitioned views are really for the source data we use to mart, as the marted tables are much more manageable in size, and it's always preferable to use a single table over a PV when size isn't a problem (or location, but DPVs are a different story). Since it's marting source data, we rarely need the older stuff, hence keeping only 14-20 days in individual day tables. The main reasons we save the source longer are in case something was marted improperly, or for the occasional ad hoc request that can't be derived from the marted tables.

    Our nightly process moves the daily source table to a database for the current month, with a date-appended table name and the necessary constraint. If it's a new month, it will create a new database for it. It also alters the partitioned view to include the new table (in actuality, it goes through all of the databases and recreates the view from scratch, but that's the concept).

    Once a week, the 7 daily tables that are over 14 days old are rolled up into a week table, and the daily ones are temporarily renamed (we delete them quickly, but this is just in case something went wrong and we need to revert quickly). The purpose of rolling them up into weeks is to let us keep a decent amount of history without exceeding the table quantity limits of views and queries. Obviously, the constraints on the weekly tables are changed to cover the week's date range instead of a single day.

    All of the above is fully automated, so unless the job fails for some reason, there isn't a lot to deal with. If something does go wrong during the job (usually malformed source data that wasn't expected, although we have processes to catch most of that and dump it to a table for manual intervention; bad programmer, no donuts), it's painful due to the huge volumes of data, but we've kept things in place to be able to revert, so it's tolerable.

    The nicest part is that we can just use the PVs for any ad hoc stuff, without worrying about finding out which week that date was in, or dealing with a 100 billion row table (Boss, I'm doing a table scan. I think I'll go ahead and take that 4 weeks of vacation I'd saved up).

    That should get you started (there's a bare-bones sketch of the pattern below), but I'll happily answer any questions I can.
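    For anyone unfamiliar with the mechanics, here is a minimal sketch of the pattern, reusing the table and view names from the test further down; the real tables have many more columns:

        -- Each member table carries a CHECK constraint on the partitioning
        -- column; that constraint is what lets the optimizer skip member
        -- tables whose range can't match the query.
        CREATE TABLE event_logdata_20070402 (
            eventpk   BIGINT NOT NULL,
            ymd_id    INT    NOT NULL CHECK (ymd_id = 20070402),
            eventtype INT    NOT NULL,
            CONSTRAINT PK_event_logdata_20070402 PRIMARY KEY (ymd_id, eventpk)
        )

        CREATE TABLE event_logdata_20070403 (
            eventpk   BIGINT NOT NULL,
            ymd_id    INT    NOT NULL CHECK (ymd_id = 20070403),
            eventtype INT    NOT NULL,
            CONSTRAINT PK_event_logdata_20070403 PRIMARY KEY (ymd_id, eventpk)
        )
        GO

        -- The partitioned view is nothing but a UNION ALL over the members;
        -- the nightly job drops and recreates it as tables come and go.
        CREATE VIEW v_event_logdata_all
        AS
        SELECT eventpk, ymd_id, eventtype FROM event_logdata_20070402
        UNION ALL
        SELECT eventpk, ymd_id, eventtype FROM event_logdata_20070403
        GO

    A weekly rollup table is the same idea with a range constraint instead, e.g. CHECK (ymd_id BETWEEN 20070402 AND 20070408).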

  • Just to add to what David said, I'm using it because I want to be able to do filegroup backups, since I know that it's only the most recent month that is likely to change, not the table data from 6 months ago.

    It's also good for maintenance, as you can drop filegroups to free up space, etc.
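    For the curious, the sort of thing I mean looks roughly like this (database and filegroup names are invented for the example):

        -- Only the filegroup holding the current month's tables needs a
        -- frequent backup; completed months never change.
        BACKUP DATABASE ProfileDB
            FILEGROUP = 'FG_CURRENT_MONTH'
            TO DISK = 'E:\Backups\ProfileDB_FG_CURRENT_MONTH.bak'

        -- Older months can be locked down once they're finished
        -- (READONLY on SQL Server 2000/2005; READ_ONLY on later versions).
        ALTER DATABASE ProfileDB MODIFY FILEGROUP FG_2006_09 READONLY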

    In my case, I'm experiencing the performance problems I detailed above and still can't find a solution.



    Kindest Regards,

    John Burchel (Trainee Developer)

  • John, now I'm confused again. Your tables are constrained by month as near as I can tell, yet your latest query crosses months, so even a well-functioning PV would use two tables. You said you were comparing this to running against a single table, so I am lost. Oh, and you still have that GROUP BY in there, which shouldn't even allow that query to run, so I'm doubly lost. I attempted to write a query as close as possible to yours using a smaller partitioned view.

    -- Partitioned View
    SELECT
        ymd_id
        ,eventpk
        ,sessionid
        ,machineid
        ,syndicationid
    FROM
        v_event_logdata_all
    WHERE
        eventtype = 117
        AND ymd_id BETWEEN 20070402 AND 20070406
    ORDER BY
        syndicationid DESC
        ,sessionid DESC

    -- Derived Table
    SELECT
        ymd_id
        ,eventpk
        ,sessionid
        ,machineid
        ,syndicationid
    FROM
        (SELECT eventpk, ymd_id, eventtype, eventtime, visitorid, syndicationid, sessionid, data1, machineid, clientid, inserttime
         FROM event_logdata_20070406 UNION ALL
         SELECT eventpk, ymd_id, eventtype, eventtime, visitorid, syndicationid, sessionid, data1, machineid, clientid, inserttime
         FROM event_logdata_20070405 UNION ALL
         SELECT eventpk, ymd_id, eventtype, eventtime, visitorid, syndicationid, sessionid, data1, machineid, clientid, inserttime
         FROM event_logdata_20070404 UNION ALL
         SELECT eventpk, ymd_id, eventtype, eventtime, visitorid, syndicationid, sessionid, data1, machineid, clientid, inserttime
         FROM event_logdata_20070403 UNION ALL
         SELECT eventpk, ymd_id, eventtype, eventtime, visitorid, syndicationid, sessionid, data1, machineid, clientid, inserttime
         FROM event_logdata_20070402) AS a
    WHERE
        eventtype = 117
        AND ymd_id BETWEEN 20070402 AND 20070406
    ORDER BY
        syndicationid DESC
        ,sessionid DESC

    -- With PV, averaged about 20 seconds
    -- With Derived table, averaged about 20 seconds
    -- "SELECT COUNT(*) FROM v_event_logdata_all" took over 8 minutes to run
    -- Rows covered by Partitioned View: 849,561,678
    -- Rows in Table20070402: 4,509,135
    -- Rows in Table20070403: 4,431,305
    -- Rows in Table20070404: 4,315,899
    -- Rows in Table20070405: 3,985,089
    -- Rows in Table20070406: 3,985,089

    You can see that the PV only touched the tables that were required. The constraint is on our ymd_id column, and that column is also indexed. The other column in the WHERE clause, as well as the two columns in the ORDER BY, have no indexes on them at all. Additionally, looking at the execution plans, there were only table scans on the tables in question, even for the PV code.

    While I'd still like to see the actual code you ran (remember, that code you posted would just error out because of the GROUP BY), my guess is that the issue is one of the following:

    1. One or more of the tables do not have constraints, or have invalid constraints (overlapping ranges, etc.).

    2. The two queries you are running have some difference (could be subtle, or even a typo) other than that one runs against the base tables while the other runs against the PV.

    3. There is something weird in how SQL Server handles constraints when you have a datetime range, as you do. That's actually the main difference I see between our setups: while I have datetime columns, the constraint is on an integer column that links to a Time Dimension table, so every row in a table has the exact same value as the constraint.

    I'd look closely at your execution plans and see if anything odd is there. Something should be.
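    If you want to rule out the constraint problem quickly, something like this lists each member table's CHECK constraint so overlaps or gaps stand out (sys.check_constraints needs SQL Server 2005 or later; on 2000, running sp_helpconstraint against each table gives the same information):

        -- One row per CHECK constraint on the member tables, with the
        -- constraint text; adjust the LIKE pattern to your naming scheme.
        SELECT  OBJECT_NAME(parent_object_id) AS table_name,
                name                          AS constraint_name,
                definition
        FROM    sys.check_constraints
        WHERE   OBJECT_NAME(parent_object_id) LIKE 'event_logdata_%'
        ORDER BY table_name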


  • Sorry David, this is the query:

    SELECT PERSONURN, ADDRESSURN, PROFILEDATE, CLIENTNO, STATUS
    FROM VwQuery
    WHERE LISTID = 'L01'
    AND PROFILEDATE BETWEEN '2007-02-01 23:59:59.000' AND '2007-03-19 20:11:23.000'

    The query spans two months, so I had expected the query plan to search two months for the records, which it rightly does.

    The problem arises when the line below is added:

    ORDER BY PROFILEDATE DESC, CLIENTNO DESC

    as in:

    SELECT PERSONURN, ADDRESSURN, PROFILEDATE, CLIENTNO, STATUS
    FROM VwQuery
    WHERE LISTID = 'L01'
    AND PROFILEDATE BETWEEN '2007-02-01 23:59:59.000' AND '2007-03-19 20:11:23.000'
    ORDER BY PROFILEDATE DESC, CLIENTNO DESC

    The execution plan shows a Sort operator (the ORDER BY) taking 90% of the query cost. That is the problem, and I'm looking for a way to tune it.

    Hope this explains all
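    For completeness, the standard fix for a sort like this is an index whose key order matches the ORDER BY, created on each member table behind the view. A sketch only: the member-table name is invented, and whether the optimizer can merge the ordered streams from two member tables without a residual sort depends on the plan it picks, so this needs testing against real volumes:

        -- Key order matches the WHERE and ORDER BY, so qualifying rows can
        -- be read pre-sorted instead of being sorted after the fact.
        -- On SQL Server 2005+, INCLUDE (PERSONURN, ADDRESSURN, STATUS)
        -- would also make the index covering.
        CREATE NONCLUSTERED INDEX IX_ProfileData_List_Date_Client
            ON ProfileData_200702 (LISTID, PROFILEDATE DESC, CLIENTNO DESC)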


    Kindest Regards,

    John Burchel (Trainee Developer)

  • David,

    To also add, please check your PM's


    Kindest Regards,

    John Burchel (Trainee Developer)

  • Can you do one quick favor? Drop the CLIENTNO DESC from the ORDER BY clause in the query that you're running against the view and report back. This is very odd. I'm sort of hesitant to take this to PMs at this time, as this might be educational for all of us, including myself.

  • Hi

    If I take off the ORDER BY ... DESC on both columns, or change it to ASC, then it runs really, really quickly.

    But users are most interested in the latest profile date, so the descending order is really important.



    Kindest Regards,

    John Burchel (Trainee Developer)

  • Nope... you should almost never include an ORDER BY in a view (there are exceptions, but this isn't one of them)... if the users want ordered data, they should include it in a query (why the heck are you giving mere users direct access to the database?). As you've just found out, it is "death by SQL" to include ORDER BY in a view (especially DESC) for the mere convenience of users. Write a stored proc that will read from the view and order it in descending order. You can even include a parameter to pass the number of rows desired, with a hardcoded limit of, say, 1000 rows or some such.
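    Something along these lines is the idea (a sketch only; the procedure and parameter names are illustrative, and TOP with a variable needs SQL Server 2005 or later; on 2000, use SET ROWCOUNT instead):

        CREATE PROCEDURE dbo.GetLatestProfiles
            @ListID    CHAR(3),
            @StartDate DATETIME,
            @EndDate   DATETIME,
            @RowCount  INT = 100
        AS
        BEGIN
            SET NOCOUNT ON

            -- Hardcoded ceiling so nobody can ask for the whole table.
            IF @RowCount > 1000 SET @RowCount = 1000

            SELECT TOP (@RowCount)
                   PERSONURN, ADDRESSURN, PROFILEDATE, CLIENTNO, STATUS
            FROM   VwQuery
            WHERE  LISTID = @ListID
              AND  PROFILEDATE BETWEEN @StartDate AND @EndDate
            ORDER BY PROFILEDATE DESC, CLIENTNO DESC
        END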

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • That is exactly what I have (i.e. a stored procedure, not code given directly to users; I only posted the code here so that we can all see what's running in the background). I have a stored procedure that takes in date parameters, e.g. profile start and end dates, and also a rowcount parameter to limit the number of rows returned at any one time. The problem is that even with the rowcount set to 100 and the date ordered DESC, there is a performance problem (it takes ages).



    Kindest Regards,

    John Burchel (Trainee Developer)

  • Just out of curiosity, is there any other way of ordering data descending without specifying DESC in the code?


    Kindest Regards,

    John Burchel (Trainee Developer)
