Putting a condition on a subgroup to query result

  • Hi JonFox,

    I attach the full query (.txt) and execution plan (.sqlplan) here. Thanks.

  • yingchai (11/28/2011)


    Hi JonFox,

    I attach the full query (.txt) and execution plan (.sqlplan) here. Thanks.

Thanks yingchai! I'll have to take a detailed look at this tonight after work, but at first glance, it looks like a lot of your performance issues are due to the multiple scans on the financial_datas table. I'll get back to you this evening after I've had a chance to look at the query. There's probably a way to fix this, either by rewriting the query or adding some indexes. I wouldn't jump to implement the index suggested by the query plan just yet though; let's see if the query itself can be tuned first. Also, when was the last time statistics were rebuilt for the indexes on the financial_datas table? The "estimated" vs "actual" rows returned from that table appear to be wildly different...
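If it turns out the stats are stale, refreshing them is a quick first step. A minimal sketch, assuming the table lives in the dbo schema:

-- Refresh statistics on financial_datas with a full scan...
UPDATE STATISTICS dbo.financial_datas WITH FULLSCAN;
-- ...or refresh sampled statistics database-wide
EXEC sp_updatestats;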

OK...been tearing into this quite a bit tonight, and I think a rewrite may be in order for the parts of the query that pull the actual financial information. There are probably one or two things that could be done to speed up the recursive function, but I don't think that's where your big bottleneck is anymore, so let's set that aside for now. You can see this in the actual execution plan: it shows six separate Clustered Index Scans against the financial_datas table, and if you take an even closer look, you'll see that one of them was executed 679 times and another was executed 393 times, so those two operators alone account for 1,072 scans of that table. How many rows are in the financial_datas table? Multiply that by 1,072, and that's how many rows SQL Server had to examine, from that one table alone, to satisfy this query! So, to start with, we need to figure out how to cut down the number of times we have to scan that table, and hopefully stop scanning it altogether and just grab what we want via an index SEEK operation or two. The things I would try in order to tune this, in descending order of probable impact:

    1.

The first big performance killer appears to be the structure of the interco_elim_BS_2 view. As currently written, it has to UNION together three different queries against the same tables, and each of those queries includes a self-join between the same tables...hence the six main branches in the execution plan that include the table scans on financial_datas. Worse, I think two of those queries are returning far too many rows: I notice that only the third query in that view limits the join between AR_Sum and AP_Sum with the month and year criteria (meaning it should really be an INNER JOIN instead of a LEFT OUTER JOIN, with the year and month comparisons added to the join criteria). That makes me think you should be joining by year and month in the other two queries as well; otherwise you're generating a result set that contains more rows than either of the tables in the join. In other words, if table a has rows for one organization for five different months, and table b has rows for the same organization for the same five months, but you only join on organization, you'll get back 25 rows (5 x 5) instead of the 5 you actually wanted. When you do this with two large tables, the result can be a disaster... (The first sketch after this list shows the join criteria I mean.)

    2.

Once you get the joins sorted out properly in the interco_elim_BS_2 view, see if you can replace the three UNIONed queries with a single query that uses a FULL OUTER JOIN between AR_Sum and AP_Sum, with COALESCE expressions generating the correct values of the AR_org, AP_org, AR_interco, AP_interco, AR_total, and AP_total columns. For example: COALESCE(a.total, 0.00) AS AP_total. If you can get this to work as a single query, you'll cut the reads on your main tables by roughly a third. (Again, see the first sketch after this list.)

    3.

Next, there's the way the organization names are being joined to your descendants function. In each of the *SUM views, you're concatenating the organization and org_name columns:

    organization + ' - ' + org_name AS org

    Then, in your final query, you're grabbing the first seven characters from the beginning of the column that you just concatenated in the views, and using that to join against the result set from your descendants function:

SELECT AR_org, AR_interco, AR_acc, AR_curr, AR_Month, AR_Year, AR_total,
       AP_org, AP_interco, AP_acc, AP_curr, AP_Month, AP_Year, AP_total
FROM interco_elim_BS_2
WHERE (SUBSTRING(AR_org, 1, 7) IN
        (SELECT DISTINCT organization
         FROM dbo.OrgelimBSDescendants('SUNWAY GROUP') AS OrgelimBSDescendants_2))
  AND (SUBSTRING(AP_org, 1, 7) IN
        (SELECT DISTINCT organization
         FROM dbo.OrgelimBSDescendants('SUNWAY GROUP') AS OrgelimBSDescendants_1))
  AND AR_Month = 'Jun' AND AR_Year = '2011'

By performing these string operations before the comparison, you force the query engine to first retrieve every row the view interco_elim_BS_2 can return (just as if you did a SELECT with no WHERE clause against that view), and then compare the results of the string operations to the data set returned by the function in order to weed out the rows that aren't wanted. Instead, you should include the unchanged organization column in your views and join on that, so that the optimizer can choose a plan that makes better use of any indexes on accounts.organization and organizations.organization. (See the second sketch after this list.)

    4.

After you've gotten all this sorted out, take a look at your indexes. See if you can leverage an existing index on the financial_datas table, or add a new one if necessary (take a look at the one recommended at the top of the execution plan as a starting point), and try to turn those SCAN operations into SEEKs. Then, take a look at the indexing on org_hrchy_details and consider adding an index on the "parent" and "hierarchy" columns (the third sketch after this list shows a hypothetical example). That's probably a much smaller table than financial_datas, so the impact won't be nearly as great, but every little bit helps! Still, I wouldn't recommend worrying about indexing until you've taken care of the first three issues.

    5.

    Consider ditching all the views, especially if you created them just for this query. They aren't really necessary, and I think they may be hurting you here by making some of the stuff going on less obvious. Plus, taken too far, lots of nested views can put unnecessary work on the query optimizer (and sometimes result in it choosing sub-optimal plans). Again, this probably isn't a significant performance issue for this query, but I'd still suggest avoiding them in order to keep it simple and clean.
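To make points 1 and 2 concrete, here's a rough sketch of a single-query version of the view. I don't have your actual view definitions, so every column name below is a guess based on your final query; treat it as an illustration of the FULL OUTER JOIN / COALESCE pattern rather than a drop-in replacement:

-- Hypothetical single-query rewrite of interco_elim_BS_2.
-- Column names are assumptions; adjust to the real AR_Sum/AP_Sum definitions.
SELECT COALESCE(ar.organization, ap.organization) AS AR_org,
       COALESCE(ap.organization, ar.organization) AS AP_org,
       COALESCE(ar.total, 0.00) AS AR_total,
       COALESCE(ap.total, 0.00) AS AP_total,
       COALESCE(ar.[Month], ap.[Month]) AS AR_Month,
       COALESCE(ar.[Year], ap.[Year]) AS AR_Year
FROM AR_Sum AS ar
FULL OUTER JOIN AP_Sum AS ap
    ON  ap.organization = ar.organization
    AND ap.interco = ar.interco
    AND ap.[Year] = ar.[Year]   -- join on year and month in EVERY branch,
    AND ap.[Month] = ar.[Month] -- not just one, to avoid the row explosion from point 1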
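For point 3, once the views pass through the raw organization column (AR_organization and AP_organization below are hypothetical names for those pass-through columns), the WHERE clause no longer needs SUBSTRING at all:

SELECT AR_org, AR_interco, AR_acc, AR_curr, AR_Month, AR_Year, AR_total,
       AP_org, AP_interco, AP_acc, AP_curr, AP_Month, AP_Year, AP_total
FROM interco_elim_BS_2
WHERE AR_organization IN  -- raw column, no string manipulation, so indexes stay usable
      (SELECT organization FROM dbo.OrgelimBSDescendants('SUNWAY GROUP'))
  AND AP_organization IN
      (SELECT organization FROM dbo.OrgelimBSDescendants('SUNWAY GROUP'))
  AND AR_Month = 'Jun' AND AR_Year = '2011'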
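And for point 4, the org_hrchy_details index could be as simple as the following. The index name and key order are just my guess, so check them against your actual schema and workload first:

-- Hypothetical supporting index for the recursive hierarchy lookups
CREATE NONCLUSTERED INDEX IX_org_hrchy_details_parent_hierarchy
    ON dbo.org_hrchy_details (parent, hierarchy);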

    ...

    Also, a quick observation about something else in your views: the TOP (100) PERCENT...ORDER BY structure isn't necessary, and is ignored by SQL Server 2008. Any ordering needs to be applied to the final result set, since VIEWs cannot be ORDERed. Here's a bit more info: http://blog.sqlauthority.com/2009/11/24/sql-server-interesting-observation-top-100-percent-and-order-by/
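For example, instead of putting TOP (100) PERCENT ... ORDER BY inside the view, define the view without it and sort at the point of use (the view and table names here are made up):

-- Define the view without TOP/ORDER BY...
CREATE VIEW dbo.v_example AS
SELECT org, total
FROM dbo.some_table;
GO
-- ...and apply the ordering to the final result set instead
SELECT org, total
FROM dbo.v_example
ORDER BY org;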

    Whew, that turned into a pretty long post! I hope this at least helps you get on the right track; let me know if you have specific questions about any of this, or if I just managed to thoroughly confuse you. 😀

  • Hi JonFox,

Thanks for the detailed explanation...I'll try to digest it and will probably modify the query.

Last night I decided to throw together some (simplified) test data and example code to illustrate some of the points in my previous post, and in the process realized that this can be simplified even further (and made even faster). You don't even need to join the AP and AR totals together to get what you need, so even my previous recommendation of a FULL OUTER JOIN is overkill. Instead, you can create conditional SUMs of the AR and AP amounts in one pass, essentially pivoting them out into separate columns with CASE logic.

    The example code below is inspired by yours, but since I don't have access to your actual table definitions and data, I've greatly simplified things by putting a few rows of test data into a single temp table to use as the base of the query. Still, I think it shouldn't be too difficult to figure out how to adapt this approach to meet your own needs:

-- If our testing temp table already exists, drop it first
IF OBJECT_ID('tempdb..#transactions') IS NOT NULL DROP TABLE #transactions;
GO

-- Create a new temp table to hold our test data.
-- NOTE: in your real data, this would actually be your account, financial_datas, etc. tables.
-- To make this example work with your data, you'll need to do a little work to plug in
-- your tables in an appropriate manner.
CREATE TABLE #transactions
(
    org VARCHAR(30) NOT NULL,
    account CHAR(5) NOT NULL,
    period DATE NOT NULL,
    amount DECIMAL(7,2) NOT NULL
);

-- Load some simplified sample data
INSERT INTO #transactions (org, account, period, amount)
SELECT 'SUNWAY GROUP', '2100D', '1/1/2011', 1000.25 UNION
SELECT 'SUNWAY GROUP', '2100D', '1/1/2011', 150.75 UNION
SELECT 'SUNWAY GROUP', '4100D', '1/1/2011', 200.50 UNION
SELECT 'SUNWAY GROUP', '2100D', '2/1/2011', 500.00 UNION
SELECT 'SUNWAY GROUP', '4100D', '4/1/2011', 22.50 UNION
SELECT 'ANOTHER ORG', '4100D', '4/1/2011', 212.50 UNION
SELECT 'ANOTHER ORG', '2100D', '4/1/2011', 15.30 UNION
SELECT 'ANOTHER ORG', '2100D', '5/5/2011', 11.10 UNION
SELECT 'SUNWAY GROUP', '2100D', '3/1/2011', 20.25;

-- *******************************************************************
-- Get the AR and AP totals by org and month with a single query.
-- This will only scan the underlying tables once!
-- (Or avoid scans altogether if the tables are indexed appropriately)
-- *******************************************************************
SELECT org,
       DATENAME(MONTH, periodStart) AS pmonth,
       DATEPART(YEAR, periodStart) AS pyear,
       SUM(AR_Amount) AS AR_Total,
       SUM(AP_Amount) AS AP_Total
FROM (
    -- This derived table makes things a bit cleaner by letting us do the periodStart
    -- date math only once. It won't cause multiple scans of the base tables, and it
    -- may not be necessary in your situation if you don't need to adjust "period" at all.
    SELECT org,
           -- Get the first day of the month for the period so that we can group on month
           DATEADD(MONTH, DATEDIFF(MONTH, 0, period), 0) AS periodStart,
           -- Pivot out the AR amounts based on account type
           CASE WHEN account LIKE '2%D' THEN amount ELSE 0.00 END AS AR_Amount,
           -- Pivot out the AP amounts based on account type
           CASE WHEN account LIKE '4%D' THEN amount ELSE 0.00 END AS AP_Amount
    FROM #transactions
    WHERE account LIKE '2%D' OR account LIKE '4%D' -- Make sure we're only working with AR and AP accounts
) AS drv
WHERE org IN ('SUNWAY GROUP', 'ANOTHER ORG') -- Replace this hardcoded list with a call to your recursive function.
GROUP BY org, periodStart;

Note that even the use of the derived table may not be necessary; I don't know what kind of dates your "period" column contains. If period is already set to the start date of the period (here I'm assuming the first of each month), then you won't need the DATEADD/DATEDIFF step, and there's no point in using a derived table; you can just SUM() the CASE expressions and apply your grouping in a single query, like this:
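-- A sketch of the single-query version, against the same #transactions test table,
-- assuming "period" already holds the first day of each month:
SELECT org,
       DATENAME(MONTH, period) AS pmonth,
       DATEPART(YEAR, period) AS pyear,
       SUM(CASE WHEN account LIKE '2%D' THEN amount ELSE 0.00 END) AS AR_Total,
       SUM(CASE WHEN account LIKE '4%D' THEN amount ELSE 0.00 END) AS AP_Total
FROM #transactions
WHERE (account LIKE '2%D' OR account LIKE '4%D') -- AR and AP accounts only
  AND org IN ('SUNWAY GROUP', 'ANOTHER ORG')     -- again, swap in your recursive function here
GROUP BY org, period;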

    Let me know if you have any questions about this approach. Hope it helps!

  • Yingchai, did you have any luck with this? Or are you still having performance problems?

  • Hi JonFox,

I'm putting this issue aside for now as I need to rush some other modules...I'll let you know once I find a workaround for this performance issue...

    TQ.

  • No problem. Did the last method I posted not solve the performance problems, or have you just not had a chance to test it yet?
