Could this be done better?

  • erikd (11/25/2013)


    Lynn Pettis (11/25/2013)


    Five, it would be interesting to see how your solution works when adapted to use the actual data source.

    Ask and ye shall receive.

    The first query plan is using a table variable. It took 386ms.

    The second query skips the table variable and performs the query on the actual table. It took 26ms.

    Which is funny, because I read a lot about pivots being slowpokes, and tend to avoid them.

    Based on the forum you are posting in, I am assuming you are using SQL Server 2012. Perhaps they have made improvements to PIVOT. Always worth testing things like this when you upgrade. Also, it could be due to the small size of the PIVOT and data set. We get back to the It depends and test, test, and test again.

  • I notice it thinks you need a nonclustered index as well.

  • I suppose it's also partially because I've never successfully written a pivot query without cursing and writing it another way. 😛

    Also, I could make an index here, but these staging tables are in use for, at the very most, 6 months, but typically about a month. Ditto the "live" tables the data gets moved to. If I made an index on every table the query analyzer wanted, I'd probably have a very unhappy sys admin. There are ~1700 tables under this db. Heh.

  • > One, using a table variable isn't necessarily a good choice for testing. No matter how many rows of data exist, the optimizer will assume 1 row as there are no statistics

    > kept on a table variable.

    I did not suggest to use a table variable. That was a quick way for me to add the index I am talking about to support the ranking.

    As for the statistics, what you stated is true during compilation time as it is for any other variable. Things are different if we use OPTION (RECOMPILE).

    > Two, looking at the execution plans based on the small data set, both your pivot solution and my solution are identical.

    When I say "pivoting" it doesn't mean to use PIVOT operator literally, since what you are doing in your code is pivoting too (grouping, spreading and aggregating).

    > Three, based on the sample data and the code originally developed I am making an assumption that each of the 3 values for each country indicate small, medium, large. The

    > purpose of the DENSE_RANK function is to conform the data to the same set of values that represents small, medium, and large.

    I am doing the same assumption but I aggregate first and then identify each one of those three values. The difference could depend in the number of duplicates or ties we have by (country, quotacell).

    > Four, as far as indexing on the actual data source the OP may not have control over that as there may be other data values as well and the

    > existing index(es) may be needed for other purposes.

    I am just making the point that if you can have an index to support the ranking (at least partition by, order by, included columns), it will help the performance of the SELECt statement.

    > Five, it would be interesting to see how your solution works when adapted to use the actual data source.

    Hopefuly the OP will let us know.

    Anyway, I am not saying my suggestion is better than any other. It is up to the OP to test and choose the one that better adapt to his/her needs.

  • hunchback (11/25/2013)


    > One, using a table variable isn't necessarily a good choice for testing. No matter how many rows of data exist, the optimizer will assume 1 row as there are no statistics

    > kept on a table variable.

    I did not suggest to use a table variable. That was a quick way for me to add the index I am talking about to support the ranking.

    As for the statistics, what you stated is true during compilation time as it is for any other variable. Things are different if we use OPTION (RECOMPILE).

    > Two, looking at the execution plans based on the small data set, both your pivot solution and my solution are identical.

    When I say "pivoting" it doesn't mean to use PIVOT operator literally, since what you are doing in your code is pivoting too (grouping, spreading and aggregating).

    > Three, based on the sample data and the code originally developed I am making an assumption that each of the 3 values for each country indicate small, medium, large. The

    > purpose of the DENSE_RANK function is to conform the data to the same set of values that represents small, medium, and large.

    I am doing the same assumption but I aggregate first and then identify each one of those three values. The difference could depend in the number of duplicates or ties we have by (country, quotacell).

    > Four, as far as indexing on the actual data source the OP may not have control over that as there may be other data values as well and the

    > existing index(es) may be needed for other purposes.

    I am just making the point that if you can have an index to support the ranking (at least partition by, order by, included columns), it will help the performance of the SELECt statement.

    > Five, it would be interesting to see how your solution works when adapted to use the actual data source.

    Hopefuly the OP will let us know.

    Anyway, I am not saying my suggestion is better than any other. It is up to the OP to test and choose the one that better adapt to his/her needs.

    Actually, he already tried and we saw that your solution worked very well. We both have had experiences where PIVOT was actually slower and that wasn't the case here. We were wondering if it was because of the small size of the pivot or if perhaps there have been some changes to PIVOT from 2008 to 2012 that may also have sped it up.

    PIVOT hasn't always been my first thought either as I still have to use BOL every time I do use and when I put together my solution I did it quickly from my head and used what I readily knew.

    Good solution, I liked what I saw as I reviewed the plan. I may start looking at PIVOT a bit more on 2012 to see how it performs.

  • I'm going to keep both of your solutions on hand and keep testing as more data gets added. I'm also interested to see if pivot holds up. I'll post back here over the next couple weeks with results.

  • If there is any improvement in my suggestion for sure it is not because I used PIVOT operator. The same pivoting was done in Lynn's suggestion (grouping, spreading and aggregating). The difference could be due to the reduce in the amount of rows processed by the ranking because I am filtering and grouping first so the number of rows consumed by the ranking as the output of previous steps could be less.

    In the suggestion from Lynn we are ranking, let us say 1000 rows, and then we group / aggregate on the same amount of rows. In my suggestion, I could be aggregating the same 1000 rows but the output could be less so the ranking is done on less rows and that could make a difference.

    Last but not less is the presence of the clustered index (orc_country, orc_quotacell, sk) which help the optimizer from introducing a "sort" iterator because if we read the data using this index then there is no need to sort the data for the grouping neither for the ranking.

    Here is same suggestion but using the old fashion way (my favorite) of doing the pivoting. Notice that we are filtering (statusflag = 1), then grouping / aggregating and then enumerating the result (ranking). The last step is povoting the data to transform rows into columns.

    WITH C1 AS (

    SELECT

    orc_country,

    orc_quotacell,

    COUNT(*) AS cnt,

    ROW_NUMBER() OVER(PARTITION BY orc_country ORDER BY orc_quotacell) AS rn

    FROM

    @T

    WHERE

    statusflag = 1

    GROUP BY

    orc_country,

    orc_quotacell

    )

    SELECT

    orc_country,

    SUM(cnt) AS Total,

    SUM(CASE WHEN rn = 1 THEN cnt ELSE 0 END) AS Small,

    SUM(CASE WHEN rn = 2 THEN cnt ELSE 0 END) AS Medium,

    SUM(CASE WHEN rn = 3 THEN cnt ELSE 0 END) AS Large

    FROM

    C1

    GROUP BY

    orc_country;

    I would suggest this book if you are interested in learning more about the window functions.

    Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions

  • Howdy,

    As promised, here's the queries you wrote against full data. I decided to add the index suggested by the query plan for the query with a pivot. It ended up having an adverse impact on the execution time. It didn't change the execution time for any of the other queries.

    Lynn: 260ms

    Hunch Old Fashioned: 40ms

    Hunch Pivot: 36ms

    Hunch Pivot w/index: 63ms

    Let me know if I can provide anything else.

    Thanks again

Viewing 8 posts - 16 through 22 (of 22 total)

You must be logged in to reply to this topic. Login to reply