Self join performance improvement

  • I have a stored procedure with a self join that after 48 hours still does not complete.  I would appreciate any helpful tips on how to improve performance.

     

    I am working with end of day stock market data.  I want to know when a stock’s price has doubled within two months.  I want the relevant data associated with this doubling of price saved to another table. 

     

    The source table, Fact_Asset_EOD, contains end of day stock market data. It contains over 8,000,000 records. The structure of the data is:

    Stock_EOD_sk (int)

    Stock_sk (int)

    Trading_Date (smalldatetime)

    Open_Price (money)

    High_Price (money)

    Low_Price (money)

    Close_Price (money)

     

    Stock_EOD_sk is the primary key and identity = yes

    There is a clustered key on Stock_EOD_sk and a Unique Index on Stock_sk and Trading_Date_sk.

     

    Each stock is represented by an integer in Stock_sk (the stock symbol and name are stored in another table) and each trading date for the stock is represented by Trading_Date.

     

    The following stored procedure finds all stocks and all possible trading dates where the price has doubled within 2 months.

     

    select

                pole_bottom.Stock_sk

                , pole_bottom.Trading_Date as 'pole_bottom_Date'

                , pole_bottom.Low_Price as 'pole_bottom_Price'

                , pole_top.Trading_Date as 'pole_top_Date'

                , pole_top.High_Price as 'pole_top_Price'

    into Summary_Flag

    from dbo.Fact_Asset_EOD pole_bottom

                inner join dbo.Fact_Asset_EOD pole_top

                on pole_bottom.Stock_sk = pole_top.Stock_sk

    where

                pole_top.High_Price >= 2 * pole_bottom.Low_Price

                and

                pole_bottom.Trading_Date between

                            dateadd(m, -2, pole_top.Trading_Date)

                            and pole_top.Trading_Date

     

    Any suggestions on how to improve performance?

     

    Cheers

     

    banocto

  • Which indexes are currently used by your query (view the queryanalyzer)?

    Does the index wizard gives any hints?

     

    select

                pole_bottom.Stock_sk

                , pole_bottom.Trading_Date as 'pole_bottom_Date'

                , pole_bottom.Low_Price as 'pole_bottom_Price'

                , pole_top.Trading_Date as 'pole_top_Date'

                , pole_top.High_Price as 'pole_top_Price'

    into Summary_Flag

    from dbo.Fact_Asset_EOD pole_bottom

                inner join dbo.Fact_Asset_EOD pole_top

                on pole_bottom.Stock_sk = pole_top.Stock_sk

                and 

                pole_top.High_Price >= ( pole_bottom.Low_Price + pole_bottom.Low_Price)

                and   pole_bottom.Trading_Date between  dateadd(m, -2, pole_top.Trading_Date)

                            and pole_top.Trading_Date

  • The SELECT INTO is one problem, this is much slower than CREATE TABLE and INSERT INTO SELECT.

    Try:

    -- CREATE TABLE Summary_Flag with no rows

    SELECT pole_bottom.Stock_sk

            , pole_bottom.Trading_Date as 'pole_bottom_Date'

            , pole_bottom.Low_Price as 'pole_bottom_Price'

            , pole_top.Trading_Date as 'pole_top_Date'

            , pole_top.High_Price as 'pole_top_Price'

    INTO Summary_Flag

    FROM dbo.Fact_Asset_EOD pole_bottom

            INNER JOIN dbo.Fact_Asset_EOD pole_top

                ON pole_bottom.Stock_sk = pole_top.Stock_sk

    WHERE pole_bottom.Stock_EOD_sk = 0

    -- INSERT the data into Summary_Flag

    INSERT INTO Summary_Flag (Stock_sk,pole_bottom_Date,pole_bottom_Price,pole_top_Date,pole_top_Price)

    SELECT pole_bottom.Stock_sk

            , pole_bottom.Trading_Date as 'pole_bottom_Date'

            , pole_bottom.Low_Price as 'pole_bottom_Price'

            , pole_top.Trading_Date as 'pole_top_Date'

            , pole_top.High_Price as 'pole_top_Price'

    FROM dbo.Fact_Asset_EOD pole_bottom

            INNER JOIN dbo.Fact_Asset_EOD pole_top

                ON pole_bottom.Stock_sk = pole_top.Stock_sk

                    AND (pole_bottom.Low_Price+pole_bottom.Low_Price) <= pole_top.High_Price

    WHERE pole_bottom.Trading_Date BETWEEN DATEADD(m,-2,pole_top.Trading_Date)

            AND pole_top.Trading_Date

    Andy

  • [snip]

    Posting errored, so tried again and see two copies!

    Andy

  • I would estimate that you would see better performance by doing the following:

     

    Create computed columns on the table with the folowing:

    Price * 2

    and

    date + 2 months

    Then place indexes on those computed columns.  This will "materialize" the calculations, perfoming them ONCE for each of the 8,000,000 rows.  (This will also be done ahead of time)

    Rewrite the query so that the comparisons are performed against the pre-computed columns.  It should run much faster. 

    Currently, the execution is most likely performing the price and date calculations one time for each possible pair of columns in the self joined table, which would be approximately

    (x^2)/y where x is 8,000,000 and y is the number of distinct stock ids.

    So, if you have 8,000 data points for each of 1,000 stocks, you will perform the calculcations 64 billion times.

     

     

  • Thanks Jo, Andy & Jeff.  I’ve been out of action … hence my delayed response. 

     

    I have improved the performance significantly by implementing each of your suggestions and I've learnt a fair bit more about Execution Plans etc.

     

    Thanks

     

    BanOcto

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply