Improving the performance of a Data Mining proc

  • Hi,

    This issue is not really a problem - I'm more curious to know if there is any way to improve the performance of a stored proc which effectively mines data every morning.

    The situation is the daily rebuild of a reporting table, which is based from a transaction history table containing around 30,000,000 records. The sql to do this is as follows:

    SELECT * INTO reporting_table

    FROM transaction_history th1

    WHERE NOT EXISTS

    (SELECT *

    FROM transaction_history th2

    WHERE th1.field1 = th2.field1 AND

    th1.field2 = th2.field2 AND

    th1.field3 = th2.field3 AND

    th2.trans_date > th1.trans_date

    )

    As you can see, the table is referencing itself to get the last existing record based on the required key(fields). The process takes around 1hr 15min to get approx 1.7 million records.

    Is there a way to improve the performance with an index (index tuning wizard was no real help), or a change in the structure of the sql.

    I've already thought of changing the process to be a cumulative effect instead (finding only those records that haven't been selected already), but I was wondering if there was anything else that could enhance what is currently.

    thanks

  • What do you have indexed? Where is the clustered index? You could optimize some by creating a permanent table and just truncating it each time. Not sure how good 'not exists' is, nots are usually not great. Certainly changing to only process changed records might help. How about using a trigger to keep the reporting table current as you go? Do you really need to select * into the reporting table? How about just putting the primarykey? That would reduce disk io a lot. What kind of hardware do you have?

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • I've tried different indexes with little result. A cluster index isn't really going to help, considering that the history table is updated daily with 5000+ records.

    I don't particularly like "Not Exists", yet the only other way of trying to achieve the same ends is by using a cursor (yuck!).

    Unfortunately the table only has 7 columns and the primary key is a compound of 5 of them. Apart from that, I also need all fields for other purposes.

    Platform is NT 4.0, running on SQL 7 on twin PIII 550 with 780Mb RAM, SCSI equipped.

    I'm experimenting with a cumulative option, and will see how it goes....

  • If you put just the primary key into your table you'd just have to join back to get the other details. Would be interesting to see if it made much difference, you could just add an identity col to your table and index unique no nulls, treat it like a pkey.

    Suggest post the query asking for alternative solutions, pretty sure can be done without not exists without a cursor.

    Not sure I agree about clustered - then again, I havent tested your scenario!

    780m is not a lot. Hows the server holding up during the query? Could you post the query plan?

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • This might be a method to get rid of the NOT EXISTS ... (I think). Don't know about performance though...

    SELECT * INTO reporting_table

    FROM transaction_history th1

    WHERE th1.trans_date =

    (SELECT MAX(th2.trans_date) FROM

    transaction_history th2

    WHERE th1.field1 = th2.field1 AND

    th1.field2 = th2.field2 AND

    th1.field3 = th2.field3)

    Edited by - NPeeters on 08/16/2002 05:03:13 AM

  • cheers,

    gave the max date a whirl, but it was still going after 2 hours, so not much benefit there.

    I tried a cumulative option, but it's harder than I first thought, since every record added may be replacing one already there - ie. it's the last record occurence for that group, so adding records isn't the go.

    I'll post the question on T-sql forum and see what people suggest.

    thanks

  • A different thought.....

    Apprently there is not much can be done but if you follow this method performance will certainly improve......

    SP_PH_WHO2

    SELECT

    FIELD1 + FIELD3 + FIELD3 AS COMB_FIELDS,

    FIELD1,

    FIELD3,

    FIELD3,

    TRANS_DATE

    INTO ##TMP_TRANSACTION_HISTORY

    FROM TRANSACTION_HISTORY

    CREATE CLUSTERED INDEX P_##TMP_TRANSACTION_HISTORY ON ##TMP_TRANSACTION_HISTORY(COMB_FIELDS,TRANS_DATE) WITH PAD_INDEX,FILLFACTOR = 100,STATISTICS_NORECOMPUTE

    SELECT * INTO REPORTING_TABLE

    FROM ##TMP_TRANSACTION_HISTORY TH1

    WHERE NOT EXISTS

    (SELECT *

    FROM ##TMP_TRANSACTION_HISTORY TH2

    WHERE TH1.COMB_FIELDS = TH2.COMB_FIELDS AND

    TH2.TRANS_DATE > TH1.TRANS_DATE

    )

    DROP TABLE ##TMP_TRANSACTION_HISTORY

    This method will not take more than 30 minutes on 3 million records.

    Three things are here

    1. Adding theses columns will not create any problem as you want exactly theses three columns to be matched

    2. Move processing into tempdb as that is usually 0+1 drive and logging is much less there.

    3. Clustered index will make it very fast for comparing

    In REPORTING_TABLE table if you need some extra columns to be added add it in the first step while inserting data into ##tmp table.

    Cheers,

    Prakash

    Prakash Heda
    Lead DBA Team - www.sqlfeatures.com
    Video sessions on Performance Tuning and SQL 2012 HA

  • Note really sure you need the subquery as you have it.

    SELECT * INTO reporting_table

    FROM transaction_history th1

    WHERE th1.trans_date =

    (SELECT MAX(th2.trans_date) FROM

    transaction_history th2

    WHERE th1.field1 = th2.field1 AND

    th1.field2 = th2.field2 AND

    th1.field3 = th2.field3)

    or

    SELECT * INTO reporting_table

    FROM transaction_history th1

    WHERE NOT EXISTS

    (SELECT *

    FROM transaction_history th2

    WHERE th1.field1 = th2.field1 AND

    th1.field2 = th2.field2 AND

    th1.field3 = th2.field3 AND

    th2.trans_date > th1.trans_date

    )

    Looks to me like you want to get the MAX value of th1.trans_date but you are adding overhead by having to match values first.

    Try

    WHERE th1.trans_date = (SELECT MAX(th2.trans_date) FROM transaction_history th2)

    Now the other thing is SELECT INTO is notorious for locking on tables and can lead to other issues.

    Instead it is considered to do INSERT (INTO) instead. Now I assume the reporting_table is already built so you should be able to do

    INSERT reporting_table (collist)

    SELECT collist FROM transaction_history th1

    WHERE th1.trans_date = (SELECT MAX(th2.trans_date) FROM transaction_history th2)

    If this is slow then try do

    SET SHOWPLAN_TEXT ON

    GO

    INSERT reporting_table (collist)

    SELECT collist FROM transaction_history th1

    WHERE th1.trans_date = (SELECT MAX(th2.trans_date) FROM transaction_history th2)

    GO

    SET SHOWPLAN_TEXT OFF

    GO

    And post the execution plan here so we can look at what is happening and offer suggestions.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • Confused, can you explain what exactly are you trying?

    Finding newly added records or for each combination of (fld1 + fld2 + fld3) latest records........

    Cheers,

    Prakash

    Prakash Heda
    Lead DBA Team - www.sqlfeatures.com
    Video sessions on Performance Tuning and SQL 2012 HA

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply