Should I convert my non-clustered index into a clustered index?

  • Hi guys,

    I am working on a bit of a hybrid system - neither OLTP nor OLAP. It is basically a staging database that converts Unix flat files into a dataset for insertion into another SQL database.

    I have a table with 110M rows and about 40 columns, mainly decimal(10,2). Four times a week we read in 3.8M rows for a new update_date and clear out 3.7M rows for the update_date that was 45 days ago (the dataset is growing by about 60K rows per month).

    The table currently has three non-clustered indexes:

    IDX1: Acc_no
    IDX2: Update_date
    IDX3: Acc_no, Update_date (unique)

    I am running a query (I won't bore you with the details) that compares the records for one update_date with the records for the previous update_date and writes into another table the records that are different (an inequality comparison on the other 38 fields) or that did not exist in the previous day's data. The idea is that this other table permanently holds records which are significant (i.e. different from the previous run).
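    As a sketch of the kind of comparison described - not the actual query, and with made-up column names - an EXCEPT between the two days compares every selected column in one pass:

    ```sql
    -- Rows present for today's update_date that are missing from,
    -- or different to, the previous run.  EXCEPT compares all the
    -- selected columns and treats NULLs as equal for the comparison.
    INSERT INTO significant_records (acc_no, field1, field2)
    SELECT acc_no, field1, field2
    FROM myTable
    WHERE update_date = '20120430'
    EXCEPT
    SELECT acc_no, field1, field2
    FROM myTable
    WHERE update_date = '20120427';
    ```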

    For some reason I am having to force the optimiser to use IDX3 using the WITH (INDEX(IDX3)) syntax, which is fine, but because I am looking at fields not in the index the query plan shows RID lookups to get the non-indexed values. I could add the fields as included columns in the index, but as it is every field I am comparing, the index would be the same size as the table. If I were to convert IDX3 into a clustered index, would this remove the RID lookup?

    I understand there would be a one-off performance hit on the database as the data is rewritten on disk and the other two non-clustered indexes are rebuilt. As the delete will be removing entire pages from the index and the insert will be adding new pages, should I expect a performance hit on the CRUD operations? The data is imported one date at a time, but the account numbers are not in sequence - will this cause page-split issues if the data is imported using an SSIS data flow task?

    Timing examples for the current setup:

    SELECT
        C.update_date,
        C.acc_no,
        P.update_date
    FROM myTable C WITH (INDEX(IDX3))
    LEFT JOIN myTable P WITH (INDEX(IDX3))
        ON C.acc_no = P.acc_no
       AND P.update_date = '20120427'
    WHERE C.update_date = '20120430'
      AND P.update_date IS NULL

    Find all records in the current date that do not exist in the previous date: 2m36s to find 6,650 records out of 3.8M - using only the index seek.

    SELECT
        C.update_date,
        C.acc_no,
        C.Shop_code,
        P.update_date
    FROM myTable C WITH (INDEX(IDX3))
    LEFT JOIN myTable P WITH (INDEX(IDX3))
        ON C.acc_no = P.acc_no
       AND P.update_date = '20120427'
    WHERE C.update_date = '20120430'
      AND P.update_date IS NULL

    Find all records (update_date, acc_no, shop_code) in the current date that do not exist in the previous date: had returned no records after 12 minutes - using an index seek plus RID lookups, because shop_code is not in the index.

    My question is: if I convert IDX3 to a clustered index, will I eliminate the RID lookups and therefore be able to access the other fields for the comparisons?

  • Just include shop_code in the index. I wouldn't make it clustered - it doesn't look like a good candidate for a clustered index.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Gila,

    Shop_code is just an example to make it use the RID lookup. I actually need to check against the other 37 fields as well, which would make the index (covering or otherwise) the same size as the table - in which case I think I would be just as well off making the current unique non-clustered index into a clustered index.

  • and clear out 3.7M rows for the update_date that was 45 days ago (dataset is growing by 60K rows per month)

    You might want to have a look at a few articles on the web relating to tables without clustered indexes (heaps):

    http://sqlserverpedia.com/wiki/Heaps

    http://www.sqlservercentral.com/Forums/Topic1020897-146-2.aspx#bm1021099

    When you DELETE from a heap, the pages are not released - there are several workarounds (see the first article).

    It's "possible" you have a lot of wasted space in your db - you can check using sp_spaceused.

    It's not mandatory to have a clustered index on the table, but make sure you know the gotchas.
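    The space check and the heap workaround look roughly like this (a sketch only - table name and cutoff date are examples):

    ```sql
    -- How much of the allocated space actually holds data?
    -- A big gap between "reserved" and "data" suggests empty
    -- pages left behind by deletes from the heap.
    EXEC sp_spaceused 'myTable';

    -- One documented workaround: take a table lock during the
    -- delete so SQL Server can deallocate the emptied pages.
    DELETE FROM myTable WITH (TABLOCK)
    WHERE update_date < '20120316';
    ```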

    MVDBA

  • Michael,

    Part of the project I am working on is to reduce the space used by this database. We started off with 710M records in this table, never deleting any of them. We have configured the process so that we keep the last 45 days' transactions (about 110M records) so that current month-end reporting still works, but we are now archiving the important records into a separate table (about 20M records instead of 760M). Due to a previous delete from this enormous heaped table, the database is about 1TB on disk even though it holds only about 200GB of actual data.

    This would only be another argument FOR creating a clustered index.

  • I have no argument against creating a clustered index.

    Converting the index to clustered for a single query (or set of queries) is not the best approach: there is only one clustered index on a table, and it affects every single nonclustered index.

    Does the update date ever get updated? Or is it static after the row is created?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • @Gila,

    I tend to agree, but I am coming to the conclusion that this application is the exception that proves the rule.

    Data gets imported as 3.8M records with the same update_date and a unique account number for that date, so update_date/Acc_no is a unique key.

    Data gets deleted en masse for specific days in blocks of 3.8M (or 7.6M, or 11.2M) records, depending on where the 45-day cut-off falls. I don't think creating the PUK with update_date as the first element is going to severely affect the (C)reate or (D)elete elements of CRUD. The records NEVER get (U)pdated; they only get (R)ead.

    The meat and potatoes of the process takes the records for the just-imported day, pratts about with them in temp tables, and converts them into a financial journal, which is then stored in other tables and inserted into a linked server (the accounting system), which then does whatever it does with them.

    Apart from DQ queries and some poorly designed month-end reporting, the records are never touched again.

    I am trying to stop the database being filled with redundant historical data; 90% of the records are identical to the previous run except for their update_date. The challenge is that I need to compare all of today's records with all of yesterday's records, comparing on all 38 fields that are not part of the unique key.

    Once the clustered index is created, the C/D overhead is minimal (I think), as I am always adding to the end of the index and deleting from the head of it. I have tried this on my DEV database, which has only 11M records rather than 110M, and with some tweaking of the stored proc got the processing time down from 40 minutes to 5m30s; however, as this process is designed to run overnight, the actual timings are not that critical as long as it doesn't take hours!

    Running just for non-existent records (i.e. previous is null) in Live without the index returned 2,818 records in 28 minutes, while DEV with the clustered index returned 14,528 records in 47s. DEV also returned a full analysis set (245K records) in 5m42s, so it is proving that the query is much more efficient with the clustered index.

  • If the records never get updated, then I would suggest a clustered index on update date, account number. That should be close to ever-increasing, far more than the other way around would be.
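    For illustration, that key order could be created like so (a sketch only - it assumes the existing nonclustered IDX3 has been dropped first, and reuses the name):

    ```sql
    -- Clustered on (update_date, acc_no): each load appends at the
    -- end of the key range, and the 45-day purge removes whole
    -- ranges from the front.
    CREATE UNIQUE CLUSTERED INDEX IDX3
        ON myTable (update_date, acc_no);
    ```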

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • @Aaron,

    I have a couple of questions:

    1) How much time does a single run take (a single execution)?

    2) What's the frequency of a single run?

    3) How much data does a single run handle?

    4) How are you migrating the data? SSIS? A job?

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • GilaMonster (5/2/2012)


    Just include shop_code in the index, I wouldn't make it clustered, doesn't look like a good candidate for a clustered index

    Gail

    I didn't get your point here - why couldn't IDX3 be a candidate for the clustered index, since the above queries always pull the data out on its basis? Please explain :-)

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • Bhuvnesh (5/4/2012)


    @Aaron,

    I have a couple of questions:

    1) How much time does a single run take (a single execution)?

    2) What's the frequency of a single run?

    3) How much data does a single run handle?

    4) How are you migrating the data? SSIS? A job?

    1) This seems to vary, even though there is no load on the server (we are the only ones using it), but it averages about 90 minutes and seems to depend entirely on the amount of data in the table (I am almost always looking at the two most recent data-entry days).

    2) Currently 4 times a week. They would like to go back to daily.

    3) Roughly 8M rows (comparing 4M from today with 4M from yesterday).

    4) Data comes in via SSIS from a pipe-delimited file, but it goes into a staging table before hitting my main table, so I guess the relevant answer is an INSERT INTO myTable (fields,...) VALUES (...,...). No data transformations are done at this point; the staging table has the same structure as my target table (don't ask...).
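    One thing I could try when loading - again just a sketch, with made-up staging-table and column names - is to insert from staging in clustered-key order, which may reduce page splits:

    ```sql
    -- Loading in (update_date, acc_no) order keeps the inserts
    -- appending at the end of the clustered index.
    INSERT INTO myTable (update_date, acc_no, field1)
    SELECT update_date, acc_no, field1
    FROM stagingTable
    ORDER BY update_date, acc_no;
    ```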

    @Gila, I have made IDX3 into a clustered index - I don't know how long it took, but it was still running after 7 1/2 hours.

    We will see next week what the performance effect is.

  • You can keep your historical/old data on another disk by setting up an archival job (hourly, or maybe a trigger) and later use these two different tables to fetch the current date's data plus yesterday's (old) data.

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)
