Index Creation is Slow After Duplicate

  • EDIT: Sorry - The title should be "Index Creation Slow After Duplicates Removed" :blush:

    Hello All,

    We have a table with 14 million rows that we load data into three times a day for reporting. It is loaded using an ETL tool.

    I recently discovered that, because the ETL tool maintains the indexing on the table (dropping and recreating indexes around each load), there were twice as many indexes as intended. Duplicate indexes were present that were not being actively maintained, yet these fragmented duplicates were still being used by live queries. I therefore suggested removing them to free space and improve query performance.

    The ETL process is designed to delete the data in the target table, drop the indexes, insert the new data, and finally rebuild the indexes. Since the process was not dropping all of the indexes, I concluded that the insert operation would be impaired. My solution was simply to drop the duplicate indexes and allow everything else to continue as 'normal'.

    Simple, I thought! The insert times should be reduced now that they would be free from the index maintenance overhead.

    When we tested this theory, the data insert was certainly quicker: it went from 20 minutes to insert 24,000 rows to under 1 minute. Great start! But after the rows had been inserted, creating the indexes took an hour, whereas before it took 6 minutes.

    Does anyone understand why this could be? And more to the point, how we can resolve this mess?

    One thought I had was that the duplicate/fragmented indexes were actually supporting the recreation of the 'proper' indexes after the rows were inserted?

    Thanks for reading... please accept my apologies for any poor terminology, since I'm not trained in SQL and am simply trying to improve the performance of a data warehouse at a company I've just joined. Be gentle! 🙂

    Chris, try disable/rebuild on the indexes; it should perform faster than drop/create.

    I don't know your underlying disk structure, but for best performance you should separate all your indexes from the data file (.mdf) and place them on a separate disk.

    Alex S
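    For reference, a disable/rebuild cycle might look like the sketch below. The index and table names are placeholders; note that disabling a clustered index would make the table inaccessible, so this pattern applies to non-clustered indexes around a load:

    ```sql
    -- Disable the non-clustered index before the load: the index definition
    -- (metadata) is kept, but its data is deallocated and no index
    -- maintenance happens during the inserts.
    ALTER INDEX IX_FactTable_SomeColumn ON dbo.FactTable DISABLE;

    -- ... run the data load here ...

    -- Rebuild it afterwards from the retained definition.
    ALTER INDEX IX_FactTable_SomeColumn ON dbo.FactTable REBUILD;

    -- Or rebuild every index on the table in one statement.
    ALTER INDEX ALL ON dbo.FactTable REBUILD;
    ```

    The advantage over drop/create is that the index definitions stay in the catalog, so the ETL job cannot "forget" to recreate one.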
  • Thanks for the suggestion Alex.

    I've tried a few tests using the disable/rebuild method instead of drop/create and found some very different timings:

    The 1st execution took 22 minutes to rebuild a single index; subsequent executions averaged 2:15.

    With the drop/create method the results averaged 3 minutes (I ran the drop/create tests after the rebuilds).

    Could the reduced time of the later executions be down to the pages already being in memory?

    If so, does this point to disk performance issues? Any idea how I can tell whether the disks are the bottleneck in this particular test case? (I guess that is a huge area!) I noticed in Activity Monitor that the process performing the ALTER INDEX was producing PAGEIOLATCH_SH waits throughout. Would it be doing this if the disks were healthy?
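    One way to check whether the disks are struggling is the per-file I/O statistics DMV; persistently high average read latency on the data files, combined with PAGEIOLATCH_SH waits, usually points at storage. A sketch (what counts as "slow" depends on the hardware, but tens of milliseconds per read is commonly treated as a warning sign):

    ```sql
    -- Average read/write latency per database file since SQL Server last started
    SELECT  DB_NAME(vfs.database_id) AS database_name,
            mf.physical_name,
            vfs.num_of_reads,
            vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
            vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_ms DESC;
    ```

    The numbers are cumulative since the last restart, so comparing snapshots taken before and after the index build isolates the build's own I/O behaviour.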

    Our company does not have a dedicated DBA, so it falls on us (developers/application owners) to optimise and make suggestions to the technical team. (As long as the servers are running and patched, they are happy.)

    Thanks again,

    Chris

    When inserting the data, are you performing an ordered insert, or do you have a clustered index performing sort operations?

    This question has already been discussed in the threads below:

    http://www.sqlservercentral.com/Forums/Topic531822-360-1.aspx

    http://msdn.microsoft.com/en-us/library/ms189858.aspx

    Jayanth Kurup

    I don't think it is performing an ordered insert, nor is there a clustered index performing a sort. The table is used as a fact table in our data warehouse with no PK or clustered index, since all referential integrity is controlled within the ETL tool.

    It's not my design, but I'm keen to improve it in any way I can.

    Regards,

    Chris

    This may sound a daft idea, but have you tried reverting everything back to how it was? That way you can determine for sure whether it was actually your changes that caused the "slowness" of the index rebuild, as opposed to, for example, something else running on the server at the same time.

    Could it be something simple, like a user or process accessing the table as soon as the data has been loaded and blocking the index from being created?

    SQLGeordie
    Web: Jarrin Consultancy
    Blog: www.chrisjarrintaylor.co.uk
    Twitter: @SQLGeordie

  • Jayanth_Kurup (7/29/2011)


    When inserting the data, are you performing an ordered insert, or do you have a clustered index performing sort operations?

    This question has already been discussed in the threads below:

    http://www.sqlservercentral.com/Forums/Topic531822-360-1.aspx

    http://msdn.microsoft.com/en-us/library/ms189858.aspx

    Chris,

    I'm not sure if I have the sequence correct here, but it seems that in the test the data was loaded, then an index rebuild, then a drop and create index. Once the table is loaded and you've built or rebuilt the index the first time, the table will be in clustered index order already for the next rebuild or drop and create. This could lead to less sorting of the clustered index since the table is already in that order.

    To get a proper test comparison of the different methods, you would have to truncate the table and re-load the data.

    Todd Fifield

    I didn't read the whole thing, but I need to point out that you might want to rebuild the statistics that sit outside the rebuilt indexes.

    For this reason : http://sqlinthewild.co.za/index.php/2011/03/22/statistics-row-estimations-and-the-ascending-date-column/
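    A manual refresh of those column statistics after the load might be sketched like this (the table name is a placeholder):

    ```sql
    -- Refresh all statistics on the fact table with a full scan after the load;
    -- this covers auto-created column stats as well as index stats.
    UPDATE STATISTICS dbo.FactTable WITH FULLSCAN;

    -- Alternatively, refresh stale statistics database-wide (sampled):
    EXEC sp_updatestats;
    ```

    FULLSCAN gives the most accurate histograms at the cost of a full table read, which may matter on a 14-million-row table; a sampled update is the cheaper compromise.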

  • ChrisTaylor (8/2/2011)


    this may sound a daft idea but have you tried reverting everything back to how it was? This way you can determine for sure whether it was actually your changes which has caused the "slowness" of the index rebuild as opposed to something else running on the server at the same time for example......

    Could it be something simple like as soon as the data has been loaded someone/some process is accessing the table and blocking the index from being created?

    We ran an initial data load before we made the change and the index build times were 'normal', in the range of a few minutes each. We then dropped the duplicated indexes and tried again, which is when we found the slow rebuild times for the indexes as part of another data load (the same one we performed earlier).

    In terms of something else running on the server at the same time, I cannot discount it, but I think it is unlikely as we are the only ones with access. And as far as I can see there are no blocks; it just seems that it is incredibly slow the first time. This is a test server, though, and I don't think it has been optimised in terms of storage (not sure if that makes it the best test server or not :)).

    Any guidance as to how I can monitor what SQL Server is doing during an index creation would be greatly appreciated.

    Chris

  • tfifield (8/2/2011)


    Chris,

    I'm not sure if I have the sequence correct here, but it seems that in the test the data was loaded, then an index rebuild, then a drop and create index. Once the table is loaded and you've built or rebuilt the index the first time, the table will be in clustered index order already for the next rebuild or drop and create. This could lead to less sorting of the clustered index since the table is already in that order.

    To get a proper test comparison of the different methods, you would have to truncate the table and re-load the data.

    Todd Fifield

    Hi Todd,

    That makes sense to me, thanks, but (forgive me if I've not got a good enough understanding of indexes here) there are no clustered indexes on the table, only non-clustered ones. There are no constraints on the table since integrity is ensured through the ETL tool. So the table would be considered a heap? I've checked the table manually with a "SELECT * FROM" and the data is returned unordered; I cannot see any column that it is sorted on.

    To my amateur mind, this would certainly impact index creation performance if additional sort operations are required.

  • Ninja's_RGR'us (8/2/2011)


    Didn't read the whole thing, but I need to point out that you might want to rebuild the stats (that are outside the rebuilt index).

    For this reason : http://sqlinthewild.co.za/index.php/2011/03/22/statistics-row-estimations-and-the-ascending-date-column/

    Thanks for your link... interesting reading! All of the indexes on the table in question are dropped and recreated as part of the process. Would this also mean that all statistics would be updated at that time, or would the auto-update not be triggered?

    Chris

  • chris.davidson (8/3/2011)


    Ninja's_RGR'us (8/2/2011)


    I didn't read the whole thing, but I need to point out that you might want to rebuild the statistics that sit outside the rebuilt indexes.

    For this reason : http://sqlinthewild.co.za/index.php/2011/03/22/statistics-row-estimations-and-the-ascending-date-column/

    Thanks for your link... interesting reading! All of the indexes on the table in question are dropped and recreated as part of the process. Would this also mean that all statistics would be updated at that time, or would the auto-update not be triggered?

    Chris

    That's a 100% change in the data, so auto-update has to kick in.

    A couple more things. You say the process is slow the first time and then fast all other times. That is usually because the disks are slow getting data into RAM; subsequent runs are fast because RAM is far faster than the disks.

    A final point is about heaps. Due to their nature, the heap could be causing you problems. I can't recommend strongly enough putting a clustered index on that table.

    The worst case I've seen is a heap with ±190 rows taking 15 GB, yes GB, of blank pages in the db (no BLOBs). Heaps don't always reuse blank pages. This is the nature of the heap, and it's why MS strongly recommends using clustered indexes (an identity column is fine if you don't have a real key, but there may be better options out there).
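    If there is no natural key, the identity approach might look like the sketch below (table and index names are placeholders; adding the column and building the index will rewrite a 14-million-row table, so it needs a maintenance window):

    ```sql
    -- Add a surrogate key and cluster on it, so the heap's unreclaimed
    -- empty pages and forwarding pointers go away.
    ALTER TABLE dbo.FactTable ADD FactID INT IDENTITY(1,1) NOT NULL;

    CREATE UNIQUE CLUSTERED INDEX CIX_FactTable_FactID
        ON dbo.FactTable (FactID);
    ```

    An ever-increasing identity also means new rows always land at the logical end of the table, which keeps fragmentation low during the loads.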

    Before adding the index, I'd like you to run the script below for that table and post the results back... I want to see if the script can pick up that issue.

    /* http://jasonbrimhall.info/2010/05/19/tablespace-update/ */
    SET IMPLICIT_TRANSACTIONS OFF
    SET NOCOUNT ON

    /* Part I */
    --DROP TABLE #indstats
    IF EXISTS (SELECT * FROM tempdb.sys.objects WHERE name LIKE '%#indstats%')
    BEGIN
        DROP TABLE tempdb.dbo.#indstats
    END

    BEGIN
        CREATE TABLE #indstats (
            IndStatsID   INT PRIMARY KEY CLUSTERED
            ,database_id BIGINT
            ,index_id    BIGINT
            ,IndexSizeMB DECIMAL(16,1)
            ,OBJECT_ID   BIGINT
        );
    END

    INSERT INTO #indstats (IndStatsID,database_id,index_id,OBJECT_ID,IndexSizeMB)
    SELECT Row_Number() OVER (ORDER BY OBJECT_ID) AS IndStatsID
        ,database_id,index_id,OBJECT_ID
        ,CONVERT(DECIMAL(19,2),(SUM(ps.page_count))) * 8 / 1024 AS IndexSizeMB
    FROM sys.dm_db_index_physical_stats(DB_ID(),NULL,NULL,NULL,'DETAILED') ps
    GROUP BY database_id,OBJECT_ID,index_id;

    /* Part II */
    DECLARE @dbsize DECIMAL(19,2)
    SET NOCOUNT ON

    /*
    ** Summary data.
    */
    BEGIN
        SELECT @dbsize = SUM(CONVERT(DECIMAL(19,2),CASE WHEN type = 0 THEN SIZE ELSE 0 END)) * 8 / 1024
        FROM sys.database_files
    END

    /* Part III */
    BEGIN
    WITH RegData AS (
        SELECT a.container_id,p.OBJECT_ID,p.index_id,us.database_id
            ,FileGroupName = FILEGROUP_NAME(a.data_space_id)
            ,TableName = OBJECT_NAME(p.OBJECT_ID)
            ,NumRows = p.ROWS
            ,UsedPages = IsNull(a.used_pages,0)
            ,TotalPages = IsNull(a.total_pages,0)
            ,DataSizeMB = CONVERT(DECIMAL(19,2),IsNull(a.used_pages,0)) * 8 / 1024
            ,IndexSizeMB = CASE WHEN ps.index_id < 2 THEN 0 ELSE ps.IndexSizeMB END
            ,UserRequests = IsNull(us.user_seeks,0) + IsNull(us.user_scans,0) + IsNull(us.user_lookups,0)
            ,UserUpdates = IsNull(us.user_updates,0)
            ,LastUpdate = IsNull(us.last_user_update,NULL)
            ,RatioRequestsToUpdates = CAST(IsNull(us.user_seeks,0)
                + IsNull(us.user_scans,0) + IsNull(us.user_lookups,0) AS REAL)
                / CAST(CASE us.user_updates WHEN 0 THEN 1 ELSE us.user_updates END AS REAL)
        FROM sys.allocation_units a
        INNER JOIN sys.partitions p
            ON p.hobt_id = a.container_id
            AND a.type = 1
        LEFT OUTER JOIN sys.dm_db_index_usage_stats us
            ON us.OBJECT_ID = p.OBJECT_ID
            AND us.index_id = p.index_id
            AND us.database_id = DB_ID()
        LEFT OUTER JOIN #indstats ps
            ON p.index_id = ps.index_id
            AND ps.database_id = DB_ID()
            AND p.OBJECT_ID = ps.OBJECT_ID
        --WHERE OBJECTPROPERTY(p.OBJECT_ID,'IsMSShipped') = 0
    )
    , LOBData AS (
        SELECT a.container_id,p.OBJECT_ID,p.index_id,us.database_id
            ,FileGroupName = FILEGROUP_NAME(a.data_space_id)
            ,TableName = OBJECT_NAME(p.OBJECT_ID)
            ,NumRows = p.ROWS
            ,UsedPages = IsNull(a.used_pages,0)
            ,TotalPages = IsNull(a.total_pages,0)
            ,DataSizeMB = CONVERT(DECIMAL(19,2),IsNull(a.used_pages,0)) * 8 / 1024
            ,IndexSizeMB = CASE WHEN ps.index_id < 2 THEN 0 ELSE ps.IndexSizeMB END
            ,UserRequests = IsNull(us.user_seeks,0) + IsNull(us.user_scans,0) + IsNull(us.user_lookups,0)
            ,UserUpdates = IsNull(us.user_updates,0)
            ,LastUpdate = IsNull(us.last_user_update,NULL)
            ,RatioRequestsToUpdates = CAST(IsNull(us.user_seeks,0)
                + IsNull(us.user_scans,0) + IsNull(us.user_lookups,0) AS REAL)
                / CAST(CASE us.user_updates WHEN 0 THEN 1 ELSE us.user_updates END AS REAL)
        FROM sys.allocation_units a
        INNER JOIN sys.partitions p
            ON p.partition_id = a.container_id
            AND a.type = 2
        LEFT OUTER JOIN sys.dm_db_index_usage_stats us
            ON us.OBJECT_ID = p.OBJECT_ID
            AND us.index_id = p.index_id
            AND us.database_id = DB_ID()
        LEFT OUTER JOIN #indstats ps
            ON p.index_id = ps.index_id
            AND ps.database_id = DB_ID()
            AND p.OBJECT_ID = ps.OBJECT_ID
        --WHERE OBJECTPROPERTY(p.OBJECT_ID,'IsMSShipped') = 0
    )
    , OverFlowData AS (
        SELECT a.container_id,p.OBJECT_ID,p.index_id,us.database_id
            ,FileGroupName = FILEGROUP_NAME(a.data_space_id)
            ,TableName = OBJECT_NAME(p.OBJECT_ID)
            ,NumRows = p.ROWS
            ,UsedPages = IsNull(a.used_pages,0)
            ,TotalPages = IsNull(a.total_pages,0)
            ,DataSizeMB = CONVERT(DECIMAL(19,2),IsNull(a.used_pages,0)) * 8 / 1024
            ,IndexSizeMB = CASE WHEN ps.index_id < 2 THEN 0 ELSE ps.IndexSizeMB END
            ,UserRequests = IsNull(us.user_seeks,0) + IsNull(us.user_scans,0) + IsNull(us.user_lookups,0)
            ,UserUpdates = IsNull(us.user_updates,0)
            ,LastUpdate = IsNull(us.last_user_update,NULL)
            ,RatioRequestsToUpdates = CAST(IsNull(us.user_seeks,0)
                + IsNull(us.user_scans,0) + IsNull(us.user_lookups,0) AS REAL)
                / CAST(CASE us.user_updates WHEN 0 THEN 1 ELSE us.user_updates END AS REAL)
        FROM sys.allocation_units a
        INNER JOIN sys.partitions p
            ON p.hobt_id = a.container_id
            AND a.type = 3
        LEFT OUTER JOIN sys.dm_db_index_usage_stats us
            ON us.OBJECT_ID = p.OBJECT_ID
            AND us.index_id = p.index_id
            AND us.database_id = DB_ID()
        LEFT OUTER JOIN #indstats ps
            ON p.index_id = ps.index_id
            AND ps.database_id = DB_ID()
            AND p.OBJECT_ID = ps.OBJECT_ID
        --WHERE OBJECTPROPERTY(p.OBJECT_ID,'IsMSShipped') = 0
    )
    , IndexSum AS (
        SELECT a.OBJECT_ID
            ,AllDataSizeMB = SUM(CASE WHEN a.index_id IN (0,1)
                THEN IsNull(a.DataSizeMB,0) + IsNull(p2.DataSizeMB,0) + IsNull(p3.DataSizeMB,0)
                ELSE IsNull(p2.DataSizeMB,0) + IsNull(p3.DataSizeMB,0)
                END)
        FROM RegData a
        LEFT OUTER JOIN LOBData p2
            ON p2.container_id = a.container_id
        LEFT OUTER JOIN OverFlowData p3
            ON p3.container_id = a.container_id
        GROUP BY a.OBJECT_ID
    )
    , SummaryInfo AS (
        SELECT
            TableName = MAX(a.TableName)
            ,InRowDataSizeMB = SUM(IsNull(a.DataSizeMB,0))
            ,LOBDataSizeMB = SUM(IsNull(p2.DataSizeMB,0))
            ,OFlowDataSizeMB = SUM(IsNull(p3.DataSizeMB,0))
            ,NumRows = MAX(COALESCE(a.NumRows,p2.NumRows,p3.NumRows,0))
            ,AllUsedPages = SUM(IsNull(a.UsedPages,0) + IsNull(p2.UsedPages,0) + IsNull(p3.UsedPages,0))
            ,AllPages = SUM(IsNull(a.TotalPages,0) + IsNull(p2.TotalPages,0) + IsNull(p3.TotalPages,0))
            ,FreeDataSpaceMB = CONVERT(DECIMAL(19,2),
                SUM(IsNull(a.TotalPages,0) + IsNull(p2.TotalPages,0) + IsNull(p3.TotalPages,0))
                - SUM(IsNull(a.UsedPages,0) + IsNull(p2.UsedPages,0) + IsNull(p3.UsedPages,0))) * 8 / 1024
            ,AllDataSizeMB = MAX(ids.AllDataSizeMB)
            ,IndexSizeMB = SUM(IsNull(a.IndexSizeMB,0))
                + SUM(IsNull(p2.IndexSizeMB,0)) + SUM(IsNull(p3.IndexSizeMB,0))
            ,UserRequests_Cnt = AVG(IsNull(a.UserRequests,0)
                + IsNull(p2.UserRequests,0) + IsNull(p3.UserRequests,0))
            ,UserUpdates_Cnt = AVG(IsNull(a.UserUpdates,0) + IsNull(p2.UserUpdates,0) + IsNull(p3.UserUpdates,0))
            ,LastUpdate = MAX(COALESCE(a.LastUpdate,p2.LastUpdate,p3.LastUpdate,NULL))
            ,DatabaseSize = @dbsize
        FROM RegData a
        LEFT OUTER JOIN LOBData p2
            ON p2.container_id = a.container_id
        LEFT OUTER JOIN OverFlowData p3
            ON p3.container_id = a.container_id
        LEFT OUTER JOIN sys.indexes i
            ON i.OBJECT_ID = a.OBJECT_ID
            AND i.index_id = a.index_id
        LEFT OUTER JOIN IndexSum ids
            ON i.OBJECT_ID = ids.OBJECT_ID
        --WHERE filegroup_name(a.data_space_id) = 'Primary'
        GROUP BY a.OBJECT_ID
    )
    SELECT TableName,NumRows,InRowDataSizeMB,LOBDataSizeMB,OFlowDataSizeMB
        ,AllUsedPages,AllPages
        ,FreeDataSpaceMB,AllDataSizeMB,IndexSizeMB
        ,TableSizeMB = AllDataSizeMB + IndexSizeMB + FreeDataSpaceMB
        ,UserRequests_Cnt,UserUpdates_Cnt,LastUpdate
        ,PercentofDB = ((IndexSizeMB + AllDataSizeMB) / DatabaseSize) * 100
        ,DatabaseSize
    FROM SummaryInfo
    ORDER BY PercentofDB DESC, NumRows DESC
    END

    I can't agree more with Ninja's comments on tables needing clustered indexes. A quick way to look at a table's indexes is:

    sp_helpindex 'TableName'

    If there really is no clustered index, and no reason to create one since it's an ETL-type table, then you may be better off dropping and re-creating the table between data loads. That way you will know for certain that the unused space is given back to SQL Server.
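    A drop-and-recreate step between loads might be sketched like this (the table name and columns are placeholders, not the actual fact table's schema):

    ```sql
    -- Recreate the fact heap from scratch so all of its pages are released
    -- back to the database before the next load.
    IF OBJECT_ID('dbo.FactTable', 'U') IS NOT NULL
        DROP TABLE dbo.FactTable;

    CREATE TABLE dbo.FactTable (
        OrderKey    INT           NOT NULL,
        CustomerKey INT           NOT NULL,
        Amount      DECIMAL(19,4) NOT NULL,
        LoadDate    DATETIME      NOT NULL
    );

    -- Alternatively, TRUNCATE TABLE dbo.FactTable also deallocates the
    -- heap's pages without having to re-issue the DDL.
    ```

    TRUNCATE is usually the simpler option since the table definition and permissions survive; the full drop/create only buys you a guaranteed clean slate if the schema itself changes between loads.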

    I've also seen a single heap table that was being pounded constantly end up taking half of the whole database size.

    Todd Fifield

    How many GB were the heap and the db?

  • Remi,

    The total size of the DB was around 2 GB. The culprit was the main table used for orders and installations. It's non-clustered PK was a GUID (ouch!). The table was constantly inserted and constantly updated and it had more than a few VARCHAR columns that would be empty on insert and then filled on UPDATE, thus causing page splits. Somewhat regularly records were deleted, thus creating empty pages (since no clustered index).

    It took about 6 months for the table to take up half the space of the database. It was a bloody mess.

    Todd Fifield

Viewing 15 posts - 1 through 15 (of 19 total)
