Indexes in warehouse data load

  • I need a data load to a data warehouse to run as quickly as possible whilst not affecting daytime query performance (nothing unusual there).  Most tables have at least one non-clustered index.  Because the data load is performed once nightly and (ignoring a few tables treated slightly differently) they are truncated then filled from scratch, then remain "read only" (i.e. daytime they only receive SELECTs), the FILLFACTOR on the indexes is set to 100.

    From my understanding, this FILLFACTOR will not be dynamically maintained by SQL Server.  This may not be the case because the table is being emptied, and then refilled in one INSERT, but I have ran with the assumption that it won't be maintained.  If I am wrong about this, then my next actions become redundant.

    On this assumption, the index needs to be recreated after the data load to ensure query performance.  I can do this using CREATE INDEX ... DROP_EXISTING, but then the index would be created twice (once during the data load, then again for the CREATE INDEX statement).  So, my answer so far is to drop the index, do the INSERT, then recreate it.

    Is this the best approach or am I making much ado about nothing?

  • When we load our datawarehouse, almost everything is dropped first, then we BCP the data in and build the indexes and the rest at last. I think this is the most effective way to load huge masses of data.

    Here's an interesting part from BOL (creating an index):

    If a clustered index is created on a table with several secondary indexes, all of the secondary indexes must be rebuilt so that they contain the clustering key value instead of the row identifier (RID). Likewise, if a clustered index is deleted on a table that has several nonclustered indexes, the nonclustered indexes are all rebuilt as part of the DROP operation. This may take significant time on large tables.

    The preferred way to build indexes on large tables is to start with the clustered index and then build the nonclustered indexes. When dropping all indexes, drop the nonclustered indexes first and the clustered index last. That way, no indexes need to be rebuilt.

    And yes, WITH DROP EXISTING can speed things up.

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • I also dropped all indices before importing data.

    The difference in performance is very noticeable. If time allowed I also used to stop MSSQLSERVER and SQLSERVERAGENT and run a disk defrag.

    I once tried it without dropping the indices and it took 14 hours. With a drop, BCP and CREATE it took 5 hours.

    My understanding is that FILLFACTOR only affects the last active leaf of the index. If a new leaf is created then it will get a FILLFACTOR of 0/100 in any case.

    My understanding of PAD_INDEX is that it leaves space in the branches of the index that allow inserts.

  • You could always set up incramental loading on your larger tables.

    Also a similar post that might be worth looking at - http://www.sqlservercentral.com/forums/shwmessage.aspx?messageid=121370

  • Thanks everyone for your replies and the link: very helpful!

    I did not mention that all tables have primary keys (of course) so they have clustered indexes created by default.  I have left these untouched and just dropped and recreated the non-clustered indexes, since I'm truncating first.  From all the info given this also seems to be best practice.

  • We have changed our loading strategy over time, and depending upon the situation and changing business requirements. As our company becomes more dependent upon our data warehouses and the warehouse tables are becoming used more on a 24X7 basis we had a need to change our loading strategy. We used to do a truncate and reload (using BCP) for most of our warehouse tables, dropping all the indexes first to speed loading. We would also try to have users access the data through views and have the views access the tables using the NOLOCK hint to try to prevent locking issues. We addressed other locking issues by doing a KILL before performing the load. I found this really was not kind to the users and we still encountered locking issues. Now for those warehouse tables where the percent of data that changes is relatively small when copmpared to the overall size of the table and we have room in the database, we perform an incremental load using checksum processing; the table we would like to totally refresh is now loaded into a working table (using BCP with no indexes), a primary key is built on the working table, checksums are calculated for the working table and the warehouse table, and the individual table rows are updated accordingly. This strategy minimizes locking issues and allows users to use the table during the update. Indexes on the warehouse table are left in place (but should be rebuilt periodically as required depending upon the percent of data being updated and how quickly the data / indexes become fragmented).

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply