Fill factor on historical table

  • Hi you all!

    I have a doubt about setting fill factor. I've read that when we use a identity column on a table and this table has no update, we must use a fill factor with 95 or higher. And so, when we use a date as part of the key in a historical table, should we follow the same idea when using identity column since the rows we are inserting each passing day is always increasing (the snapshot) ???. Just to clarify... it is about a tables used in a data warehouse where we create a snapshot from the entire table all days.

    Regards,

    Rafael Melo - Br

  • From The Clustered Index Debate Continues... by Kim Tripp

    The first and most important point to stress is that minimizing page splits is NOT the only reason nor is it the most important. In fact, the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits).

    Choose a key that makes sense for how you'll access the data. I doubt the datetime column you're referring too will be unique in your table in and of itself, date types rarely are in a DW scenario. By adding more columns to the key to make it unique I suspect you'll be risking getting away from "narrow". You could always rely on the uniqueifier, and I have seen that work well as a 12-byte key is not too bad (8-byte datetime + 4-byte uniqueifier). Using smalldatetime would be better if you do not need the precision to get to an 8-byte key (4-byte smalldatetime + 4-byte uniqueifier). An identity column is a good place to start your evaluation as it satisfies all three properties in the article. If you choose an identity and you're sure you won't ever update your table then I would choose a FF of 100.

    At the end of the day though it will depend on your data and workload, which we cannot see, so test different approaches until you meet your desired performance levels.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Opc,

    I have sure i have expressed myself in a wrong way. Actually i have a composite column in my index key based on my business requirement. But besides my business key i have in my index the date (with no time) as part of it. One of the my tables that i'm looking at is a "delivered products" because we are a agricultural cooperative. So, we need to keep the historical information for all the records from this table for each passing day, because we need accomplish the "live" from this records since the product was delivery until it be completely sold. As i have said, we keep the snapshot from all records for each day.

    Eg. of my key : LoadDate, Customer, Product, DeliveryDate

    My "LoadDate" is the date i control the snapshot. So, my doubt is about it. Since the "LoadDate" is always incremental and this table has no delete nor update.... What FF should I use in this case...

    Please, if i wasn't clear enough, let me know....

    Regards,

    Rafael - Br

  • If LoadDate is ever-increasing, you always insert your snapshots for a given LoadDate using a single insert...select statement (i.e. you never go back and add rows with an existing LoadDate later) and you never update that table then I would choose a FF of 100.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Opc,

    Many thanks for your answer. But just clarify me one more thing. Lets assume this is my key (LoadDate, Customer, Product, DeliveryDate) and it is a clustered index and the load of the data is done using a single insert statement with no order by clause. We know when using a clustered index the data are stored in a physical order based on the key. So, what happen in my insert since it is not ordered by the key??? Of course the LoadDate will be the same but the others keys could NOT be in a sort way. Is it generate fragmentation on clustered index? In this case should i use the FF of 100?

    Since now thanks for your helping.

    Regards,

    Rafael Melo - BR

  • It may generate fragmentation. It will depend on the number of CPUs you have and whether SQL Server decides to parallelize the insert. All an ORDER BY will do is guarantee logical order which would set any IDENTITY column per your ORDER BY. You can try your INSERT...SELECT with an ORDER BY and MAXDOP hint of 1 to see how that affects things. The only way to guarantee no fragmentation may be to insert one row at a time, but that would be no fun. Test it out and see, that will be the only way to know. I would use FF 100, and adjust down from there after testing proved a better path based on your environment.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Using LoadDate as the leading column in primary clustered key makes sense in a table containing historical rows, because it's sequential (I'm assuming it's based on system date at the time of insert), and most queries against historical tables are within the context of time anyhow.

    To make this clustered key both unique and narrow, you can consider combining LoadDate with a sequential id (LoadSeqID) that is derived using the rank() function like in the example below. So each row within each LoadDate has a unique sequential ID. The advantage of rank() over row_number() is that an idential insert will result in an identical LoadSeqID, which is what you want to prevent duplicates.

    declare @Delivery_History table

    (

    primary key (LoadDate, LoadSeqID ),

    LoadDate Date not null,

    LoadSeqID int,

    Customer int not null,

    Product int not null,

    DeliveryDate Date not null

    );

    insert into @Delivery_History

    ( LoadDate, LoadSeqID, Customer, Product, DeliveryDate )

    select

    getdate(),

    rank() over ( order by Customer, Product, DeliveryDate ) LoadSeqID,

    Customer, Product, DeliveryDate

    from

    (

    select 45 Customer, 4553 Product, '2012-05-01' DeliveryDate union all

    select 45 Customer, 2282 Product, '2012-05-01' DeliveryDate union all

    select 52 Customer, 1131 Product, '2012-05-02' DeliveryDate

    ) Deliveries;

    select * from @Delivery_History;

    LoadDate LoadSeqID Customer ProductDeliveryDate

    2012-05-10 1 45 22822012-05-01

    2012-05-10 2 45 22822012-05-01

    2012-05-10 3 45 45532012-05-01

    2012-05-10 4 52 11312012-05-02

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Hello Eric,

    Since my business key do not duplicate, the identity column is not necessary, is it?

    Eric, i use the system date at the time of my insert. The Customer, Product and DeliveryDate doesn't not return in a sequential way in my select/insert. Does it generate any fragmentation since the SQL Server must ordered this columns to insert according my key? Or should I use an order by clause to order my key before inserting?

    Regards,

    Rafael Melo - BR

  • rafael_si (5/14/2012)


    Hello Eric,

    Since my business key do not duplicate, the identity column is not necessary, is it?

    No, it's not necessary. Having LoadDateTime + ID would just be an alternative method of building a PK, if you didn't want to include five or six business columns in a natural PK. One thing to consider is that the clustered key is used as the table's ROWID when building non-clustered indexes. So, the shorter your table's PK, the smaller your indexes.

    rafael_si (5/14/2012)

    Eric, i use the system date at the time of my insert. The Customer, Product and DeliveryDate doesn't not return in a sequential way in my select/insert. Does it generate any fragmentation since the SQL Server must ordered this columns to insert according my key? Or should I use an order by clause to order my key before inserting?

    Regards,

    Rafael Melo - BR

    From what I've seen, when inserting a large resultset into a clustered table, the execution plan and fragmentation percentage is identical regardless of wether you specify an ORDER BY. It's as if SQL Server will ignore the ORDER BY and always perform it's own Sort operation. However, it depends on many different variables like fill factor, table partitioning, or advanced query hints or trace flags.

    I'd suggest staging some data on a development database and experimenting with several variations of the INSERT. Enable 'Include Actual Execution Plan' when running the tests, and also use the following script to confirm the level of fragmentation.

    SELECT

    schema_name(ST.schema_id)schema_name,

    (object_name(IPS.object_id))table_name,

    (SI.name)index_name,

    (IPS.Index_type_desc)index_type,

    (CASE WHEN ips.index_level = 0 THEN 'LEAF' ELSE 'NON-LEAF' END) AS index_level,

    (IPS.alloc_unit_type_desc)alloc_unit,

    ((sum(IPS.page_count)*8000)/(1024*1024))size_mb,

    sum(IPS.record_count)rec_count,

    cast(avg(IPS.avg_record_size_in_bytes) as numeric(9))avg_rec_bytes,

    cast(avg(IPS.avg_fragmentation_in_percent) as numeric(6,2))avg_frag_pct,

    sum(IPS.fragment_count)frag_count,

    cast(avg(IPS.avg_fragment_size_in_pages) as numeric(5,1))avg_frag_size_pages,

    cast(avg(IPS.avg_page_space_used_in_percent) as numeric(6,2))avg_page_used_pct,

    sum(IPS.forwarded_record_count)forwarded_count,

    sum(IPS.ghost_record_count)ghost_count

    FROM sys.dm_db_index_physical_stats

    (

    db_id(),

    NULL,

    NULL,

    NULL ,

    'DETAILED'

    ) IPS

    JOIN sys.tables ST WITH (nolock) ON IPS.object_id = ST.object_id

    JOIN sys.indexes SI WITH (nolock) ON IPS.object_id = SI.object_id AND IPS.index_id = SI.index_id

    WHERE

    schema_name(ST.schema_id) = 'MySchema'

    and (object_name(IPS.object_id)) = 'MyTable'

    GROUP BY

    schema_name(ST.schema_id),

    (object_name(IPS.object_id)),

    (SI.name),

    (IPS.Index_type_desc),

    (IPS.alloc_unit_type_desc),

    (CASE WHEN ips.index_level = 0 THEN 'LEAF' ELSE 'NON-LEAF' END)

    ORDER BY

    (schema_name(ST.schema_id)),

    (object_name(IPS.object_id)),

    case (IPS.Index_type_desc) when 'CLUSTERED INDEX' then 1 when 'NONCLUSTERED INDEX' then 2 else 3 end,

    (SI.name),

    (CASE WHEN ips.index_level = 0 THEN 'LEAF' ELSE 'NON-LEAF' END)

    GO

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (5/14/2012)


    rafael_si (5/14/2012)


    Hello Eric,

    Since my business key do not duplicate, the identity column is not necessary, is it?

    No, it's not necessary. Having LoadDateTime + ID would just be an alternative method of building a PK, if you didn't want to include five or six business columns in a natural PK. One thing to consider is that the clustered key is used as the table's ROWID when building non-clustered indexes. So, the shorter your table's PK, the smaller your indexes.

    The Primary Key (PK) is not the determining factor. The clustered index is the determining factor. They do not have to be one in the same so should not be used interchangeably in this context.

    I'd suggest staging some data on a development database and experimenting with several variations of the INSERT. Enable 'Include Actual Execution Plan' when running the tests...

    +1

    The only way to know how to properly set your fill factor is to test with your data. Try different coding approaches, evaluate the results and then decide.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Thank you guys for you time and patience....

    I will do what you've proposed in my test environment. My doubts was solved!

    obs: is there a way in this forum to close this thread or mark as problem solved?

    Thank you all and have a nice week.

    regards,

    Rafael Melo - Br

  • rafael_si (5/14/2012)


    Thank you guys for you time and patience....

    I will do what you've proposed in my test environment. My doubts was solved!

    obs: is there a way in this forum to close this thread or mark as problem solved?

    No need (or way) to close threads here. Someone may come along and correct something said on the thread at a later time. This is a discussion forum. Unlike some of the Question & Answer forums out there where 'unanswered', 'answered' or 'closed' are states a thread can get into, here, a thread is always open for new information or discussion.

    Thank you all and have a nice week.

    regards,

    Rafael Melo - Br

    Thanks! You too 🙂

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply