Building a good Partitioning Key

  • Hi there,

    over the last few days I have read a lot about partitioning, and I am getting more and more sure that I am not doing a good job of it 🙂

    Here is my situation:

    We have a DWH with about 125 million users. The SELECTs we build usually touch only a small section of the DWH, so we created several tables that each contain only coherent attributes, e.g. first/last name, address, last login and so on.

    We also built a partitioning key composed of:

    - the portal,

    - the (login) activity of the user,

    - a flag indicating whether the row is historised or the current value (all tables contain a valid_from date, and the historised tables a valid_to as well).

    Example:

    - historised: 100000

    - non-historised: 0

    +

    - portal 1: 1000

    - portal 2: 2000

    - portal x: x000

    +

    - activity <= 30 days: 30

    - activity <= 90 days: 90

    - activity <= 180 days: 180

    - activity > 180 days: 181

    Added together, e.g.: 100000 + 2000 + 90 = 102090.
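    In T-SQL the key could be built roughly like this (a simplified sketch; the table and column names are placeholders, not our real schema, and days_since_login is assumed to be a stored column that we refresh daily):

    -- persisted computed column assembling the composite partition key
    ALTER TABLE dbo.user_attributes ADD partition_key AS
        (CASE WHEN is_historised = 1 THEN 100000 ELSE 0 END
         + portal_id * 1000
         + CASE WHEN days_since_login <= 30  THEN 30
                WHEN days_since_login <= 90  THEN 90
                WHEN days_since_login <= 180 THEN 180
                ELSE 181 END) PERSISTED;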

    We update this key every day, which causes a lot of page splits and therefore fragmentation.
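    The extent of the fragmentation can be checked like this (a sketch; the database and table names are placeholders):

    -- fragmentation of the clustered index (index_id = 1)
    SELECT index_id, avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(
        DB_ID(N'DWH'), OBJECT_ID(N'dbo.user_attributes'), 1, NULL, 'LIMITED');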

    As I NOW know, there are four main attributes that are desirable for a clustering key:

    1. Narrow

    2. Unique

    3. Unchanging

    4. Ever increasing

    As you can see, we may have achieved "narrow", but damn sure not the rest.

    So I tried to figure out a new key:

    - I still think that the portal and maybe the activity of the user should be part of the partition key.

    - Including the activity brings the problem that the key would not be "unchanging".

    - Using the user_id would make the partition key "unique" for the non-historising tables. Adding the valid_from would make the key "unique" for the historising tables, too.

    However, the user_id is created by an Oracle server from a sequence, so from our side it is not guaranteed to be "ever increasing".

    Additionally, the partition key is getting so big that I am just below the limit of what a BIGINT can hold 😀

    and that could be a problem too, as a key should not be too big, since it is stored in every single nonclustered index.

    So I am out of ideas. Does anyone have an idea how I can use the partitioning function efficiently?

    Additional info: as we split the information about a user across several tables, we join them on the user_id. That might be important for the partition key.

    Cheers,

    Mitch

  • Help 🙂

  • As you can read in several articles, there is a discussion about whether more than one file per filegroup increases performance or not.

    However, they all seem to agree that several filegroups will increase performance.

    Sooo, here is another thought: why not put the tables in different filegroups, say an extra filegroup for each of the most used tables, so that a join across them would involve several filegroups?

    Would that be enough, or should I still consider a partition key so I can split those tables over even more filegroups?
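    For the second option, the mechanics would look roughly like this (a sketch only; the filegroups FG1 through FG4 and the boundary values are placeholders):

    CREATE PARTITION FUNCTION pf_user_id (BIGINT)
        AS RANGE LEFT FOR VALUES (40000000, 80000000, 120000000);

    CREATE PARTITION SCHEME ps_user_id
        AS PARTITION pf_user_id TO (FG1, FG2, FG3, FG4);

    -- a table is then created (or its clustered index rebuilt) on the scheme:
    -- CREATE TABLE dbo.some_table ( ... ) ON ps_user_id (user_id)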

    How big are the tables that you are looking to partition, and what are the typical queries that are being run against them? To be reaching the limit of BIGINT is some going, too. I have recently been looking into partitioning some of our larger tables, breaking up a table with ~850 million rows. I have been using the following as reference material:

    http://msdn.microsoft.com/en-us/library/ms345146%28SQL.90%29.aspx

    http://msdn.microsoft.com/en-us/library/dd578580.aspx

    http://www.amazon.co.uk/Microsoft-Server-Administrator-27s-Companion-Package/dp/0735621985/ref=sr_1_1?ie=UTF8&qid=1265813592&sr=8-1

    JQ

  • > How big are the tables that you are looking to partition

    We have non-historising tables with about 130 million rows, growing by about 50,000 rows per day.

    Additionally, we have historising tables with currently 240 million rows, growing by about 100k per day.

    The largest table has about 500 million rows, growing by about 500k rows a day (peaks with more changes are possible).

    > also, what are the typical queries that are being run against them?

    Typical queries are:

    SELECT a.user_id, b.another_attribute
    FROM table_a AS a
    INNER JOIN table_b AS b ON (a.user_id = b.user_id)
    INNER JOIN table_c AS c ON (a.user_id = c.user_id AND c.valid_to IS NULL) -- current entry in historised table
    WHERE a.reg_date BETWEEN @date1 AND @date2

    or, with the join to table_c instead matching the entry valid on a given due date:

    INNER JOIN table_c AS c ON (a.user_id = c.user_id AND c.valid_from < @date1 AND (c.valid_to IS NULL OR c.valid_to > @date2)) -- entry valid at a due date in historised table

    or a GROUP BY over many attributes following the same pattern.

    > To be reaching the limit of BIGINT is some going, too.

    I tried to create a unique partition key, so I added portal, valid_from and user_id together.

    The largest user_id we currently have is 11 digits long. A BIGINT tops out at 9,223,372,036,854,775,807, i.e. 19 digits, so once I appended an 8-digit date and a 2-digit portal_id we were already past it 😉

    As most joins use the user_id, I figured that the user_id should be part of the clustering key.

    So let's pretend that we use the user_id as the partition key.

    For historising tables we have to add the valid_from (which is unique together with a user_id). Using two columns as the partition key means we have to compute a single value, so we build: 11-digit user_id (* 100,000,000) + 8-digit date (a rough sketch follows at the end of this post).

    E.g.

    The user 10001 has an entry for 2010-01-01, as does the user 10002.

    Now when I insert a new row for user 10001 on 2010-01-02, I will most likely cause a page split, as the data has to be inserted before the 10002 user's data.

    I really don't mind that, and we defragment our indexes every day, so while defragmenting the clustered index the table would be fixed as well.

    But I thought that is a big DON'T, as keys should be not only "unique" but also "ever increasing".
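    For illustration, the combined key could be computed like this (a sketch; the values are placeholders):

    DECLARE @user_id BIGINT, @valid_from DATETIME;
    SET @user_id = 10000000001;    -- an 11-digit user_id
    SET @valid_from = '20100102';

    -- shift the user_id left by 8 digits and append the date as YYYYMMDD
    SELECT @user_id * 100000000
         + CONVERT(BIGINT, CONVERT(CHAR(8), @valid_from, 112)) AS partition_key;
    -- result: 1000000000120100102 (19 digits), already close to the
    -- BIGINT maximum of 9223372036854775807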
