PK question

  • I have a table where I store comments for user users. I will have 100 Million+ comments.

    2 ways I can create it:

    Option 1: user name and comment id as PK. That way all comments are stored physically by user name and comment id.

    CREATE TABLE [dbo].[channel_comments](

    [varchar](20) NOT NULL,

    [com_id] [int] IDENTITY(1,1) NOT NULL,

    [com_posted_by] [varchar](20) NOT NULL,

    [com_posted_on] [smalldatetime] NOT NULL CONSTRAINT DEFAULT (getdate()),

    [com_text] [nvarchar](225) COLLATE NOT NULL,

    CONSTRAINT [PK_channel_comments] PRIMARY KEY CLUSTERED

    ([channel] ASC, [com_id] ASC) WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]) ON [PRIMARY]

    Pros: My query will be get all or top 10 comments for a user order by comment_id DESC. This is SEEK

    Option 2: I can make the comment id as the PK. That will store the comments sorted by the comment id, not user name.

    Cons: Getting latest top 10 comments of a given user is not a seek anymore as data not stored by user (ie. not sorted by user). So I have to create other index to improve the query performance.

    Which way is best way to proceed?

    How about insertion and deletion? These operations are allowed. But read is frequent.

    User can't modify their comments.

  • Your Primary Key should uniquely identify each row. I think you are confusing Primary Key with Clustered Index. You are using the IDENTITY property for your com_id which will always be unique.

    At a first glance, I'd go with the single column PK (Clustered) on the surrogate key com_id and add a non-clustered index on User.

    How many users are in the table? How many avg comments per user? What is the total row count on your table?

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • John Rowan (10/21/2010)


    Your Primary Key should uniquely identify each row. I think you are confusing Primary Key with Clustered Index. You are using the IDENTITY property for your com_id which will always be unique.

    At a first glance, I'd go with the single column PK (Clustered) on the surrogate key com_id and add a non-clustered index on User.

    How many users are in the table? How many avg comments per user? What is the total row count on your table?

    Confusing PK:

    You partly correct. The com_id is not required. I added it so that I find the comment easily during deletion

    I could just use the data column with the user name to find the comment.

    Number of users: 100M (in the long term)

    Avg comments: 200 (guess)

    Total comments: 2 Billions

    I tested both scenario here is the result:

    comments1: PK: [user_name, com_id] no other indices

    comments2: PK: com_id Second Index on (user_name, com_id)

    table_name rows reserved data index_size unused

    comments21079892 99488 KB62824 KB36576 KB88 KB

    comments11079892 82376 KB82040 KB328 KB 8 KB

    So the table with com_id as PK is using 36MB extra disk space just for the 2 index

    Also the select query on both table using SEEK, but table with com_id as PK is slower

    But insertion is slightly faster when I have com_id as PK

  • What does your SELECT statement look like?

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • select top 10 * from comments1 where user='bob' order by com_id DESC

  • I ran a test where I populated your table with 60 million rows. I made the com_id column the Primary Key and clustered it. I added a non-clustered index on User (you don't need to add com_id as a second column on the non-clustered index as non-clusterd indexes already include the clustered index key).

    I ran your query of SELECT TOP 10 * FROM dbo.channel_comments WHERE [User] = 'User3' ORDER BY com_id

    The execution plan shows a clustered index scan and returned results in under a second. This tells us that you don't even need the non-clustered index on the User column. The optimized chooses the clustered index because of the use of the com_id in the ORDER BY clause.

    I would recommend using the clustered PK on com_id and that's it. You can add non-clustered indexes on other columns as needed to service other queries.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • John Rowan (10/22/2010)The execution plan shows a clustered index scan and returned results in under a second. This tells us that you don't even need the non-clustered index on the User column. The optimized chooses the clustered index because of the use of the com_id in the ORDER BY clause.

    During your test, how was cardinality on USER column John?

    I think optimizer didn't choose index on this column because of data distribution. An extreme example would be all 60M rows having the same value in USER column, in such a case optimizer would not use the index; on the other extreme if you have 60M distinct USER values in the table optimizer will certainly use the index.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (10/22/2010)


    John Rowan (10/22/2010)The execution plan shows a clustered index scan and returned results in under a second. This tells us that you don't even need the non-clustered index on the User column. The optimized chooses the clustered index because of the use of the com_id in the ORDER BY clause.

    During your test, how was cardinality on USER column John?

    I think optimizer didn't choose index on this column because of data distribution. An extreme example would be all 60M rows having the same value in USER column, in such a case optimizer would not use the index; on the other extreme if you have 60M distinct USER values in the table optimizer will certainly use the index.

    Your right Paul. My first test only contained 6 users. I re-ran the test and created the 60 million + rows over 600 users. In this case, the optimizer chose to use the non-clustered index on the User column.

    So I guess, back to my original suggestion......clustered PK on com_id, non-clustered index on User.

    Thanks for keeping me honest Paul.:-)

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • John Rowan (10/22/2010)


    PaulB-TheOneAndOnly (10/22/2010)


    John Rowan (10/22/2010)The execution plan shows a clustered index scan and returned results in under a second. This tells us that you don't even need the non-clustered index on the User column. The optimized chooses the clustered index because of the use of the com_id in the ORDER BY clause.

    During your test, how was cardinality on USER column John?

    I think optimizer didn't choose index on this column because of data distribution. An extreme example would be all 60M rows having the same value in USER column, in such a case optimizer would not use the index; on the other extreme if you have 60M distinct USER values in the table optimizer will certainly use the index.

    Your right Paul. My first test only contained 6 users. I re-ran the test and created the 60 million + rows over 600 users. In this case, the optimizer chose to use the non-clustered index on the User column.

    So I guess, back to my original suggestion......clustered PK on com_id, non-clustered index on User.

    Thanks for keeping me honest Paul.:-)

    Thank you for re-running the test John; that was very kind of you.

    I fully agree with your recommendation. 🙂

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • I've had very good success using the surrogate identity key as the clustering key in this type of table as long as the other index is selective enough. The key lookups don't cost all that much.

    Todd Fifield

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply