Large dataset in one table or several small tables

  • Hello.

    Here is my question: Is it better to store a "large" set of data in one table or break it across smaller tables, each with the same schema?

    Here is the background: In my client's words there are currently "millions-soon-to-be-billions of records", and they are expecting "thousands of read and write hits per day". The data table is flat, has 33 columns in total, and an expected row size of 510 bytes (estimated per the directions at http://msdn.microsoft.com/en-us/library/ms189124.aspx). 5 of the 33 columns are used to identify a particular 'real-world entity' in the table (this is where my vocabulary may trip me up - hopefully I am using 'entity' correctly). A single real-world entity will likely have more than one row of data associated with it (imagine a single garden plot with many crops planted in it: the 'entity' is the garden plot; the data are the various crops planted in the garden).

    The schema for the 5 columns used to identify an entity is listed below (I left out the remaining 28 columns for brevity; I can include them if necessary). The remaining fields are predominantly varchar, with a couple of ints and dates in the mix as well.

    CREATE TABLE [dbo].[TestTable](
        [field1] [varchar](10) NOT NULL,
        [field2] [varchar](7) NOT NULL,
        [field3] [varchar](7) NOT NULL,
        [field4] [varchar](10) NOT NULL,
        [field5] [date] NOT NULL
    )

    I have recommended that the client keep the data in a single table, indexed on the 5 identifying columns. The data are specific to a US State/County combination, so the client is leaning towards a "per state" approach, i.e. one table for each of the 50 states, or (really crazy, I think) a "per county" approach with 3000+ tables, one for each US county.
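
    For illustration only, the single-table indexing I have in mind might look roughly like the sketch below (the key order is just a guess on my part; the actual order would depend on how the data is queried).

    -- Sketch only: a clustered index led by the 5 identifying columns.
    -- The column order shown here is an assumption, not a final design.
    CREATE CLUSTERED INDEX [CIX_TestTable_Entity]
        ON [dbo].[TestTable] ([field1], [field2], [field3], [field4], [field5]);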

    The business logic that has me concerned, or in doubt, about my recommendation is this: when a single row for a given real-world entity is updated, all rows associated with that entity are deleted, and the complete set of data for that entity is re-inserted. My concern is that potentially 'thousands of writes' per day would negatively impact the table's index, but I am figuring that regular index rebuilds done as part of a maintenance plan would take care of any fragmentation issues (if that is even an issue).
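
    To make that concrete, the delete-then-insert pattern would look something like the sketch below (the @f1-@f5 variables and the @NewRows table variable are placeholders for illustration, not the actual app code).

    -- Sketch only: replace the complete row set for one real-world entity.
    -- @f1..@f5 hold the 5 identifying values; @NewRows is a hypothetical
    -- source holding the full replacement set of rows for that entity.
    BEGIN TRANSACTION;

    DELETE FROM [dbo].[TestTable]
    WHERE [field1] = @f1 AND [field2] = @f2 AND [field3] = @f3
      AND [field4] = @f4 AND [field5] = @f5;

    INSERT INTO [dbo].[TestTable]
        ([field1], [field2], [field3], [field4], [field5] /* ...plus the other 28 columns */)
    SELECT [field1], [field2], [field3], [field4], [field5] /* ... */
    FROM @NewRows;

    COMMIT TRANSACTION;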

    What do you all think - one 'large' table or several/many smaller tables?

    Thank you

  • I think one table will be fine, with some caveats. You have the 5 columns that identify an "entity". You also say that "entity" can have more than one row. My question would be, what do you have to uniquely identify a row? THAT will be a key to performance for any of your non-clustered indexes.
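
    Just to illustrate what I'm asking about, something as simple as a surrogate column would do it (the RowID name below is made up for the example).

    -- Illustration only: a surrogate column that makes every row unique.
    ALTER TABLE [dbo].[TestTable]
        ADD [RowID] bigint IDENTITY(1,1) NOT NULL;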

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff.

    Thank you for the fast reply. The business logic dictates that we would never have to identify a single row, per se. Using the 'garden' analogy I set up in the question, the app would always want every crop (row) associated with the garden, not just a particular crop in the garden. In most cases, each query against the table would use all 5 identifying columns, and all columns in the matching rows would be returned. There is one other query that would use 3 of the 5 identifying columns; for that query, a subset of the columns from the matching rows would be returned.
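
    Roughly speaking, the two query shapes look like the sketches below (which 3 of the 5 columns the second query filters on, and which columns it returns, are simplified placeholders here).

    -- Query shape 1: fetch every row for one entity, filtering on all 5 identifying columns.
    SELECT *
    FROM [dbo].[TestTable]
    WHERE [field1] = @f1 AND [field2] = @f2 AND [field3] = @f3
      AND [field4] = @f4 AND [field5] = @f5;

    -- Query shape 2: a broader lookup on 3 of the 5 identifying columns,
    -- returning only a subset of the columns.
    SELECT [field1], [field2], [field3], [field5]
    FROM [dbo].[TestTable]
    WHERE [field1] = @f1 AND [field2] = @f2 AND [field3] = @f3;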

    Your question does bring up another question for me: how critical would it be to add a primary key in the one large table approach?

    Thanks again.
