Denormalisation and the other option.

  • I recently got a requirement to design a database for capturing sales and sector competitor data. This data would be used primarily for reports and would be updated once or twice a month. Since it is a reporting module, no more than 5 to 10 people would be using it at a time (10 being slightly rare).

    We are to integrate this with our existing product, and there are plans to “productise” it. The orders from the top brass are to keep it as generalized as possible so that adapting it to other requirements would be relatively easy.

    I had proposed a slightly de-normalized schema, and someone else produced the following, on which I would like to have your comments.

    This is what was proposed.

    Table 1

    AttributeID   Section   Subsection     Attribute     Data
    1             Order     OrderDetails   OrderID       <SOMEDATE>
    2             Order     OrderDetails   Quantity
    3             Order     OrderDetails   ProductID
    4             Order     OrderDetails   SalesPerson
    5             State     City           Name
    6             State     NULL

    Table 2

    SrNo (Unique ID)   AttributeID   Value
    1                  5             State1
    2                  5             State2
    3                  6             City1
    4                  6             City2
    5                  6             City3
    6                  1             101
    7                  2             500
    8                  3             P1
    9                  4             Andrew N
    10                 6             City3
    11                 6             City4

    This is the way in which the entire schema will be represented; there will be no physical mapping of the logical schema shown in Table 1.

    Hence Table 2 will contain all the data for all the tables, which are represented only logically in Table 1. I foresee no more than 0.5 million records at its peak (in about 8 years), so it would grow by roughly 80-90K records per year.
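
    Just to make the physical side explicit, the only real objects would be something like this (a rough sketch; the data types are my guesses, not part of the proposal):

    CREATE TABLE Table1                 -- the "logical schema" / metadata table
    (
        AttributeID int          NOT NULL PRIMARY KEY,
        Section     varchar(50)  NOT NULL,
        Subsection  varchar(50)  NULL,
        Attribute   varchar(50)  NULL,
        Data        varchar(50)  NULL
    );

    CREATE TABLE Table2                 -- the single table holding all the values
    (
        SrNo        int          NOT NULL PRIMARY KEY,   -- unique ID
        AttributeID int          NOT NULL,               -- points at Table1
        [Value]     varchar(100) NULL
    );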

     

    Queries would first be made against Table 1 to pick up the schema, and then Table 2 would be queried for the actual values.
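
    In other words, the access pattern would be something like this (a sketch only, against the hypothetical Table1/Table2 above):

    -- Step 1: pick up the logical schema for a section from Table1
    SELECT AttributeID, Attribute
    FROM Table1
    WHERE Section = 'Order'
      AND Subsection = 'OrderDetails';

    -- Step 2: fetch the actual values for those attributes from Table2
    SELECT t2.SrNo, t1.Attribute, t2.[Value]
    FROM Table2 AS t2
    JOIN Table1 AS t1 ON t1.AttributeID = t2.AttributeID
    WHERE t1.Section = 'Order'
      AND t1.Subsection = 'OrderDetails';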

     

    I am at a loss for words, hence asking for your comments on the same.


    What I hear I forget, what I see I remember, what I do I understand

  • Read http://www.sqlservercentral.com/columnists/dpeterson/lookuptablemadness.asp

    If you use foreign keys then I can see the potential for performance issues when it comes to deleting records.

    If you look at the execution plan of a delete query when DRI is declared, you will notice that it searches both tables to detect whether the insert, update or delete is valid. Inserts and updates tend not to be a problem because they tend to be single-record operations.

    Deletes on the other hand tend to be large archiving processes so what you will be doing (assuming that DRI is declared) will be a query on two quite large tables.

    If you are going to use this sort of schema then your aim should be to limit its use to attributes that have a small number of values. To clarify, you wouldn't want attributes with several thousand values.
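
    To illustrate the delete point (a sketch only, using the hypothetical Table1/Table2 names from the first post and assuming a foreign key is declared between them):

    -- Declared referential integrity between the value rows and the attribute list
    ALTER TABLE Table2
        ADD CONSTRAINT FK_Table2_Table1
        FOREIGN KEY (AttributeID) REFERENCES Table1 (AttributeID);

    -- Delete from the referenced table and look at the actual execution plan:
    -- Table2 is probed as well, to verify that no rows still reference the
    -- attributes being removed.
    DELETE FROM Table1
    WHERE Section = 'Order';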

  • I like "slightly denormalized" data for reporting, if it's a secondary data store and not a primary one.

    I agree with David that this can become cumbersome, not only for deletes, but for simple daily troubleshooting and querying.

  • Just to add a bit more: this is called an open-schema system. It can be very flexible if you are looking at extensibility, but for data integrity it could become a royal pain. In my opinion you will really need to allow DML through stored procedures only, and those procedures will have to be tailored VERY carefully to guarantee data quality, at least through code!
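
    For example, something along these lines (a rough sketch against the hypothetical Table1/Table2 from the first post; the procedure name and the checks are made up):

    CREATE PROCEDURE dbo.usp_AddAttributeValue
        @AttributeID int,
        @Value       varchar(100)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- The schema itself cannot enforce much, so the checks have to live here.
        IF NOT EXISTS (SELECT 1 FROM Table1 WHERE AttributeID = @AttributeID)
        BEGIN
            RAISERROR('Unknown AttributeID.', 16, 1);
            RETURN;
        END;

        -- Per-attribute rules (data type, range, allowed values, ...) would also
        -- have to be coded and maintained here.

        -- Assumes SrNo is an IDENTITY column (an assumption; the post does not say).
        INSERT INTO Table2 (AttributeID, [Value])
        VALUES (@AttributeID, @Value);
    END;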

    Just my $0.02 


    * Noel

  • "Use the Force, Luke!"

    I would like to see YOUR version that is "slightly denormalized"... Having been on both ends... a user/manager whispering in my ear "let's eliminate all those messy joins and put this all in one big table" vs. the architect saying "it's only seven tables and ___ can handle that easily"....

    Denormalization is fine when justified; but make sure it is justified. Consider each element and how it is being used. The management orders of "generalization" and "integration" also mean there is a lot more to be considered. You must also consider what you are integrating with... and if it were to become a product, whether you would be proud for anyone to see it.

    I do not know how skilled you are with this; but I suspect this is a pretty new experience for you or you would not be lost for words.  There is no substitute for analysis on something like this and understanding exactly how your data is used and why you are putting it "there"...

    Okay... off my soapbox... good luck to you... and err, Happy New Year!!!


    Cheers,

    david russell

  • Thanks for all the replies.

    The structure which I had proposed (part of it) is as below.

    Order

    OrderID   SalesPersonId   Date


    Order Details

    OrderID   ProductID   Quantity


    Region

    RegionID   RegionName   Parent
    1          State1       0
    2          State2       0
    3          City1        State1
    4          City2        State1
    5          City3        State2
    6          City4        State3
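
    With this, a city and its state come back in one row through a plain self-join (a sketch only; it assumes Parent holds the parent's RegionName, as in the sample rows above):

    SELECT c.RegionID,
           c.RegionName AS City,
           p.RegionName AS [State]
    FROM Region AS c
    JOIN Region AS p ON p.RegionName = c.Parent
    WHERE c.RegionName = 'City3';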

    When one considers solution 1 (mentioned in the first post), the complexity lies in the joins required with external tables for data retrieval.

    Not to mention the fact that every insert, update or delete will have to be broken down into multiple rows rather than a single row. Even a simple select which would otherwise be returned in a single-row format would be returned in a multiple-row format.

    E.g. the single row

    6    City4    State3

    will be returned as

    6
    City4
    State3

    Since that’s how the data is stored for the solution in my original post.

    One will have to do additional processing to convert the rows into columns, which would obviously be needed to display the data to the user in the right format.
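
    The kind of processing I mean would look roughly like this (a sketch only; it assumes, hypothetically, that Table2 also carried some EntityID column tying together the attribute rows that make up one order, which the proposed design does not even provide):

    SELECT v.EntityID,
           MAX(CASE WHEN a.Attribute = 'OrderID'     THEN v.[Value] END) AS OrderID,
           MAX(CASE WHEN a.Attribute = 'ProductID'   THEN v.[Value] END) AS ProductID,
           MAX(CASE WHEN a.Attribute = 'Quantity'    THEN v.[Value] END) AS Quantity,
           MAX(CASE WHEN a.Attribute = 'SalesPerson' THEN v.[Value] END) AS SalesPerson
    FROM Table2 AS v
    JOIN Table1 AS a ON a.AttributeID = v.AttributeID
    WHERE a.Section = 'Order'
    GROUP BY v.EntityID;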

     

    And though I am a little new to the concept, I am not totally alien to designing databases.

     

    Wish you all a Happy New Year.


    What I hear I forget, what I see I remember, what I do I understand
