Large Table Design Best Practices

  • That column cannot be made NOT NULL in the design because you can't enter that data when the first part is created. How do you enforce the NOT NULL requirement? Remember that the OP stated that the data is entered in stages, not all at once. If it were a single point of entry that captured every piece of data, then I absolutely would not suggest that splitting this would be a good idea. But given the multi-step process involved here, that isn't the case. This means you have to allow NULL in all the columns and then check for their existence later from the application side. That sounds like a nightmare to me.

    This is not a concrete example, and you must have some good designs if you think this is a "nightmare". First, I don't have issues with NULLable fields. I'm more strict on the use of empty strings than most, but with dates that's not possible. If the data is divided you would have to do LEFT JOINs and test the outer side for NULL. That to me seems unnecessarily difficult for no apparent reason. That is hardly a nightmare, but it's not good.
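
    For illustration only (table and column names here are hypothetical; InsuranceCarrierID is borrowed from later in this thread), the split design forces exactly that kind of LEFT JOIN plus NULL test, while a single wide table needs only a plain predicate:

    -- Hypothetical split design: the data captured at stage 3 lives in its own table.
    SELECT s1.LoanID
    FROM dbo.LoanStage1 AS s1
    LEFT JOIN dbo.LoanStage3 AS s3
        ON s3.LoanID = s1.LoanID
    WHERE s3.InsuranceCarrierID IS NULL;   -- no stage-3 row yet, or the value is still missing

    -- Hypothetical single-table design: the same check is just a predicate on the nullable column.
    SELECT LoanID
    FROM dbo.LoanApplication
    WHERE InsuranceCarrierID IS NULL;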

  • RonKyle (3/3/2016)


    That column cannot be made NOT NULL in the design because you can't enter that data when the first part is created. How do you enforce the NOT NULL requirement? Remember that the OP stated that the data is entered in stages, not all at once. If it were a single point of entry that captured every piece of data, then I absolutely would not suggest that splitting this would be a good idea. But given the multi-step process involved here, that isn't the case. This means you have to allow NULL in all the columns and then check for their existence later from the application side. That sounds like a nightmare to me.

    This is not a concrete example, and you must have some good designs if you think this is a "nightmare". First, I don't have issues with NULLable fields. I'm more strict on the use of empty strings than most, but with dates that's not possible. If the data is divided you would have to do LEFT JOINs and test the outer side for NULL. That to me seems unnecessarily difficult for no apparent reason. That is hardly a nightmare, but it's not good.

    I don't have issues with NULLable columns either unless the business rules state that a value is required in another piece of code for processing. I am NOT ok with required columns being NULLable in the schema. That is a slippery slope to say the least. I disagree that a LEFT JOIN is difficult but whatever. Obviously we will just have to agree to disagree on this one. Ultimately it is the decision of the OP. 😛

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Rereading the original post, this data should be normalized. It's clearly relational based on the example. It's collected in a series of stages. Some of the data is required, but not at any one particular stage. It's nuts to shove it all into a single table.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I have to agree with Sean and Grant. You can get great performance out of a normalized structure and the integrity is there in the design. Moving the integrity to the application means that the application must cover all rules, both now and in the future. If someone happens to go into SSMS and change something, all bets are off. Enforcing integrity in the design matters. Data types, foreign keys, nullability, etc. all matter. Without data integrity, performance doesn't matter.

    A good example of a use for a "side table" (1:1 relationship) is where you have wide comment columns. If most of what you're doing in the main table is analyzing numbers, and you need the comment columns 5% of the time, you're going to get a lot more rows per page in the main table by moving the comments to a side table. Your reads against the main table are going to be much lower because you get more rows per page and therefore need to read fewer pages. So you save big on 95% of your queries and join the comments table 5% of the time.
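
    A minimal sketch of that side-table pattern, with invented table and column names:

    -- Main table stays narrow, so more rows fit per page.
    CREATE TABLE dbo.Measurement
    (
        MeasurementID int           NOT NULL PRIMARY KEY,
        MeasuredValue decimal(18,4) NOT NULL,
        MeasuredOn    datetime      NOT NULL
    );

    -- Wide, rarely-read comment text moves to a 1:1 side table keyed on the same primary key.
    CREATE TABLE dbo.MeasurementComment
    (
        MeasurementID int            NOT NULL PRIMARY KEY REFERENCES dbo.Measurement (MeasurementID),
        CommentText   nvarchar(4000) NOT NULL
    );

    -- The minority of queries that need the comments simply join the side table back in.
    SELECT m.MeasurementID, m.MeasuredValue, c.CommentText
    FROM dbo.Measurement AS m
    LEFT JOIN dbo.MeasurementComment AS c
        ON c.MeasurementID = m.MeasurementID;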

    I have to agree with Sean and Grant. You can get great performance out of a normalized structure and the integrity is there in the design. Moving the integrity to the application means that the application must cover all rules, both now and in the future. If someone happens to go into SSMS and change something, all bets are off. Enforcing integrity in the design matters. Data types, foreign keys, nullability, etc. all matter. Without data integrity, performance doesn't matter.

    A good example of a use for a "side table" (1:1 relationship) is where you have wide comment columns. If most of what you're doing in the main table is analyzing numbers, and you need the comment columns 5% of the time, you're going to get a lot more rows per page in the main table by moving the comments to a side table. Your reads against the main table are going to be much lower because you get more rows per page and therefore need to read fewer pages. So you save big on 95% of your queries and join the comments table 5% of the time.

    None of this is in dispute. Even your example is similar to the example that I gave about a one-to-one relationship. An employee can't have more than one company car, but most employees do not have a company car.

    An example:

    Work Order No (PK)
    Creation Date
    Work Order Status
    Technician Assigned
    Scheduled On Date
    Scheduled For Date
    Closed Date

    This is a simplified version of an accumulating snapshot table. It's easy to identify three different points at which data would be entered. What is unnormalized here? You all seem to advocate that to make it normalized the table would have to be split into three. Which of the normal forms does this violate? I have worked with designs with a single table and a design (not designed by me) where they are split out, and my experience is that the split design creates unnecessary issues.

    The original example is an accumulating snapshot example. If the data has a 1:1 or 1:most relationship, and not a 1:few or 1:many, it should be in one table. To say that this is unnormalized is (in the absence of a clear answer to the above) an incorrect comment.
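
    A rough DDL sketch of the single-table version described above (data types are assumed, not taken from the original post):

    CREATE TABLE dbo.WorkOrder
    (
        WorkOrderNo        int         NOT NULL PRIMARY KEY,
        CreationDate       datetime    NOT NULL,    -- known when the order is created
        WorkOrderStatus    varchar(20) NOT NULL,
        TechnicianAssigned varchar(50) NULL,        -- filled in when the order is scheduled
        ScheduledOnDate    date        NULL,
        ScheduledForDate   date        NULL,
        ClosedDate         date        NULL         -- filled in when the order is closed
    );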

  • RonKyle (3/4/2016)


    I have to agree with Sean and Grant. You can get great performance out of a normalized structure and the integrity is there in the design. Moving the integrity to the application means that the application must cover all rules, both now and in the future. If someone happens to go into SSMS and change something, all bets are off. Enforcing integrity in the design matters. Data types, foreign keys, nullability, etc. all matter. Without data integrity, performance doesn't matter.

    A good example of a use for a "side table" (1:1 relationship) is where you have wide comment columns. If most of what you're doing in the main table is analyzing numbers, and you need the comment columns 5% of the time, you're going to get a lot more rows per page in the main table by moving the comments to a side table. Your reads against the main table are going to be much lower because you get more rows per page and therefore need to read fewer pages. So you save big on 95% of your queries and join the comments table 5% of the time.

    None of this is in dispute. Even your example is similar to the example that I gave about a one-to-one relationship. An employee can't have more than one company car, but most employees do not have a company car.

    An example:

    Work Order No (PK)
    Creation Date
    Work Order Status
    Technician Assigned
    Scheduled On Date
    Scheduled For Date
    Closed Date

    This is a simplified version of an accumulating snapshot table. It's easy to identify three different points at which data would be entered. What is unnormalized here? You all seem to advocate that to make it normalized the table would have to be split into three. Which of the normal forms does this violate? I have worked with designs with a single table and a design (not designed by me) where they are split out, and my experience is that the split design creates unnecessary issues.

    The original example is an accumulating snapshot example. If the data has a 1:1 or 1:most relationship, and not a 1:few or 1:many, it should be in one table. To say that this is unnormalized is (in the absence of a clear answer to the above) an incorrect comment.

    For your example, only 'Work Order Status' and 'Technician Assigned' could be normalized out in a standard structure. My question is simple. Why not? One of the many purposes of normalization is to reduce the data footprint and increase the accuracy of the data by ensuring that the same values are used over and over. If the status is supposed to be 'Closed' why not normalize this to have a lookup table that ensures that the value 'Closed' is used for 'Closed' each and every time the data is entered. Downstream reporting is always going to work that way. Without the referential integrity, a single 'Close' or 'C' or any other value put in there invalidates the data. Now, this example isn't that important in the grand scheme of things... unless we get paid for every 'Closed' value. If we don't enforce the integrity of the data, we could literally be losing money because of bad data.
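
    A minimal sketch of that lookup-table idea, extending the kind of single-table WorkOrder sketched above (all names are illustrative; it assumes a dbo.WorkOrder table with a varchar(20) WorkOrderStatus column):

    CREATE TABLE dbo.WorkOrderStatus
    (
        StatusCode varchar(20) NOT NULL PRIMARY KEY,   -- e.g. 'Open', 'Scheduled', 'Closed'
        StatusName varchar(50) NOT NULL
    );

    -- Point the work order at the lookup table; a stray 'Close' or 'C' is now rejected by the engine.
    ALTER TABLE dbo.WorkOrder
        ADD CONSTRAINT FK_WorkOrder_Status
        FOREIGN KEY (WorkOrderStatus) REFERENCES dbo.WorkOrderStatus (StatusCode);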

    I just don't see what you're gaining by duplicating data storage and eliminating enforced data integrity.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I see a lot of discussion about whether or not there are normalization errors in the design you posted. There might or might not be, but that's something only you can find out, by preparing a nice set of examples for your domain expert to go through and confirm or deny what you currently assume the functional dependencies to be. I do understand why some people post that they think your design is not fully normalized.

    However, I will address the question you are actually asking, which is: if I have a huge number of attributes that are all functionally dependent on the same candidate key, should they go into a single table or should they be split? Especially if there is a timeline for when that data becomes available.

    From a pure normalization perspective, both are okay (well, except when you consider sixth normal form, which allows at most one non-key attribute per table - but that's not a normal form I recommend; I usually stop at fifth).

    From a performance perspective, there can be a difference. For instance, if you implement in SQL Server, then a single table with hundreds of columns takes more disk space per row, so you need to read more pages when you report on just a few columns than you would if the alternative model kept just those columns in a narrower table of their own. On the other hand, if you need to report on columns that are distributed, you'll have to join the data back together, which incurs processing overhead.

    You of course already addressed that some constraints are easier to enforce with a split-table design. If the "Loan Program ID" attribute is mandatory for any loan in a stage where the Loan Terms are relevant, you can make it a NOT NULL column in a Loan Terms table; in a single table that would instead become a CHECK constraint, and you would also need a clause in that CHECK constraint to prevent people from entering it before the Loan Terms phase applies.
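
    Sketched out (table and column names are assumptions based on this post, not the OP's actual schema), the two alternatives look like this:

    -- Alternative 1: single wide table, with a stage marker tied to the nullable column by a CHECK constraint.
    CREATE TABLE dbo.Loan
    (
        LoanID        int     NOT NULL PRIMARY KEY,
        LoanStage     tinyint NOT NULL,
        LoanProgramID int     NULL,
        CONSTRAINT CK_Loan_LoanProgramID CHECK
        (
            (LoanStage <  3 AND LoanProgramID IS NULL)
         OR (LoanStage >= 3 AND LoanProgramID IS NOT NULL)
        )
    );

    -- Alternative 2: split design, where the same rule is an ordinary NOT NULL column in a stage-specific table.
    CREATE TABLE dbo.LoanHeader
    (
        LoanID int NOT NULL PRIMARY KEY
    );

    CREATE TABLE dbo.LoanTerms
    (
        LoanID        int NOT NULL PRIMARY KEY REFERENCES dbo.LoanHeader (LoanID),
        LoanProgramID int NOT NULL   -- enforced the moment a Loan Terms row exists
    );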

    No clear-cut advice from me, just some food for thought. I hope this puts you on the right track, and enables you to choose what works best in your situation. Good luck!


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Sean Lange (3/3/2016)


    Shall I repeat the concrete example yet again? You have a column in step 3 that is required. That column cannot be made NOT NULL in the design because you can't enter that data when the first part is created. How do you enforce the NOT NULL requirement?

    CHECK ( (StepNo < 3 AND ThatColumn IS NULL) OR (StepNo >= 3 AND ThatColumn IS NOT NULL) )



  • Hugo Kornelis (3/4/2016)


    Sean Lange (3/3/2016)


    Shall I repeat the concrete example yet again? You have a column in step 3 that is required. That column cannot be made NOT NULL in the design because you can't enter that data when the first part is created. How do you enforce the NOT NULL requirement?

    CHECK ( (StepNo < 3 AND ThatColumn IS NULL) OR (StepNo >= 3 AND ThatColumn IS NOT NULL) )

    This would work if a new column was added to indicate which step in the process this is. Of course something along those lines would be a good idea regardless of the normalization.


  • For your example, only 'Work Order Status' and 'Technician Assigned' could be normalized out in a standard structure. My question is simple. Why not? One of the many purposes of normalization is to reduce the data footprint and increase the accuracy of the data by ensuring that the same values are used over and over. If the status is supposed to be 'Closed' why not normalize this to have a lookup table that ensures that the value 'Closed' is used for 'Closed' each and every time the data is entered.

    I think I see some of the issue: we don't agree on the definition and purpose of normalization. Assuming that the single table is a valid structure [and again, without data to analyze, this can't be known with certainty], removing those two columns neither normalizes it nor makes it "more normalized." How does this change accomplish the goals of normalization as you've stated them? I agree to a point with the reasons you gave, though I would say, as a first statement, that normalization is modeling the data design to comply with the normal forms. Doing this accomplishes the reasons you state. Hugo's comments shed additional light on this issue, and I'm in complete agreement with them. (Other than the sixth normal form thing, about which I know nothing. I stop at five, and as a practical matter don't find much use above BCNF.)

    I don't have a Closed column in the simplified design, but a Closed Date. Ideally this would be blank if the order is Open, because the date doesn't exist, as opposed to being unknown. This is not possible with a date data type. But it doesn't warrant placing it in its own table, requiring the primary key to be repeated.

  • Sean Lange (3/4/2016)


    Hugo Kornelis (3/4/2016)


    Sean Lange (3/3/2016)


    Shall I repeat the concrete example yet again? You have a column in step 3 that is required. That column cannot be made NOT NULL in the design because you can't enter that data when the first part is created. How do you enforce the NOT NULL requirement?

    CHECK ( (StepNo < 3 AND ThatColumn IS NULL) OR (StepNo >= 3 AND ThatColumn IS NOT NULL) )

    This would work if a new column was added to indicate which step in the process this is. Of course something along those lines would be a good idea regardless of the normalization.

    Or you could simply deduce the step by choosing one of the columns filled in step 3 to be the one that defines step 3 as being started, and changing the CHECK constraints to look at that column instead:

    CHECK ( (DefiningColumn IS NULL AND ThatColumn IS NULL) OR (DefiningColumn IS NOT NULL AND ThatColumn IS NOT NULL) )

    For the defining column itself, the check constraint would enforce that it cannot be filled *before* step 2's columns are filled.

    And yes, I agree that having an explicit column to show the step (either as a number or as a code) is probably a much better idea.
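
    Spelled out as a sketch (the table and column names below are placeholders, not the OP's schema), the chained version would be:

    CREATE TABLE dbo.WorkItem
    (
        WorkItemID          int NOT NULL PRIMARY KEY,
        Step2DefiningColumn int NULL,
        Step3DefiningColumn int NULL,
        ThatColumn          int NULL,
        -- step 3's required column may only be filled in together with step 3's defining column
        CONSTRAINT CK_WorkItem_ThatColumn CHECK
        (
            (Step3DefiningColumn IS NULL     AND ThatColumn IS NULL)
         OR (Step3DefiningColumn IS NOT NULL AND ThatColumn IS NOT NULL)
        ),
        -- and step 3's defining column may not be filled in before step 2's defining column is
        CONSTRAINT CK_WorkItem_Step3Defining CHECK
        (
            Step3DefiningColumn IS NULL OR Step2DefiningColumn IS NOT NULL
        )
    );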



  • Hugo Kornelis (3/4/2016)


    Sean Lange (3/4/2016)


    Hugo Kornelis (3/4/2016)


    Sean Lange (3/3/2016)


    Shall I repeat the concrete example yet again? You have a column in step 3 that is required. That column cannot be made NOT NULL in the design because you can't enter that data when the first part is created. How do you enforce the NOT NULL requirement?

    CHECK ( (StepNo < 3 AND ThatColumn IS NULL) OR (StepNo >= 3 AND ThatColumn IS NOT NULL) )

    This would work if a new column was added to indicate which step in the process this is. Of course something along those lines would be a good idea regardless of the normalization.

    Or you could simply deduce the step by choosing one of the columns filled in step 3 to be the one that defines step 3 as being started, and changing the CHECK constraints to look at that column instead:

    CHECK ( (DefiningColumn IS NULL AND ThatColumn IS NULL) OR (DefiningColumn IS NOT NULL AND ThatColumn IS NOT NULL) )

    For the defining column itself, the check constraint would enforce that it cannot be filled *before* step 2's columns are filled.

    And yes, I agree that having an explicit column to show the step (either as a number or as a code) is probably a much better idea.

    Agreed. Using the condition of some column(s) to determine which step in the process you are at can really bite you. Especially if the business rules change over time and suddenly the condition changes a little bit and you don't also update the code that checks for it. I think we have all been bitten by that kind of thing at one point or another.


  • Sean Lange (3/4/2016)


    Hugo Kornelis (3/4/2016)


    Sean Lange (3/4/2016)


    Hugo Kornelis (3/4/2016)


    Sean Lange (3/3/2016)


    Shall I repeat the concrete example yet again? You have a column in step 3 that is required. That column cannot be made NOT NULL in the design because you can't enter that data when the first part is created. How do you enforce the NOT NULL requirement?

    CHECK ( (StepNo < 3 AND ThatColumn IS NULL) OR (StepNo >= 3 AND ThatColumn IS NOT NULL) )

    This would work if a new column was added to indicate which step in the process this is. Of course something along those lines would be a good idea regardless of the normalization.

    Or you could simply deduce the step by choosing one of the columns filled in step 3 to be the one that defines step 3 as being started, and changing the CHECK constraints to look at that column instead:

    CHECK ( (DefiningColumn IS NULL AND ThatColumn IS NULL) OR (DefiningColumn IS NOT NULL AND ThatColumn IS NOT NULL) )

    For the defining column itself, the check constraint would enforce that it cannot be filled *before* step 2's columns are filled.

    And yes, I agree that having an explicit column to show the step (either as a number or as a code) is probably a much better idea.

    Agreed. Using the condition of some column(s) to determine which step in the process you are at can really bite you. Especially if the business rules change over time and suddenly the condition changes a little bit and you don't also update the code that checks for it. I think we have all been bitten by that kind of thing at one point or another.

    Oh, come on, Sean. Requirements never change. Business rules always remain constant over time. 😉

  • RonKyle (3/4/2016)


    For your example, only 'Work Order Status' and 'Technician Assigned' could be normalized out in a standard structure. My question is simple. Why not? One of the many purposes of normalization is to reduce the data footprint and increase the accuracy of the data by ensuring that the same values are used over and over. If the status is supposed to be 'Closed' why not normalize this to have a lookup table that ensures that the value 'Closed' is used for 'Closed' each and every time the data is entered.

    I think I see some of the issue: we don't agree on the definition and purpose of normalization.

    Yes, I agree with that. Grant thinks (but let me assure you that he is far from alone in that) that replacing a column "StateName" with "StateCode" and simultaneously introducing a States table is normalization. It is not.

    Simplified example: if I have a table of US citizens with columns SSN (primary key), BornInStateName, LivesInStateName, then that design is perfectly normalized. It may not be smart, for the reasons outlined by Grant in a previous post, but normalization is not about smartness. (And in fact, there are ways other than the one Grant suggested to enforce a single correct spelling, like CHECK constraints and/or dropdowns in the application.)

    The same goes for a data design that has a single table with columns SSN (primary key), BornInStateCode, LivesInStateCode, and no state name stored anywhere - again perhaps not the smartest (although one might argue that everyone in the US knows all the state codes by heart anyway), and again perfectly correct from a normalization point of view.

    And yes, even a design that uses SSN (primary key), BornInStateCode, LivesInStateName does not violate any normalization rules, though I personally think this one runs rings around the other two in the stupid department.

    Normalization does have something to say about SSN (primary key), BornInStateCode, BornInStateName, LivesInStateCode, LivesInStateName. This has two transitive dependencies, which violates Third Normal Form. The process that Grant and others describe as "normalizing" should perhaps actually be called "adding a code or ID column for a long text attribute and then normalizing the result", for that is what they actually do.
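
    As a DDL sketch (data types assumed), the transitively dependent design and its normalized counterpart would be:

    -- Violates 3NF: StateName depends on StateCode, and StateCode is not a key of this table.
    CREATE TABLE dbo.Citizen_Unnormalized
    (
        SSN              char(9)     NOT NULL PRIMARY KEY,
        BornInStateCode  char(2)     NOT NULL,
        BornInStateName  varchar(50) NOT NULL,
        LivesInStateCode char(2)     NOT NULL,
        LivesInStateName varchar(50) NOT NULL
    );

    -- Normalized: each state name is stored once, keyed by its code.
    CREATE TABLE dbo.States
    (
        StateCode char(2)     NOT NULL PRIMARY KEY,
        StateName varchar(50) NOT NULL
    );

    CREATE TABLE dbo.Citizen
    (
        SSN              char(9) NOT NULL PRIMARY KEY,
        BornInStateCode  char(2) NOT NULL REFERENCES dbo.States (StateCode),
        LivesInStateCode char(2) NOT NULL REFERENCES dbo.States (StateCode)
    );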

    And to counter the storm of critique before it even starts, all the above is about the logical data model. If a data model calls for SSN (primary key), BornInStateName, LivesInStateName, I would still add state codes and normalize out the state names if I had to implement this in SQL Server, because I know that this will perform much better and save a lot of disk space. However, I know that there are also relational database engines that store their data in completely different ways and that will not benefit in any way from such an implementation choice. That is why it is important to focus on the information needs of the customer/end user only during the logical modeling phase, and to postpone implementation/optimization decisions to the implementation phase - and to save the original logical design in case it ever needs to be implemented on another architecture.



  • Sean Lange (3/3/2016)


    As the OP stated, this data is created by multiple steps in the process. Let's say for example that InsuranceCarrierID would be required for this to be valid, but that information isn't captured until step 3. This means you have to allow NULL in the database because when the row is created it can't be known. However, logically it can't be NULL. This means you have to include business rules in the application to enforce that the value be supplied in that step. As I said, this is a double-edged sword with neither side being 100% correct.

    I have been impressed with the quality of your numerous contributions to SQLServerCentral, so I am interested in your opinion on a database design concept: allow or ban any persisted NULL. This has spawned a lengthy debate on the following thread (I do not remember seeing you or Jeff Moden there):

    Database Design Follies: NULL vs. NOT NULL - http://www.sqlservercentral.com/Forums/Topic1629695-3537-1.aspx

    Some contributors are advocating using a 1:1 table for each column that might or might not be NULL, to avoid persisting a NULL in the database.

    The gist of it is that if there is no specific value that can be entered for a column, then there is no record in the side table reserved for that column.

    The query still returns NULL when the LEFT OUTER JOIN to the side table cannot find a matching record, but the NULL is not persisted in an actual table.

    I think the huge complexity this would spawn outweighs the theoretical benefit of a "pure" database design.
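
    For what it's worth, the pattern being advocated there looks roughly like this (a sketch with invented names, not taken from that thread):

    CREATE TABLE dbo.Employee
    (
        EmployeeID int         NOT NULL PRIMARY KEY,
        LastName   varchar(50) NOT NULL
    );

    -- The optional attribute gets its own 1:1 side table; if the value is unknown, there is simply no row.
    CREATE TABLE dbo.EmployeeTerminationDate
    (
        EmployeeID      int  NOT NULL PRIMARY KEY REFERENCES dbo.Employee (EmployeeID),
        TerminationDate date NOT NULL
    );

    -- Queries still surface NULL for employees without a row, but no NULL is ever stored.
    SELECT e.EmployeeID, e.LastName, t.TerminationDate
    FROM dbo.Employee AS e
    LEFT OUTER JOIN dbo.EmployeeTerminationDate AS t
        ON t.EmployeeID = e.EmployeeID;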
