Large Table Design Best Practices

  • RonKyle (3/4/2016)


    I don't have a Closed column in the simplified design, but a Closed Date. Ideally this would be blank if the order is Open because the date doesn't exist, as opposed to being unknown. This is not possible with a date data type. But it doesn't warrant placing it in it's own table, requiring the primary key to be repeated.

    The date and datetime data types can be null. It adds additional logic as it always does with NULL values, regardless of data type. I'm not sure what you're saying here.

    I guess I don't understand at all where you're coming from in terms of normalization. My suggestion for lookup values is how I've been building databases for going on 30 years without it being a surprise to any of the other DBAs, architects and developers I've worked with. Also, I don't understand on your need for data prior to designing your tables. I've built many systems where we had no data, at all, just an idea for what we wanted the system to do. We were able to build the tables just fine.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • j-1064772 (3/4/2016)


    Sean Lange (3/3/2016)


    As the OP stated this data is created by multiple steps in the process. Let's say for example that InsuranceCarrierID would be required for this to be valid but that information happens in step 3. This means you have to allow NULL in the database because when the row is created it can't be known. However, logically it can't be NULL. This means you have to include business rules in the application to enforce that value be supplied in that step. As I said, this is a double edged sword with neither side being 100% correct.

    I have been impressed with the quality of your numerous contributions to SSQLCentral so I am interested in your opinion on a concept of database design: Allow or ban any persisted NULL. This has spawned a lengthy debate on the following thread (I do not remember seeing you or Jeff Moden there):

    Database Design Follies: NULL vs. NOT NULL - http://www.sqlservercentral.com/Forums/Topic1629695-3537-1.aspx

    Some contributors are advocating using a 1:1 table for each column that might or might not be NULL to avoid persisting a NULL column in the database.

    The gist of it is that if there is no specific vavlue that can be entered for a column, then there is no record in the side table reserved for that column.

    The query still returns NULL when the LEFT OUTER JOIN to the side table cannot find a matching record, but the NULL is not persisted in an actual table.

    I think the huge complexity this would spawn outweighs the theoretical benefit of a "pure" database design.

    It looks like that thread has been getting a lot of action lately. I don't want to hijack this thread with that discussion but I would say that trying to completely abolish NULL values in your database is absurd. Using some value instead of a NULL is crazy to me. You would have to be extremely careful that your NULL replacement values cannot possibly every be a valid value. And sometimes a NULL is absolutely the only right and correct value. If you don't know my birthday or my eye color how can you choose a "default" for those. What about hair color? This would only make sense to be NULL if you don't know, and even more so if all of my hair fell out. I think trying to get your system to a point where there can't be a NULL is beyond crazy and can lead to some serious performance issues with the requirement to have so many extra lookup tables.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I was too shy to express it so forcefully but yes I agree.

  • RonKyle (3/4/2016)

    I don't have a Closed column in the simplified design, but a Closed Date. Ideally this would be blank if the order is Open because the date doesn't exist, as opposed to being unknown. This is not possible with a date data type. But it doesn't warrant placing it in it's own table, requiring the primary key to be repeated.

    The date and datetime data types can be null. It adds additional logic as it always does with NULL values, regardless of data type. I'm not sure what you're saying here.

    No issues here. I of course know a date data type can be NULL. My only issue, and one for which there is no solution in an OLTP environment, is that it can't be a blank string. Ideally a NULL value means there may or may not be a value, and if there is a value, it isn't known. A person's birthdate can be Null, because everyone has them. But every open job does not have a complete date. Ideally I could insert a empty value similar to an empty string. But there's no way to do this. In an OLAP environment I can use a string value front end to distinguish between UNK and N/A, but behind the scenes I have to use a date well in the past.

    I guess I don't understand at all where you're coming from in terms of normalization. My suggestion for lookup values is how I've been building databases for going on 30 years without it being a surprise to any of the other DBAs, architects and developers I've worked with. Also, I don't understand on your need for data prior to designing your tables. I've built many systems where we had no data, at all, just an idea for what we wanted the system to do. We were able to build the tables just fine.

    I'm unsure what you mean by look up values. The design that spawned this thread seems to have columns that likely point to other tables. But that data has a 1:many, with the 1 being the lookup table data. As for designing, I've always had data to analyze when designing a system. Maybe I've been lucky, but I can't imagine how you do a design without understanding the data that will go into the database.

    It looks like that thread has been getting a lot of action lately. I don't want to hijack this thread with that discussion but I would say that trying to completely abolish NULL values in your database is absurd. Using some value instead of a NULL is crazy to me. You would have to be extremely careful that your NULL replacement values cannot possibly every be a valid value. And sometimes a NULL is absolutely the only right and correct value. If you don't know my birthday or my eye color how can you choose a "default" for those. What about hair color? This would only make sense to be NULL if you don't know, and even more so if all of my hair fell out. I think trying to get your system to a point where there can't be a NULL is beyond crazy and can lead to some serious performance issues with the requirement to have so many extra lookup tables.

    Hopefully you don't think I was advocating this. On allowing NULLs vs not allowing NULLs, I am firmly in the former camp. In my view those who don't creating unnecessary issues for themselves.

  • RonKyle (3/4/2016)


    It looks like that thread has been getting a lot of action lately. I don't want to hijack this thread with that discussion but I would say that trying to completely abolish NULL values in your database is absurd. Using some value instead of a NULL is crazy to me. You would have to be extremely careful that your NULL replacement values cannot possibly every be a valid value. And sometimes a NULL is absolutely the only right and correct value. If you don't know my birthday or my eye color how can you choose a "default" for those. What about hair color? This would only make sense to be NULL if you don't know, and even more so if all of my hair fell out. I think trying to get your system to a point where there can't be a NULL is beyond crazy and can lead to some serious performance issues with the requirement to have so many extra lookup tables.

    Hopefully you don't think I was advocating this. On allowing NULLs vs not allowing NULLs, I am firmly in the former camp. In my view those who don't creating unnecessary issues for themselves.

    I didn't get that from you at all. I was referring to the person who posted a link to that other thread and unfairly hijacking this thread with my own editorial. I should have left my comments in the other thread.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sorry guys. Neither of you appear in the other thread.:blush:

    Certainly blew up in my face.

    One lesson about the danger of cross-threading.

  • RonKyle (3/4/2016)


    No issues here. I of course know a date data type can be NULL. My only issue, and one for which there is no solution in an OLTP environment, is that it can't be a blank string.

    Perhaps I am misunderstanding you here. In a character data type, a blank string is a valid value and not the same as NULL (some of the data types in Oracle do not seen them as different, this is an implementation error in Oracle that goes right agains the ANSI standard. SQL Server never made this error).

    If a string column cannot be an empty string, add a check constaint: CHECK (Column <> ''). Either do or do not combine that with a NOT NULL constraint.

    Ideally a NULL value means there may or may not be a value, and if there is a value, it isn't known. A person's birthdate can be Null, because everyone has them. But every open job does not have a complete date. Ideally I could insert a empty value similar to an empty string. But there's no way to do this. In an OLAP environment I can use a string value front end to distinguish between UNK and N/A, but behind the scenes I have to use a date well in the past.

    Yes, you can use a string value to distinguish between UNK and N/A. And if you do, you should do so in a separate column. Let's say you have a string column PlaceOfBirth, which is unknown for some persons and n/a for some other persons (because they are legal entities). The PlaceOfBirth column should be nullable, and left NULL for both those categories. If it is important for the business to know the reason why a place of birth is missing for a person, then that should be stored in a separate column - trying to smash the distinct attributes "place of birth" and "reason place of birth is missing" in a single column, even though technically possible, would be a violation of first normal form - every attribute becomes its own distinct column.

    Change PlaceOfBirth to DateOfBirth and you change one data type, and you lose the ability to abuse a stirng column to implement a violation of first normal form. The proper design is still possible.

    As for designing, I've always had data to analyze when designing a system. Maybe I've been lucky, but I can't imagine how you do a design without understanding the data that will go into the database.

    When I don't have data available, I just make up some sample data. I have often experienced that using concrete realistic examples in discussions with subject matter experts prevents a whole lot of misunderstandings that do occur when the whole discussion is on an abstract level.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (3/4/2016)Let's say you have a string column PlaceOfBirth, which is unknown for some persons and n/a for some other persons (because they are legal entities).

    Of course this would still logically be NULL if you aren't sure if they are a person or a legal entity right because the distinguishing data is unknown right? Just playing devil's advocate here.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Does not this suggest that legal entities should be in a different table instead of mixing legal and moral persons ?

  • RonKyle (3/4/2016)

    No issues here. I of course know a date data type can be NULL. My only issue, and one for which there is no solution in an OLTP environment, is that it can't be a blank string.

    Perhaps I am misunderstanding you here. In a character data type, a blank string is a valid value and not the same as NULL (some of the data types in Oracle do not seen them as different, this is an implementation error in Oracle that goes right agains the ANSI standard. SQL Server never made this error).

    If a string column cannot be an empty string, add a check constaint: CHECK (Column <> ''). Either do or do not combine that with a NOT NULL constraint.

    Yes, you misunderstand, although maybe I could have been clearer (but don't have a whole lot of time to read over my work before posting, and in any case your ability to work in a foreign language is very impressive). If a column is some kind of character field, it can of course be nullable or an empty string. To my mind, those are very different, and I will use empty strings where others would use NULL. But I'm trying to make clear a blank string means that it is known that there is no value. I would use NULL if the existence or value of a data point is unknown.

    I'll have to review your suggested workaround at another time. But without that second column, a date field cannot be an empty string. I wish it were possible to specify in some way that the value is known not to have a value. So in this case the Null has to do double duty.

    Hope that helps.

  • Does not this suggest that legal entities should be in a different table instead of mixing legal and moral persons ?

    As always, it depends. But in this case I would use different schemas but same table names as a starting point.

  • Sean Lange (3/4/2016)


    Hugo Kornelis (3/4/2016)Let's say you have a string column PlaceOfBirth, which is unknown for some persons and n/a for some other persons (because they are legal entities).

    Of course this would still logically be NULL if you aren't sure if they are a person or a legal entity right because the distinguishing data is unknown right? Just playing devil's advocate here.

    The NULL in the PlaceOfBirth column has one and only one meaning: that we are unable to complete the fact "Person xyz was born in (...)". The information that should have gone on the ellipsis is not available, so this value is missing in the database. As far as the databse concerned, nothing beyond that is represented or can be implied from the NULL mark.

    Depending on context, the DB developers sometimes can infer further information from the NULL. But in that regard, NULL is not different from e.g. the integer value 1. For the database, a 1 represents nothing beyond just the integer number that follows 0 and precedes 2. But for me, in some contexts, it represents more - in a Gender column, it represents male (assuming ISO/IEC 5218 is used); in a CompanySize column it represents a potential customer of low interest, and in a PhoneCountryCode column it represent the USA (roughly). It's okay for database developers and users to use their knowledge of the data model to infer extra information from either "real" values (such as 1) or "missing" values (the NULL mark). But it's not okay for the database to any such inferring. For the database, 1 should always be just the number 1, and NULL should always be just a mark to represent the absence of a data value.

    j-1064772 (3/4/2016)


    Does not this suggest that legal entities should be in a different table instead of mixing legal and moral persons ?

    "It depends". Often, having a "Persons" table for the generic information and separate "LegalEntities" and "MoralPersons" tables (with one-to-one relationships) for the information specific for those categories makes a lot of sense. But not always. Sometimes we can just say that not all customers have a Chamber of Commerce number, make the column nullable, and move on. It depends on how impportant it is to have validated data, and on what we need to do with the data.

    I disagree with Ron Kyle on the different schemas idea. I would use different table names, since different entity types are represented in the population of the tables. Plus (from a more practical point of view), having the same table name in different schemas can be very confusing.

    RonKyle (3/4/2016)


    But I'm trying to make clear a blank string means that it is known that there is no value. I would use NULL if the existence or value of a data point is unknown.

    I'll have to review your suggested workaround at another time. But without that second column, a date field cannot be an empty string. I wish it were possible to specify in some way that the value is known not to have a value. So in this case the Null has to do double duty.

    A blank string does not mean that it is known that there is no value; a blank string means that it is known that there is a value, and that value is a blank string. Those are two different situations and you sometimes do have to distinguish between the two.

    What I suggested is not a workaround, it's the only correct way. But I understand the confusion, even great minds such as Chris Data and E.F. Codd have gotten this wrong. But nobody ever has the same confusion with "real" data values. Let's say that you collect phone numbers, and the majority of customers is in the USA. You would probably default the country code to 1, but it can be changed, and it can also be explicitly set to 1 (either at entry, or when someone moves from another country to the USA). No problem. Now while designing the data model you are told that for some reason, the company needs to track how many phone numbers with country code 1 have them because the default was accepted, and how many have been explicitly set or changed to 1. Would you now look for a way to store two different versions of 1, so you can distinguish between the "1 as a default" and the "1 explicitly entered" case? Or would you introduce a new column to track whether the 1 was explicitly entered or defaulted?

    After answering that, replace 1 with NULL and see check if your answer changes.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • A blank string does not mean that it is known that there is no value; a blank string means that it is known that there is a value, and that value is a blank string. Those are two different situations and you sometimes do have to distinguish between the two.

    Let me give a concrete example on this. In my ETL designs there is usually a table to take automated results. Records download, rejected, finally uploaded, etc. There is a comments column that defaults to a blank string. This is for someone to record the cause of a failure, corrective action, or some other note. Some designers would have that be NULL as they have the impression true or false that the SQL internals would handle that better. I don't consider that the comment is a blank string, I consider the blank string to represent that there is no comment. I would consider a NULL to represent that it is not know what, if any, comment there is.

    This difference might be what you describe as the developers ability to infer from data in the database, but the "database" itself cannot. I'm inferring something from the blank string that you might say that the database itself cannot infer.

    Another example: in my data warehouse ETL system a NULL for the closed date for an order that is open in the source system gets bucketed in "N/A". The same value for an order that is closed gets bucketed in "Unknown" as in combination with other information a different determination is possible.

  • RonKyle (3/5/2016)


    A blank string does not mean that it is known that there is no value; a blank string means that it is known that there is a value, and that value is a blank string. Those are two different situations and you sometimes do have to distinguish between the two.

    Let me give a concrete example on this. In my ETL designs there is usually a table to take automated results. Records download, rejected, finally uploaded, etc. There is a comments column that defaults to a blank string. This is for someone to record the cause of a failure, corrective action, or some other note. Some designers would have that be NULL as they have the impression true or false that the SQL internals would handle that better. I don't consider that the comment is a blank string, I consider the blank string to represent that there is no comment. I would consider a NULL to represent that it is not know what, if any, comment there is.

    This is a great example of how NULL and empty string are different.

    For this example, I would interpret NULL as "missing data" - perhaps it's n/a becuase there was no error, so no comment is needed; perhaps the comment has not yet been provided, or who knows what other reason - the NULL means that there is no comment.

    An empty string in that same comment column, I would interpret as "left blank intentionally". So perhaps there was an error but it is a common and well-known one and there is no need to provide a comment. Or perhaps an engineer was too lazy to type a comment and left it blank. Either way, someone made a conscious decision to leave the comment box on the screen empty, or perhaps evne to remove a default if one was provided.

    This difference might be what you describe as the developers ability to infer from data in the database, but the "database" itself cannot. I'm inferring something from the blank string that you might say that the database itself cannot infer.

    In either case we are infering something from the data that the database cannot. However, the interpretation by the database is, in my opinion, more consistent with my interpretation than with yours.

    E.g. run a query to get the average length of comments provided. LEN('') is 0 which reduces the average length; LEN(NULL) is NULL and is not included in the average. My interpretation of '' is that someone deliberately provided a zero-length comment, so this should reduce the average length; NULL values are not comments and should not be included when computing the average length of provided comments. Your interpretation of '' as "there is no comment" and NULL as "there is a comment but I don't know what it is" will give incorrect results, since all the rows where the comment is not applicable will be counted as zero and the average will be way too low.

    Another example: in my data warehouse ETL system a NULL for the closed date for an order that is open in the source system gets bucketed in "N/A". The same value for an order that is closed gets bucketed in "Unknown" as in combination with other information a different determination is possible.

    And that is a good example of how I believe it should be done.

    NULL in closed date means that there is no closed data available for that order. In your case you apparently need to distinguish between not available because n/a and not available because unknown, and you do that by looking at another attribute that defines which of the two cases applies. For strings, you should follow the same logic.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • I am somewhat surprised as to how much of conversation this question has raised.

    Replies to multiple comments:

    This is a new system that will replace an existing system, however additional requirements have been requested. I do not have access to the existing database, but that would not have all the data is required. The SME has an end state in mind but has some difficulties in articulating everything.

    Data tables will be normalized.

    There will be lookup tables where applicable (an example was State) to enforce CHECK constraints

    Referential integrity will be enforced

    Ten Centuries: A single commit would not involve multiple tables. The child tables would be updated at different times during the workflow, that could be days or weeks apart.

    To continue Sean’s example of multiple steps – there will be multiple data items required at different steps, none of which are know when the initial (base) record is created. The idea was that those fields would be placed in a table that would be completed during that step.

    To Hugo’s comment “I agree that having an explicit column to show the step (either as a number or as a code) is probably a much better idea.” I’m concerned that this would require additional logic to create a (somewhat automated) workflow. Under ideal situations this would work, but are some time/cost constraints.

    Lots of interesting reading that I need to digest.

Viewing 15 posts - 31 through 45 (of 46 total)

You must be logged in to reply to this topic. Login to reply