Disk Is Cheap! ORLY?

  • Thomas-282729 (1/1/2011)


    I could make the same argument against any data type. "Think of the savings you are wasting by using a 4-byte integer instead of a 2-byte integer for a primary key!" In the grander scheme of things, what matters is whether it affects actual real-world performance given actual data sizes. The vast majority of databases do not even contain a million rows, much less 10 million, in any one table, and thus tiny differences in datatypes will make no perceptible difference. What does make a significant difference are data integrity rules.

    All things being equal, there is no reason not to take advantage of methods to be as efficient as you can. We are not talking about data integrity; we are talking about data validity. If your system has good reason to make use of an INT instead of TINYINT or DATETIME instead of SMALLDATETIME then by all means do that as you would be negligent not to. I am talking about not being wasteful. And I do not buy the argument that the "vast majority of databases do not even contain a million rows". To whatever extent that is true at the moment, it is a decreasing trend.

    This is more an argument about archive tables than it is about in-use tables. I've seen a few systems that had one or two tables in the 50-million-plus row range, and those were primarily archive tables, auditing tables, or data warehouses (where you can play tricks to cut down the pointer size). However, let's not forget that archive tables serve a very different purpose than in-use tables.

    Yes, they have different purposes, but there are benefits to be gained from some foresight for both archive and transactional tables, as has already been discussed.

    When modeling, the choice of implementation data type (vs the data type in the model) is secondary. You do not care whether you use an int, smallint, tinyint or bit when modeling a database. What matters is that the value is an integer with some given range. That the DBA decides they can use a smallint instead of a 4-byte integer is done at the implementation stage. Still important, but not a part of modeling.

    Data modeling has two parts: logical, then physical. An implementation that is far removed from the reasoning behind the logical design is much more prone to error. And again, nobody is talking about constraining the logical model to fit the physical model. We are talking about making good implementation choices based on the logical model and the reasoning behind it.

    A few extra minutes of work now can save many hours of work in the future so why not do it?

    Because it can also cost you a significant amount of development time. If you choose a TINYINT, for example, and after the system goes into production it is determined that this is too small, it is likely that quite a bit will have to be changed.

    This is why we are talking about making reasonable choices as opposed to arbitrarily using the smallest datatypes available. In my experience, I have spent more time reducing datatypes than increasing them because the vast majority of the time people choose INT over TINYINT to store values from 1 to 5.

    Completely disagree. I've run into many issues with that ridiculous max value. Example 1: you have a table that tracks when an entity is "valid". You have a start and end date. Null end date = currently available. The client-side code wants to use a max datetime value to represent that some entity is valid in perpetuity. In most programming languages, that date is 9999-12-31. Can't do it with smalldatetime. You have to store a mystery constant somewhere in the middle-tier code that is used to represent that value. Example 2: copyrights. 10 years ago, you might have thought it safe that no copyright could extend beyond 2079 for something today or in the past. Surprise, surprise: thanks to Disney, now it is 100 years. Example 3: contracts. Yes, I've run into contracts that specify certain rights 20 and 50 years into the future. Saving those four bytes, in the long run, simply does not provide nearly enough benefit to justify it. If you are worried about storing time values, then add a check constraint which prevents time values from being stored in that datetime field.
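
    A minimal sketch of that check-constraint approach (table and column names are hypothetical, using the pre-SQL 2008 DATEADD/DATEDIFF idiom to strip the time portion):

    -- Reject any ValidUntil value that carries a time portion; NULL still means "valid in perpetuity".
    CREATE TABLE dbo.EntityLicense
    (
        EntityLicenseID INT      NOT NULL IDENTITY(1, 1) PRIMARY KEY,
        ValidFrom       DATETIME NOT NULL,
        ValidUntil      DATETIME NULL,
        CONSTRAINT CK_EntityLicense_ValidUntil_DateOnly
            CHECK (ValidUntil IS NULL
                   OR ValidUntil = DATEADD(DAY, DATEDIFF(DAY, 0, ValidUntil), 0))
    );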

    You only disagree because you missed the point I was making. The point is: be sensible. That means using the best datatype for the purpose. Using an INT to store values 1 - 5 is not sensible. Using a DATETIME to store future dates that might be 20 - 40 years out is entirely sensible and nobody would begrudge you that choice. So your examples are all instances where the business-case dictates that DATETIME is the correct choice. Great. This takes nothing away from what I (and others) have been saying. And that you found an edge-case where copyright law was changed is just that: an edge-case. And if there is some distinction between "currently available" and "always available", then again that is either a good case for DATETIME or maybe a different modeling choice to begin with.

    As for conversion, you had better hope that your ORM does not account for the difference between smalldatetime and datetime. If it does, that means additional unit testing to ensure that the change of the datatype will not break the existing code, or that the code accommodates the expanded values.

    1) People use ORMs because they provide an easy data-layer. It takes little to no effort to update the ORM model.

    2) Why would you even mention testing in the sense that it might not happen? Don't all changes get tested? Not only was the article about future planning, which implies that it is up-front work that is already being tested, but even if we go back to make changes, the assumption is that regression testing is still being done. Regression (not Unit) testing is assumed to be part of a project whether it is a new feature or an updated feature (which includes changes to datatypes).


  • rlobbe (1/1/2011)


    What no one has mentioned is that you also have to consider the mechanics of the I/O.

    SQL Server is going to fetch a block (or several) at a time, not an arbitrary number of bytes.

    Actually, this specifically was stated in the article. I mentioned that smaller rows give a greater probability of more rows fitting onto a page. However, I mentioned it from the forward-looking perspective in that using INT PKs for LookUp tables that have values that never even get close to 255 causes tables that have 5 or so FKs to various LookUp tables to have larger than necessary rows and hence fewer fit on a page.
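
    For illustration (hypothetical tables): each lookup FK declared as TINYINT instead of INT saves 3 bytes per row in the referencing table (and in every index that carries it), which is what lets more rows fit on each 8 KB page.

    CREATE TABLE dbo.OrderStatus
    (
        OrderStatusID TINYINT     NOT NULL PRIMARY KEY,  -- a handful of values, nowhere near 255
        StatusName    VARCHAR(50) NOT NULL
    );

    CREATE TABLE dbo.CustomerOrder
    (
        CustomerOrderID INT      NOT NULL IDENTITY(1, 1) PRIMARY KEY,
        OrderStatusID   TINYINT  NOT NULL REFERENCES dbo.OrderStatus (OrderStatusID),
        OrderDate       DATETIME NOT NULL
        -- ...plus several more TINYINT lookup FKs, each 3 bytes narrower than an INT
    );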


  • Nadrek (1/3/2011)


    Essentially; know your system, know your data, know your bottlenecks, know the plateaus, know your growth patterns, and keep an eye on future technology.

    You have limited CPU, limited disk space, limited disk IO, limited memory (note: taking more space on disk also takes more space in RAM), limited time, limited concurrency, and limited humans. All these factors must be balanced.

    Yes, this is exactly the point I was making.

    Now, if SQL Server gave us a comprehensive range of integer datatypes, we'd be a lot better off; we should have both signed and unsigned 8, 16, 32, and 64 bit integers. We don't.

    I agree: UN-signed INTs (of the various sizes) would be great!
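
    In the meantime, one workaround (not a true unsigned integer; a sketch with a hypothetical table) is to seed an IDENTITY at the type's minimum value so the full range of the signed type is usable for surrogate keys:

    CREATE TABLE dbo.EventLog
    (
        EventLogID INT          NOT NULL IDENTITY(-2147483648, 1) PRIMARY KEY,  -- ~4.29 billion values instead of ~2.15 billion
        EventDate  DATETIME     NOT NULL,
        EventText  VARCHAR(500) NOT NULL
    );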


  • Charles Kincaid (1/3/2011)


    You have to apply all of these through a filter of sense.

    Of course. That is what the article was about: making sensible/reasonable choices.


  • timothyawiseman (1/3/2011)


    When I am thinking about the trade-offs for creating an index, I will consider carefully how often the table is written to and the trade-offs in write performance, but I won't give a second thought to the disk space because it is cheap relative to making my users wait. I will create complicated and large indexed views in the same way.

    Similarly, I tend to log procedures extensively. This costs disk space, but that disk space is cheap compared to having a problem in production and not being able to track down in detail when, why, how, and who caused it along with enough information to know what the answer should have been.

    I completely agree. And that is the point of the article: since the table is the foundation for where that data will keep showing up (indexes, log tables, etc.), making good choices at the beginning makes it that much easier to create whatever indexes and logging are needed. Those are valid business reasons to use disk space, and their impact on performance will be "appropriate" for the benefit of having the feature, provided the columns making them up were not arbitrarily wasting space to begin with.

    So, I agree, it was a good article with a good point and one that developers and DBAs should remember. I just want to caution against taking it to an extreme. Disk space should never be wasted, but at least in the situations I regularly see, trading disk space for detailed logging or improved performance is almost always a good trade. Disk space is cheap enough to justify that.

    Thanks. And the intention of the article was not that anyone would go to the extreme but instead they would start thinking about things that they previously had not and in doing so make better choices. It is not a matter of sacrificing something that is necessary; it is a matter of not doing things that are unnecessary. I am not saying: do not use the disk. I am saying: try to use disk wisely because it is far too easy to use it foolishly.


  • Wish I had a few more databases of sub-billion row tables:-P

    The thing that has made MPP so attractive in the BI world is the explosion in data volumes. Not only are people wanting to analyse vast data sets but they are wanting to do it in near real time.

    Clickstream data for a fair-sized e-commerce site will easily blast through a billion records in a table. Anything that collects mechanical data is likely to have to deal with huge data volumes. Jodrell Bank generates 6 TB/minute, or it did 3 years ago.

  • Fantastic article with some really neat real-world examples. Thanks for taking the time to write it! 😀

  • I know I am a few months late to this game, but I wanted to give you a compliment and issue a small caveat.

    Compliment: great article. It is reassuring, for me anyway as an RDBMS-agnostic designer, that I wholeheartedly agree with your physical modeling approaches. Sizing indexes by understanding column data types is critically important. Nulls matter, as does premature denormalization. I already apply 9 of the 10 suggestions in practice and all 10 in spirit, which leads me to the caveat.

    Caveat: If you are in a Microsoft technology-only shop with .NET, C# and other such technologies then no worries. Otherwise be careful with using non-ANSI standard data types like "tinyint" and "bit". The issue isn't one of ANSI compliance for its own sake, but for the sake of compatibility with non-Microsoft external drivers, add-on reporting systems and integration tools.

    Once again thanks for the confirmation on my physical data modeling approaches. - Todd

  • Todd M. Owens (4/26/2011)


    Compliment: great article.

    Caveat: If you are in a Microsoft technology-only shop with .NET, C# and other such technologies then no worries. Otherwise be careful with using non-ANSI standard data types like "tinyint" and "bit".

    Hello Todd. Thanks for the compliment :-). And thanks for mentioning the potential issue with non-standard datatypes. I do not specifically deal with that caveat for two reasons. First, I had not thought of it since I have never run into any problems ;-). Second, I feel that by properly testing, unintended consequences from datatype decisions will be found. So if one starts out in their development environment with these changes, then any problems will not even make it to the QA environment.


    Oh, another thing that chews up space is thoughtlessly adding or misusing a set of audit columns on every single table regardless of how the table is used. For example, if a table only allows inserts in a version-type approach, there is no reason to add columns that capture UpdateDate and/or UpdateUserID. If a table's use case(s) calls for a row to be updated, only then are the Update fields necessary.

    And please think twice before setting the CreateDate and the UpdateDate to the same exact value on insert: this is a clear violation of the limited redundancy principle. If 90% or more of the rows are never updated, then there will be an extraneous 8 bytes on each row from the UpdateDate; multiply that by the number of rows where it should be null, and by every table where it is misused, and the numbers get quite large quite quickly.

  • Todd M. Owens (4/27/2011)


    ...if a table only allows inserts in a version type approach, there is no reason to add columns that capture UpdateDate and/or UpdateUserID. If a table's use case(s) calls for a row to be updated, only then are the Update fields necessary.

    And please think twice before setting the CreateDate and the UpdateDate to the same exact value on insert...If 90% or more of the columns are never updated, then there will be an extraneous 8 bytes on each row from the UpdateDate...

    Hey there. I definitely agree with both of these and practice them myself.

    There might be a circumstance, however, where it makes sense to have both CreateDate and UpdateDate set upon the initial INSERT (and hence UpdateDate will be the same value as CreateDate if the row is never updated). I have seen cases where an ETL process looks at the UpdateDate field to know if it should grab that row and has the UpdateDate field indexed for that purpose. In this case UpdateDate serves dual duty by indicating both new and updated rows without having to scan both the CreateDate and UpdateDate fields. As for having only a single field (assuming someone might suggest not having a CreateDate field in this case), it is still helpful to keep CreateDate, since that information is gone once the row is updated if you have only the single UpdateDate field. But outside of this specific case I tend to agree that the UpdateDate field should be NULL.
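
    A minimal sketch of that dual-duty pattern (table, index, and variable names are hypothetical; application code or a trigger is assumed to refresh UpdateDate on every UPDATE):

    CREATE TABLE dbo.Customer
    (
        CustomerID   INT          NOT NULL IDENTITY(1, 1) PRIMARY KEY,
        CustomerName VARCHAR(100) NOT NULL,
        CreateDate   DATETIME     NOT NULL CONSTRAINT DF_Customer_CreateDate DEFAULT (GETDATE()),
        UpdateDate   DATETIME     NOT NULL CONSTRAINT DF_Customer_UpdateDate DEFAULT (GETDATE())
    );

    CREATE NONCLUSTERED INDEX IX_Customer_UpdateDate ON dbo.Customer (UpdateDate);

    -- The ETL can find new AND changed rows with one seek on the single indexed column:
    DECLARE @LastRunTime DATETIME = '2011-04-27T00:00:00';  -- placeholder; supplied by the ETL process

    SELECT CustomerID, CustomerName, CreateDate, UpdateDate
    FROM   dbo.Customer
    WHERE  UpdateDate >= @LastRunTime;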

    Take care,

    Solomon...


  • Todd M. Owens (4/27/2011)


    And please think twice before setting the CreateDate and the UpdateDate to the same exact value on insert: this is a clear violation of the limited redundancy principle. If 90% or more of the rows are never updated, then there will be an extraneous 8 bytes on each row from the UpdateDate; multiply that by the number of rows where it should be null, and by every table where it is misused, and the numbers get quite large quite quickly.

    Hey Todd. I did not notice until just a moment ago, when I was working on something else, that this statement is not entirely true. While I still agree that using NULL for UpdateDate when a record has not yet been updated is a better practice, that is merely a logical consideration (outside of the potential indexing issue I mentioned before). But a fixed-width column (numeric types, date/time types, and char/nchar/binary types) will always take up the same amount of space regardless of whether it is set to NULL or a value. Meaning, if UpdateDate is a DATETIME field, it will always take up 8 bytes even if it is NULL. And to make sure, I did just test this out.
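
    The exact test is not shown here, but one way to reproduce it (hypothetical tables, 10,000 rows each) is to compare the data size of two identical tables, one with the DATETIME column populated and one left NULL:

    CREATE TABLE dbo.TestWithValue (ID INT NOT NULL, UpdateDate DATETIME NULL);
    CREATE TABLE dbo.TestWithNull  (ID INT NOT NULL, UpdateDate DATETIME NULL);

    INSERT INTO dbo.TestWithValue (ID, UpdateDate)
        SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), GETDATE()
        FROM   master.sys.all_columns ac1
        CROSS JOIN master.sys.all_columns ac2;

    INSERT INTO dbo.TestWithNull (ID, UpdateDate)
        SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), CAST(NULL AS DATETIME)
        FROM   master.sys.all_columns ac1
        CROSS JOIN master.sys.all_columns ac2;

    -- Both report the same "data" size: the fixed-width DATETIME costs 8 bytes per row either way.
    EXEC sp_spaceused N'dbo.TestWithValue';
    EXEC sp_spaceused N'dbo.TestWithNull';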

    Take care,

    Solomon...


    But a fixed-width column (numeric types, date/time types, and char/nchar/binary types) will always take up the same amount of space regardless of whether it is set to NULL or a value. Meaning, if UpdateDate is a DATETIME field, it will always take up 8 bytes even if it is NULL. And to make sure, I did just test this out.

    Hmm, I wonder if that is true for Oracle, DB2, Teradata, et al....

  • Solomon,

    I've recently had the need to refer to this article again and I thank you from the bottom of my heart. This "right sizing on steroids" article should be required reading for anyone and everyone who has or will ever write the words CREATE TABLE. My hat is off to you, good Sir. Well done!

    --Jeff Moden



  • Jeff Moden (10/5/2012)


    Solomon,

    I've recently had the need to refer to this article again and I thank you from the bottom of my heart. This "right sizing on steroids" article should be required reading for anyone and everyone who has or will ever write the words CREATE TABLE. My hat is off to you, good Sir. Well done!

    Hi Jeff, and thank you very much for such positive feedback. I apologize for the delayed response, but a new baby came along less than 2 weeks before your comment and things have been, well, C-R-A-Z-Y!! :hehe: Also, I had thought of a few minor additions to make and wanted to get them in and published before replying.

    I have added:

    • A reference to decreased Page Life Expectancy in the paragraph about memory usage
    • Info about several other datatypes at the end of the main recommendations list, things like: SMALLMONEY, REAL, TIME, etc.
    • Info about some newer features that can help save space in addition to (not in place of) a good model. Namely: SPARSE columns and Row / Page compression (a brief sketch of both follows below)
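
    A brief sketch of those two features (hypothetical tables; note that SPARSE columns and row/page compression cannot be combined on the same table):

    -- SPARSE: worthwhile when a nullable column is NULL in the vast majority of rows
    CREATE TABLE dbo.Product
    (
        ProductID      INT          NOT NULL IDENTITY(1, 1) PRIMARY KEY,
        ProductName    VARCHAR(100) NOT NULL,
        DiscontinuedOn DATETIME     SPARSE NULL
    );

    -- ROW / PAGE compression: can be applied to an existing table without remodeling it
    CREATE TABLE dbo.OrderHistory
    (
        OrderHistoryID INT          NOT NULL IDENTITY(1, 1) PRIMARY KEY,
        OrderDate      DATETIME     NOT NULL,
        StatusNote     VARCHAR(500) NULL
    );

    ALTER TABLE dbo.OrderHistory REBUILD WITH (DATA_COMPRESSION = PAGE);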

    Hopefully the updates make for an even stronger article :-).

    Take care,

    Solomon...

