October 24, 2014 at 10:16 am
As coincidence would have it, this editorial came out one day after I was thinking strongly that we are missing a few important types in SQL Server.
Let's start with something very close to the topic (in reverse).
The type varchar(N) is stored as 2 bytes for the length, followed by one byte per character (2+N bytes total). Check your tables and you will see that for the majority of columns this is rather wasteful, as most column data is shorter than 256 characters (or even 64). Codes and logical names, as well as most labels, are typically rather short!
My thought was that I miss smallvarchar, smallnvarchar and smallvarbinary types; the guaranteed small size of these types could work well with some special optimizations during query plan generation.
Another thought was to make the number of bytes used to store the length dynamic.
Using two bytes to store the length of a 6-byte "code" is insane, any way you look at it.
A scheme like UTF-8 could be used, where the most common lengths (<128) are stored in just one byte (an even more efficient encoding is possible).
This gives a nice idea: http://en.wikipedia.org/wiki/UTF-8#Description
Such a scheme would also be compatible with page sizes larger than 8K.
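To make the idea concrete, here is a minimal Python sketch of such a UTF-8-style length prefix. This is purely an illustration of the encoding concept, not anything SQL Server actually does; the two-byte layout and its bias are my own assumptions.

```python
def encode_length(n: int) -> bytes:
    """Encode a length prefix: one byte for n < 128, two bytes otherwise.

    Like UTF-8, the high bit of the first byte signals a longer form:
      0xxxxxxx           -> lengths 0..127 in one byte
      1xxxxxxx xxxxxxxx  -> lengths 128..32895 in two bytes (biased by 128)
    """
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 128:
        return bytes([n])
    n -= 128  # bias so the two-byte form starts where the one-byte form ends
    if n < (1 << 15):
        return bytes([0x80 | (n >> 8), n & 0xFF])
    raise ValueError("length too large for this sketch")


def decode_length(buf: bytes) -> tuple[int, int]:
    """Return (length, number of prefix bytes consumed)."""
    first = buf[0]
    if first < 0x80:
        return first, 1
    return (((first & 0x7F) << 8) | buf[1]) + 128, 2
```

With this, a 6-byte "code" carries a single byte of length overhead instead of two, while longer values still fit in a two-byte prefix.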
The problem I worked on
I have to store IP addresses in a log table, and always storing 16 bytes (for IPv6) is out of the question. For IPv4 I would need just 4 bytes, and it can be stored as an integer with some simple conversion logic. Future-proofing a system for IPv6 means either separate nullable fields for IPv4 and IPv6, which can complicate code, or a mapping table of sorts, and both are undesirable to me (for this).
In the end I settled on varbinary(16) for this purpose, as it converts simply to int (just cast it), and that is enough for IPv4, which we start with. Besides the length difference (4 bytes for IPv4 versus 16 for IPv6), the stored length also records whether an IPv4 or IPv6 address was supplied before logging. Still, using two bytes for that length pains me; I would rather have used just one byte.
What we are really missing is an IP type that handles this sort of common thing, and does so very efficiently!
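The varbinary(16) trick can be sketched outside SQL Server too. Here is a small Python illustration using the standard `ipaddress` module, where the packed length alone (4 vs. 16 bytes) distinguishes the two families, just as the varbinary length does in the table. The helper names are my own, for illustration only.

```python
import ipaddress


def pack_ip(text: str) -> bytes:
    """Pack an IP address into 4 bytes (IPv4) or 16 bytes (IPv6).

    The stored length alone tells the two families apart, mirroring how
    a varbinary(16) column can hold either without an extra flag column.
    """
    return ipaddress.ip_address(text).packed


def unpack_ip(raw: bytes) -> str:
    """Reverse of pack_ip; the byte count selects the address family."""
    return str(ipaddress.ip_address(raw))
```

An IPv4 address costs 4 bytes of data (plus the varbinary length prefix), while IPv6 costs the full 16, which matches the storage behavior described above.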
Other frequent problems with data type growth
Deciding when to use tinyint, smallint, int or bigint often brings competing requirements in conflict.
Do we know for certain and in advance how much "number space" a table is going to require?
Often the answer is NO and we usually allocate too much space just to be on the safe side.
Just as text is variable in size by nature, with extreme cases being the exception rather than the norm, so are integers. I would love a variable-length encoded bigint (let's call it varbigint) that uses no more than a byte for very small numbers, yet is treated in all regards (except storage) as a true bigint. Such a type would make an ideal primary key for log tables and other ever-growing tables. It would also serve well for numerical columns that are almost never updated with a value requiring a different amount of storage than the previously stored one, especially when they are nullable too! There are lots of candidates here, as much data is only ever recorded, and small values are more frequent than large ones!
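One well-known encoding with exactly this behavior is the protobuf-style varint: 7 bits of payload per byte, with the high bit marking continuation. A minimal Python sketch, just to illustrate the storage profile a hypothetical varbigint could have (this is not a SQL Server feature):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant
    group first; the high bit of each byte says another byte follows."""
    if n < 0:
        raise ValueError("non-negative values only in this sketch")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes to come
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)


def decode_varint(buf: bytes) -> int:
    result = shift = 0
    for byte in buf:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return result
```

A log table's first rows would pay one byte per key instead of eight, growing only as the counter does; the in-memory value is an ordinary integer throughout.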
While this would save a lot of space and make modeling choices easier (very valuable and cost reducing), I do not see this as compression, since no patterns between values are being exploited. The deprecation of vardecimal in SQL Server 2014 in favor of compression makes this a rather unlikely scenario to ever see the light of day. It seems Microsoft is hell-bent on making every "bad call" they can possibly make regarding SQL Server, especially given that compression is an Enterprise-only feature!
I better stop ranting now 🙁
October 24, 2014 at 11:34 am
I do like some of your ideas, Peter, but my concern would be the performance overhead. What happens when updating a variably sized number to a value larger than the currently allocated size? What about inserts that also affect paging? What about considerations such as built-in numeric functions - should they "automatically" and intelligently (re)size return values?
Also, some of them remind me of the COM VARIANT type, which was just bad news.
Gaz
-- Stop your grinnin' and drop your linen...they're everywhere!!!
October 24, 2014 at 11:51 am
Gary Varga (10/24/2014)
I do like some of your ideas, Peter, but my concern would be the performance overhead. What happens when updating a variably sized number to a value larger than the currently allocated size? What about inserts that also affect paging? What about considerations such as built-in numeric functions - should they "automatically" and intelligently (re)size return values? Also, some of them remind me of the COM VARIANT type, which was just bad news.
Well, I think most of your concerns are less of an issue than you might expect. How a value is stored is invisible from the relational perspective: you decode when you read and encode when you write. In memory it is just like any other bigint, varchar, etc. You are right in pointing out that updating variable-size types costs performance... when the size changes. In my opinion that happens far less frequently with integers than with, say, a varchar. So if the storage engine can handle varchar well, it can certainly handle variable-length integers well.
October 24, 2014 at 12:57 pm
To be honest, Peter, I was just voicing concerns that I would want calculated at the design stage and measured in a prototype. If I were part of the development team, that is.
Gaz
-- Stop your grinnin' and drop your linen...they're everywhere!!!