Brent Ozar (blog|twitter) is our host this month, which will make this the most popular #tsql2sday to date I’m sure. He’s asked us to talk about our favorite (or least favorite) data type. I can actually guess what a few people are going to write about but that’s neither here nor there. I actually want to talk about a group of data types more than an individual one.
Did you know that there are four different integer datatypes? It’s not just int.
Data type | Range | Storage |
---|---|---|
bigint | -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807) | 8 Bytes |
int | -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647) | 4 Bytes |
smallint | -2^15 (-32,768) to 2^15-1 (32,767) | 2 Bytes |
tinyint | 0 to 255 | 1 Byte |
If you look at the above chart you’ll see that each of the four is not only different in size, but dramatically different in the number of values it can store. Most people just use int everywhere. And while int will work the vast majority of the time, that doesn’t mean it’s the correct datatype. A very simple example would be using the int datatype for an identity column in a table where you are going to be adding millions of rows a day. In fact I recently read an article where a social media site did this. Unsurprisingly in a very short amount of time they hit the 2 billion row limit for positive values (because of course they didn’t start at the min negative value) and crashed. This would have been a time where using bigint would have been more appropriate. Just so you are aware, at one million rows a second it would take almost three hundred thousand years to fill a bigint. And that’s just the positive values. Now you pay for that in space. bigint is 4 bytes larger than int translating into a little less than 4mb per million rows increase. It doesn’t seem like much but at the point where you need to switch to bigint you are paying 16gb (~4.2 billion rows * 4 bytes more than int / 1024 / 1024 = ~8000). And that’s just where you are starting.
Now, speaking of space, let’s go the other direction. Let’s say you have a data warehouse with lookup tables for employees and another for sales districts. You decide to use an integer identity column. Now, 4 billion employees seems like a bit much even for a large company. How about 64 thousand? It’s highly unlikely that you’ll ever have that many employees even accounting for attrition and growth. So smallint is a reasonable possibility. There is also tinyint but that would only let us have 256 employees which even for a small company over time might not be enough. So smallint it is. Of course at 64k rows that’s a grand savings of 125kb. (~64 thousand rows * 2 bytes less than int / 1024). That’s not really that much hu? And for our sales districts table we could probably use that tinyint which will net us a grand total of 768 bytes. Not even 1 kb. (256 rows * 3 bytes less than int). So why bother?
Well, let’s say our multi billion row table is our fact table. Just by changing these two lookup values we’ve saved more than we lost by switching to bigint. Now try that with 10 or 20 lookup tables. The savings can add up pretty dramatically.