August 2, 2017 at 12:54 am
Sorry if this is a FAQ. I Googled and searched the forum before posting, but couldn't find an answer, specifically re: large number of NULLS.
My table has ~ 3M records. I have a column that SAS has automatically created as VARCHAR(2) (it creates everything as VARCHAR - I'm not sure if that's a function of SAS or the ODBC driver).
The vast majority of the columns are NULL. Here is a frequency distribution
MYCOLUMN COUNT PERCENT 2417871 . 01 3025 12.52951166 02 288 1.1928923497 03 2094 8.6733214596 04 191 0.7911195792 05 3512 14.546659487 06 603 2.4976183573 07 484 2.0047218656 08 211 0.8739593257 09 41 0.1698214803 10 294 1.2177442737 11 654 2.7088597109 12 235 0.9733670215 13 1022 4.2331110467 14 76 0.3147910367 15 6563 27.183862817 16 12 0.0497038479 17 131 0.5426003396 18 16 0.0662717972 19 3004 12.442529926 20 1662 6.883982935 99 25 0.1035496831
So, if I have 2.4M NULL values, and the rest are 2 characters, is CHAR(2) or VARCHAR(2) more efficient? What if the table is say 100M rows?
Or am I splitting hairs?
I know VARCHAR(n) = actual data length + 2 bytes: https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql
What I'm unsure about is the best approach when most values are NULL? For much of my data, the length *doesn't* vary much within a column, but it can vary from mostly NULL to mostly NOT NULL across columns.
Refactoring this 400 column table is a PITA, so if there's a general (time saving) rule like "use varchar if length < 10", let me know and I'll run with that.
I'm happy to be educated, so if I've missed an obvious link that explains this my apologies, please point me in the right direction.
August 2, 2017 at 5:51 am
There's not going to be a giant performance difference between CHAR(2) and VARCHAR(2) when it comes to NULL values. When the value is NULL, there is no length stored with the VARCHAR because there is no length to record. NULL doesn't have a length. It's an empty set.
For me, if you have a value that is absolutely fixed at 2, then CHAR(2) makes more sense. It's always going to be 2, so why record the length. However, if there are values in there with a length of 1 too... it's a dance as to which makes more sense. It's possible that INSERT & UPDATE operations might be a little bit faster for the fixed length column. However, if this data doesn't get lots of that type of manipulation, who cares. Further, if there is a VARCHAR(400) right next to the VARCHAR(2) and they're both getting updated, the time saving you might achieve on that CHAR(2) is not worth it for most of us, most of the time.
Notice the weasel words. In some instances, if you can shave 5ms of a query, that's an enormous savings. It's my experience that most of the time, peoples queries are running in 2-3 seconds and if we can get them down to 500ms, it's an enormous win. Very seldom are people running queries in the 25ms range that are desperate to get another 5ms off of it. However, it absolutely does happen, hence the weasel words.
In short, the real answer is, it depends, on your situation. If you are looking for that last millisecond or three to shave off, for very small character lengths, it's worth testing to see if, in your situation, CHAR vs. VARCHAR makes a difference.
Just don't do VARCHAR(1). That makes me crazy.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 2, 2017 at 9:37 am
I tend to agree with Grant. Ultimately, if you need to change to a large value, I don't know if the CHAR->larger VARCHAR is worse than a VARCHAR->larger VARCHAR. In either case, it's a schema change and you'll likely end up rearranging data on pages when you update it.
I tend to CHAR() only for values that are fixed in size, like a countrycode at 3. Even then, usually only if it's 5 or less. Beyond that length, I rarely believe anyone truly has thought through the future data modeling implications, so I just go ahead and allow for more.
Here, I wouldn't mess with it. There might be some weird thing in the app that causes an issue with CHAR.
August 3, 2017 at 9:36 pm
Thanks all...
When I posted about VARCHAR(2) vs. CHAR(2), I meant it to just be an illustrative example. I was hoping to wrap my head about VARCHAR vs. CHAR in general.
However, here's what I did:
* It was easier to just let SAS create the initial table via the ODBC driver.
* In SSMS, I global search and replaced all VARCHAR(1) to CHAR(1) and VARCHAR(2) to CHAR(2)
* For VARCHAR(3), if it was mostly NULL (0+2), I still changed it to CHAR(3)
* For the others, I just left them VARCHAR(#). It wasn't worth my programmer time to get a frequency distribution (via SAS) or SELECT DISTINCT and then review the existing data.
* My understanding is that CHAR is a little more efficient than VARCHAR, since it doesn't have the "two byte to determine the length of the data" processing overhead.
* However, for wide, varying length data, VARCHAR gives an I/O benefit due to smaller table size.
* I was more concerned about performance vs. table size.
* Once the table schema was modified, I changed the ETL to truncate and append, which preserved the improved schema.
I'll reference your replies if I have to undertake this exercise again in the future. Thanks again.
August 4, 2017 at 12:32 pm
If these columns will potentially be updated at a later point, then stick with CHAR, because updating NULL-able VARCHAR columns can result in change of row size, page splits, and fragmentation. However, setting page fill to something like 70 - 80% can mitigate this.
Also, if you are concerned about how the column definitions will ultimately impact file allocation size, consider that row or page level compression can also mitigate storage requirement of mostly NULL columns.
Still, 3 million row tables are not considered large. The design of OLTP tables matter a lot, but folks don't expect much from the design of staging tables. If these tables get truncated and reloaded by SAS for reporting purposes, then I wouldn't worry about it too much.
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
Viewing 5 posts - 1 through 4 (of 4 total)
You must be logged in to reply to this topic. Login to reply