VARCHAR vs. CHAR if the column has mostly NULLs?

  • Sorry if this is a FAQ.  I Googled and searched the forum before posting, but couldn't find an answer, specifically regarding a large number of NULLs.

    My table has ~ 3M records.  I have a column that SAS has automatically created as VARCHAR(2) (it creates everything as VARCHAR - I'm not sure if that's a function of SAS or the ODBC driver).

    The vast majority of the values in this column are NULL.  Here is a frequency distribution:

    MYCOLUMN     COUNT   PERCENT
    (NULL)     2417871   .
    01            3025   12.52951166
    02             288    1.1928923497
    03            2094    8.6733214596
    04             191    0.7911195792
    05            3512   14.546659487
    06             603    2.4976183573
    07             484    2.0047218656
    08             211    0.8739593257
    09              41    0.1698214803
    10             294    1.2177442737
    11             654    2.7088597109
    12             235    0.9733670215
    13            1022    4.2331110467
    14              76    0.3147910367
    15            6563   27.183862817
    16              12    0.0497038479
    17             131    0.5426003396
    18              16    0.0662717972
    19            3004   12.442529926
    20            1662    6.883982935
    99              25    0.1035496831

    (PERCENT is calculated over the 24,143 non-NULL rows.)
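
    For reference, a rough T-SQL equivalent of that SAS frequency distribution (dbo.MyTable is a stand-in for my real table name) would be something like:

        -- PERCENT is computed over non-NULL rows only, to match the
        -- SAS output above.
        SELECT MYCOLUMN,
               COUNT(*) AS [COUNT],
               CASE WHEN MYCOLUMN IS NOT NULL
                    THEN 100.0 * COUNT(*)
                         / SUM(CASE WHEN MYCOLUMN IS NOT NULL
                                    THEN COUNT(*) END) OVER ()
               END AS [PERCENT]
        FROM dbo.MyTable
        GROUP BY MYCOLUMN
        ORDER BY MYCOLUMN;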

    So, if I have 2.4M NULL values, and the rest are 2 characters, is CHAR(2) or VARCHAR(2) more efficient?  What if the table is, say, 100M rows?

    Or am I splitting hairs?

    I know VARCHAR(n) = actual data length + 2 bytes:  https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql
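
    A quick way to see the fixed-width padding vs. actual-length behavior from that doc (the variable names here are just for illustration):

        DECLARE @c CHAR(2)    = 'A';
        DECLARE @v VARCHAR(2) = 'A';

        SELECT DATALENGTH(@c) AS char_bytes,     -- 2: padded out to the declared length
               DATALENGTH(@v) AS varchar_bytes;  -- 1: actual data only
        -- DATALENGTH of a NULL is NULL for both types; the 2-byte length
        -- overhead of VARCHAR is row metadata and isn't reported here.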

    What I'm unsure about is the best approach when most values are NULL.  For much of my data, the length *doesn't* vary much within a column, but it can vary from mostly NULL to mostly NOT NULL across columns.

    Refactoring this 400-column table is a PITA, so if there's a general (time-saving) rule like "use VARCHAR if length < 10", let me know and I'll run with that.

    I'm happy to be educated, so if I've missed an obvious link that explains this, my apologies; please point me in the right direction.

  • There's not going to be a giant performance difference between CHAR(2) and VARCHAR(2) when it comes to NULL values. When the value is NULL, no length is stored for the VARCHAR because there is no length to record. NULL doesn't have a length; it's the absence of a value.

    For me, if you have a value that is absolutely fixed at 2 characters, then CHAR(2) makes more sense. It's always going to be 2, so why record the length? However, if there are values in there with a length of 1 too... it's a dance as to which makes more sense. It's possible that INSERT & UPDATE operations might be a little faster for the fixed-length column. However, if this data doesn't get lots of that type of manipulation, who cares? Further, if there is a VARCHAR(400) right next to the VARCHAR(2) and they're both getting updated, the time savings you might achieve on that CHAR(2) are not worth it for most of us, most of the time.

    Notice the weasel words. In some instances, if you can shave 5ms off a query, that's an enormous savings. It's my experience that most of the time people's queries are running in 2-3 seconds, and if we can get them down to 500ms, it's an enormous win. Very seldom are people running queries in the 25ms range who are desperate to get another 5ms off. However, it absolutely does happen, hence the weasel words.

    In short, the real answer is: it depends on your situation. If you are looking for that last millisecond or three to shave off, then for very small character lengths it's worth testing whether, in your situation, CHAR vs. VARCHAR makes a difference.
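
    A minimal test sketch would be something like this; the table names are made up, and dbo.SourceData stands in for wherever your real data lives:

        CREATE TABLE dbo.TestChar    (id INT IDENTITY PRIMARY KEY, val CHAR(2)    NULL);
        CREATE TABLE dbo.TestVarchar (id INT IDENTITY PRIMARY KEY, val VARCHAR(2) NULL);

        INSERT INTO dbo.TestChar    (val) SELECT val FROM dbo.SourceData;
        INSERT INTO dbo.TestVarchar (val) SELECT val FROM dbo.SourceData;

        -- Compare reads and timings...
        SET STATISTICS IO ON;
        SET STATISTICS TIME ON;
        SELECT COUNT(*) FROM dbo.TestChar    WHERE val = '05';
        SELECT COUNT(*) FROM dbo.TestVarchar WHERE val = '05';
        SET STATISTICS IO OFF;
        SET STATISTICS TIME OFF;

        -- ...and space used by each copy.
        EXEC sys.sp_spaceused N'dbo.TestChar';
        EXEC sys.sp_spaceused N'dbo.TestVarchar';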

    Just don't do VARCHAR(1). That makes me crazy.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I tend to agree with Grant. Ultimately, if you later need to change to a larger value, I don't know that going from CHAR to a larger VARCHAR is any worse than going from VARCHAR to a larger VARCHAR. In either case, it's a schema change, and you'll likely end up rearranging data on pages when you update it.

    I tend to use CHAR() only for values that are fixed in size, like a country code at 3 characters, and even then usually only if the length is 5 or less. Beyond that length, I rarely believe anyone has truly thought through the future data-modeling implications, so I just go ahead and allow for more.
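
    For example (a made-up table, just to show where I'd draw the line):

        CREATE TABLE dbo.Example
        (
            CountryCode CHAR(3)     NULL,  -- genuinely fixed at 3 characters: CHAR
            RegionName  VARCHAR(50) NULL   -- length varies: VARCHAR
        );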

    Here, I wouldn't mess with it. There might be some weird thing in the app that causes an issue with CHAR.

  • Thanks all...

    When I posted about VARCHAR(2) vs. CHAR(2), I meant it to be just an illustrative example.  I was hoping to wrap my head around VARCHAR vs. CHAR in general.

    However, here's what I did:

    * It was easier to just let SAS create the initial table via the ODBC driver.
    * In SSMS, I did a global search-and-replace of all VARCHAR(1) to CHAR(1) and VARCHAR(2) to CHAR(2) (a sketch of the equivalent column changes follows this list).
    * For VARCHAR(3), if it was mostly NULL (0+2), I still changed it to CHAR(3).
    * For the others, I just left them VARCHAR(#).  It wasn't worth my programmer time to get a frequency distribution (via SAS) or SELECT DISTINCT and then review the existing data.
    * My understanding is that CHAR is a little more efficient than VARCHAR, since it avoids the processing overhead of the two bytes that record the data length.
    * However, for wide, varying-length data, VARCHAR gives an I/O benefit due to smaller table size.
    * I was more concerned about performance than table size.
    * Once the table schema was modified, I changed the ETL to truncate-and-append, which preserves the improved schema.
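
    The column changes amounted to something like this (placeholder names, not my real schema):

        ALTER TABLE dbo.MyStagingTable ALTER COLUMN Flag1 CHAR(1) NULL;
        ALTER TABLE dbo.MyStagingTable ALTER COLUMN Flag2 CHAR(2) NULL;

        -- Each ETL run now starts with a truncate, so the reload keeps
        -- the modified schema (the append itself is done from SAS over
        -- ODBC):
        TRUNCATE TABLE dbo.MyStagingTable;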

    I'll reference your replies if I have to undertake this exercise again in the future.  Thanks again.

  • If these columns will potentially be updated at a later point, then stick with CHAR, because updating NULL-able VARCHAR columns can change the row size and cause page splits and fragmentation. However, setting the fill factor to something like 70-80% can mitigate this.
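
    For example, a fill factor can be applied when rebuilding an index; the index and table names here are hypothetical:

        ALTER INDEX PK_MyStagingTable
        ON dbo.MyStagingTable
        REBUILD WITH (FILLFACTOR = 80);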

    Also, if you are concerned about how the column definitions will ultimately impact allocated storage, consider that row- or page-level compression can also mitigate the storage requirements of mostly-NULL columns.
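
    Something along these lines (placeholder names again) estimates the savings and then applies page compression:

        -- Estimate the savings first...
        EXEC sys.sp_estimate_data_compression_savings
             @schema_name      = N'dbo',
             @object_name      = N'MyStagingTable',
             @index_id         = NULL,
             @partition_number = NULL,
             @data_compression = N'PAGE';

        -- ...then apply it if the numbers look worthwhile.
        ALTER TABLE dbo.MyStagingTable
        REBUILD WITH (DATA_COMPRESSION = PAGE);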

    Still, 3-million-row tables are not considered large. The design of OLTP tables matters a lot, but folks don't expect much from the design of staging tables. If these tables get truncated and reloaded by SAS for reporting purposes, then I wouldn't worry about it too much.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
