VARCHAR vs. CHAR if the column has mostly NULLs?

  • Sorry if this is a FAQ.  I Googled and searched the forum before posting, but couldn't find an answer, specifically regarding a large number of NULLs.

    My table has ~ 3M records.  I have a column that SAS has automatically created as VARCHAR(2) (it creates everything as VARCHAR - I'm not sure if that's a function of SAS or the ODBC driver).

    The vast majority of the values in this column are NULL.  Here is a frequency distribution:

    MYCOLUMN     COUNT   PERCENT
    (NULL)     2417871   .
    01            3025   12.52951166
    02             288    1.1928923497
    03            2094    8.6733214596
    04             191    0.7911195792
    05            3512   14.546659487
    06             603    2.4976183573
    07             484    2.0047218656
    08             211    0.8739593257
    09              41    0.1698214803
    10             294    1.2177442737
    11             654    2.7088597109
    12             235    0.9733670215
    13            1022    4.2331110467
    14              76    0.3147910367
    15            6563   27.183862817
    16              12    0.0497038479
    17             131    0.5426003396
    18              16    0.0662717972
    19            3004   12.442529926
    20            1662    6.883982935
    99              25    0.1035496831

    (PERCENT is calculated over the 24,143 non-NULL rows.)
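
    For reference, a rough T-SQL equivalent of that SAS frequency distribution (dbo.MyTable is a stand-in for my real table name) would be something like:

        -- PERCENT is computed over non-NULL rows only, to match the
        -- SAS output above.
        SELECT MYCOLUMN,
               COUNT(*) AS [COUNT],
               CASE WHEN MYCOLUMN IS NOT NULL
                    THEN 100.0 * COUNT(*)
                         / SUM(CASE WHEN MYCOLUMN IS NOT NULL
                                    THEN COUNT(*) END) OVER ()
               END AS [PERCENT]
        FROM dbo.MyTable
        GROUP BY MYCOLUMN
        ORDER BY MYCOLUMN;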

    So, if I have 2.4M NULL values, and the rest are 2 characters, is CHAR(2) or VARCHAR(2) more efficient?  What if the table is, say, 100M rows?

    Or am I splitting hairs?

    I know VARCHAR(n) = actual data length + 2 bytes:  https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql
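
    A quick way to see the fixed-width padding vs. actual-length behavior from that doc (the variable names here are just for illustration):

        DECLARE @c CHAR(2)    = 'A';
        DECLARE @v VARCHAR(2) = 'A';

        SELECT DATALENGTH(@c) AS char_bytes,     -- 2: padded out to the declared length
               DATALENGTH(@v) AS varchar_bytes;  -- 1: actual data only
        -- DATALENGTH of a NULL is NULL for both types; the 2-byte length
        -- overhead of VARCHAR is row metadata and isn't reported here.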

    What I'm unsure about is the best approach when most values are NULL.  For much of my data, the length *doesn't* vary much within a column, but it can vary from mostly NULL to mostly NOT NULL across columns.

    Refactoring this 400-column table is a PITA, so if there's a general (time-saving) rule like "use VARCHAR if length < 10", let me know and I'll run with that.

    I'm happy to be educated, so if I've missed an obvious link that explains this, my apologies; please point me in the right direction.

  • There's not going to be a giant performance difference between CHAR(2) and VARCHAR(2) when it comes to NULL values. When the value is NULL, no length is stored for the VARCHAR because there is no length to record. NULL doesn't have a length; it's the absence of a value.

    For me, if you have a value that is absolutely fixed at 2 characters, then CHAR(2) makes more sense. It's always going to be 2, so why record the length? However, if there are values in there with a length of 1 too... it's a dance as to which makes more sense. It's possible that INSERT & UPDATE operations might be a little faster for the fixed-length column. However, if this data doesn't get lots of that type of manipulation, who cares? Further, if there is a VARCHAR(400) right next to the VARCHAR(2) and they're both getting updated, the time savings you might achieve on that CHAR(2) are not worth it for most of us, most of the time.

    Notice the weasel words. In some instances, if you can shave 5ms off a query, that's an enormous savings. It's my experience that most of the time people's queries are running in 2-3 seconds, and if we can get them down to 500ms, it's an enormous win. Very seldom are people running queries in the 25ms range who are desperate to get another 5ms off. However, it absolutely does happen, hence the weasel words.

    In short, the real answer is: it depends on your situation. If you are looking for that last millisecond or three to shave off, then for very small character lengths it's worth testing whether, in your situation, CHAR vs. VARCHAR makes a difference.
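
    A minimal test sketch would be something like this; the table names are made up, and dbo.SourceData stands in for wherever your real data lives:

        CREATE TABLE dbo.TestChar    (id INT IDENTITY PRIMARY KEY, val CHAR(2)    NULL);
        CREATE TABLE dbo.TestVarchar (id INT IDENTITY PRIMARY KEY, val VARCHAR(2) NULL);

        INSERT INTO dbo.TestChar    (val) SELECT val FROM dbo.SourceData;
        INSERT INTO dbo.TestVarchar (val) SELECT val FROM dbo.SourceData;

        -- Compare reads and timings...
        SET STATISTICS IO ON;
        SET STATISTICS TIME ON;
        SELECT COUNT(*) FROM dbo.TestChar    WHERE val = '05';
        SELECT COUNT(*) FROM dbo.TestVarchar WHERE val = '05';
        SET STATISTICS IO OFF;
        SET STATISTICS TIME OFF;

        -- ...and space used by each copy.
        EXEC sys.sp_spaceused N'dbo.TestChar';
        EXEC sys.sp_spaceused N'dbo.TestVarchar';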

    Just don't do VARCHAR(1). That makes me crazy.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I tend to agree with Grant. Ultimately, if you later need to change to a larger value, I don't know that going from CHAR to a larger VARCHAR is any worse than going from VARCHAR to a larger VARCHAR. In either case, it's a schema change, and you'll likely end up rearranging data on pages when you update it.

    I tend to use CHAR() only for values that are fixed in size, like a country code at 3 characters, and even then usually only if the length is 5 or less. Beyond that length, I rarely believe anyone has truly thought through the future data-modeling implications, so I just go ahead and allow for more.
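
    For example (a made-up table, just to show where I'd draw the line):

        CREATE TABLE dbo.Example
        (
            CountryCode CHAR(3)     NULL,  -- genuinely fixed at 3 characters: CHAR
            RegionName  VARCHAR(50) NULL   -- length varies: VARCHAR
        );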

    Here, I wouldn't mess with it. There might be some weird thing in the app that causes an issue with CHAR.

  • Thanks all...

    When I posted about VARCHAR(2) vs. CHAR(2), I meant it to be just an illustrative example.  I was hoping to wrap my head around VARCHAR vs. CHAR in general.

    However, here's what I did:

    * It was easier to just let SAS create the initial table via the ODBC driver.
    * In SSMS, I did a global search-and-replace of all VARCHAR(1) to CHAR(1) and VARCHAR(2) to CHAR(2) (a sketch of the equivalent column changes follows this list).
    * For VARCHAR(3), if it was mostly NULL (0+2), I still changed it to CHAR(3).
    * For the others, I just left them VARCHAR(#).  It wasn't worth my programmer time to get a frequency distribution (via SAS) or SELECT DISTINCT and then review the existing data.
    * My understanding is that CHAR is a little more efficient than VARCHAR, since it avoids the processing overhead of the two bytes that record the data length.
    * However, for wide, varying-length data, VARCHAR gives an I/O benefit due to smaller table size.
    * I was more concerned about performance than table size.
    * Once the table schema was modified, I changed the ETL to truncate-and-append, which preserves the improved schema.
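
    The column changes amounted to something like this (placeholder names, not my real schema):

        ALTER TABLE dbo.MyStagingTable ALTER COLUMN Flag1 CHAR(1) NULL;
        ALTER TABLE dbo.MyStagingTable ALTER COLUMN Flag2 CHAR(2) NULL;

        -- Each ETL run now starts with a truncate, so the reload keeps
        -- the modified schema (the append itself is done from SAS over
        -- ODBC):
        TRUNCATE TABLE dbo.MyStagingTable;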

    I'll reference your replies if I have to undertake this exercise again in the future.  Thanks again.

  • If these columns will potentially be updated at a later point, then stick with CHAR, because updating NULL-able VARCHAR columns can change the row size and cause page splits and fragmentation. However, setting the fill factor to something like 70-80% can mitigate this.
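
    For example, a fill factor can be applied when rebuilding an index; the index and table names here are hypothetical:

        ALTER INDEX PK_MyStagingTable
        ON dbo.MyStagingTable
        REBUILD WITH (FILLFACTOR = 80);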

    Also, if you are concerned about how the column definitions will ultimately impact allocated storage, consider that row- or page-level compression can also mitigate the storage requirements of mostly-NULL columns.
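
    Something along these lines (placeholder names again) estimates the savings and then applies page compression:

        -- Estimate the savings first...
        EXEC sys.sp_estimate_data_compression_savings
             @schema_name      = N'dbo',
             @object_name      = N'MyStagingTable',
             @index_id         = NULL,
             @partition_number = NULL,
             @data_compression = N'PAGE';

        -- ...then apply it if the numbers look worthwhile.
        ALTER TABLE dbo.MyStagingTable
        REBUILD WITH (DATA_COMPRESSION = PAGE);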

    Still, 3-million-row tables are not considered large. The design of OLTP tables matters a lot, but folks don't expect much from the design of staging tables. If these tables get truncated and reloaded by SAS for reporting purposes, then I wouldn't worry about it too much.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
