Null vs ''

  • Working through an ETL project at the moment and started to question some of my previous design thoughts.

    I have never previously concerned myself with the data and performance as all has been reasonable but with future projects looking to be larger volumes of data I have a few concerns.

    Currently I receieve a lot of low grade data so currently working on address fields. 7 Address fields and a post code field. Up to now I have just shipped out the data I have been fed. So if it is a blank space or two balnk space in the field I just send that.

    If you were receiving this data and comparing it to look for matches would you prefer:

    1. to receive it as it is

    or

    2. all data items to be NULL if the data item is just blank space(s)

    or

    3. Something else

    Please also explain why you would prefer one option more thabn another.

    My own thought is to send a NULL if it is blank as that is easy to capture, BUT to counter that when joining a NULL field with another that can give a NULL so that can cause issues as well.

    Many thanks.

    M

    😎

  • I would prefer to have NULLS rather than '' as it is obvious that the value is blank.

  • If you have the option to clean the data in the database itself, rather than just the extracts, I'd recommend you do that. Use NULL where the value is unknown, and blank where it doesn't exist. Then export the data as it is, so that it can easily be imported into a different system.

    John

  • I Import the data to a staging table that is an exact coipy of the original data (warts and all). It then goes through a transformation into output tables ready to go.

    I do limited cleaning of data but havebeen thinking a lot about these fields as I know they will be used a lot for comparisons for matching so the better they are the easier it should be.

    Is there a better way than just using a

    case when field ='' then NULL

    else field

    END as field

    ?

    Thanks

    M

  • case when field ='' then NULL

    else field

    END as field

    NULLIF(YourColumn, ' ')

    ==========================================================================================================================
    A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe

  • It might be more efficient to do the update in your staging or output table instead of as part of the transformation. You'll need to test to find out which way is best.

    Aside from that, my advice is not to do this. NULL means unknown, so to use it to represent a known absence of a value is something of an abuse.

    John

  • Thank you 🙂

    I knew there must be a straightforward way

    M

  • John Mitchell-245523 (1/3/2013)


    It might be more efficient to do the update in your staging or output table instead of as part of the transformation. You'll need to test to find out which way is best.

    Aside from that, my advice is not to do this. NULL means unknown, so to use it to represent a known absence of a value is something of an abuse.

    John

    I always use the staging as a copy of the original and never ever do anything to the data. That is a concept I hold dear to me. This way I always know exactly what I received from the source data.

    In this instance I have discussed address fields so pretty much I am safe that there is not a road called '' - so safe to change to NULL. Would not consider changing the data if there was any meaning to the contents. I am also interested in doing the same with peoples names as '' has no meaning in first, middle or last name. Other data fields I am not to woriied about as a space may have some bizarre meaning and is not used for comparisons and checks.

    Thanks

    Mark

  • Mark

    Fair enough about the staging table - it was just a suggestion. Might still be worth considering on the output table, though.

    If I may say so, the thing about second meanings isn't the point. A NULL should represent an unknown value and nothing else. Take the example about names. If you have MiddleName is blank, that means that person doesn't have a middle name. If you have MiddleName is NULL, that means you don't know what that person's middle name is, or whether they have one. See the difference? If you turn all blanks into NULLs, you lose the ability to make that distinction.

    John

  • With hindsight I should have wirded the question differently. I know the recipients of this data will treat '' and NULL in the same way. Other ETL projects where the data is used for reports I would leave 'as is' but this one I am pondering the change.

    Cheers

    M

  • If you're providing data for report or any type of BI application, remove NULLs entirely. NULLs cause serious and severe logic considerations that most people don't fully understand (including even some "DBAs").

    If ANY user, even a power user, needs to write queries against the data, NULLs should especially be avoided and will cause you far more headaches than they are worth.

    Replace any NULLs either with blank or 'Unknown' or equivalent that is clear to the end-users and that they support.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (1/3/2013)


    If you're providing data for report or any type of BI application, remove NULLs entirely. NULLs cause serious and severe logic considerations that most people don't fully understand (including even some "DBAs").

    If ANY user, even a power user, needs to write queries against the data, NULLs should especially be avoided and will cause you far more headaches than they are worth.

    Replace any NULLs either with blank or 'Unknown' or equivalent that is clear to the end-users and that they support.

    i will second Scott here.

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

  • I think I said somewhere before - it is not for reporting / BI. Where I have done data warehouses before I have left the data as is.

    Although I would counter that and say that there will always be a case for checking data and reviewing if there is a cleansing / modifiying requirement. Although generally I would retain the original column in the table and add a new column with the modified data. This retaining the opriginal 'as is' and also the cleansed value.

    The data in this case is used for comparisons and number crunching. Therefore on some fields there could be an advantage if the data is prepped first. Was just wondering if going to NULL was the sensible choice.

    Cheers

    M

  • To NULL or not to NULL, that is the question... 😛

    Whatever you choose, I'd do something like this:

    --Get rid of any leading, trailing, or single spaces then if blank make it NULL

    SELECT NULLIF(LTRIM(RTRIM(foo)),'') FROM bar

    --Or make all NULLs blank

    SELECT ISNULL(foo,'') FROM bar

    --Or make all NULLs some default value (in this case zero)

    SELECT ISNULL(foo,0) FROM bar

    --Or a combination of all three

    SELECT ISNULL(NULLIF(LTRIM(RTRIM(foo)),''),0) FROM bar

    I agree generally with the concept that NULL means "unknown" and blank means blank. Even better is to have a design in which NULL is not an option and there is always some default that is not null (or blank). In the real world, especially when importing and scrubbing data, the purity of db design and development sometimes gets lost in the weeds.

    I have to deal often with an application that CANNOT pass NULLs at all. So for this particular app, all of my input parameters--even numeric values, xml, and uniqueidentifiers--have to be char/varchar because the application passes '' (blank) rather than null. You can imagine the validation headaches that causes!

    Another possible reason to use blank values might if they are being concatenated with other values. If any of the values are NULL then the result of the concatenation will also be NULL. Thus any NULLs would have to be converted to a string value anyway. It all depends on where you want to do the work--fixing up the raw data or just doing what needs to be done in the query procedure at run time.

     

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply