Script to find incorrect data in all tables

  • Our clients use MS-SQL 2000 through 2008, and soon will have some using 2012.

    WinXP through Win7, very few on Vista. 32-bit & 64-bit.

    The issue is an old application bug allowed data that was incorrect to be written. For example, NULL where it should not be NULL, and empty strings, '' or strings with a single space ' '. I'm on the support side, so I can't control how the program does its job, but the programmers don't have any capacity to give us a quick solution, so I am researching to see if there is a way to save time finding this weird data.

    This is not an issue of database corruption because "DBCC CHECKDB (OWFIN) WITH NO_INFOMSGS, ALL_ERRORMSGS" returns no errors.

    There are some instances where old data can cause unexpected and incorrect results on some reports and it is time consuming to track down which table(s) has the problem data.

    Other than writing a script customized to each table in the database, is there a quick script that can check all tables in the database for NULL, '', and ' '?

    I have tried googling for a script/how to along those lines and searching this site, but am not coming up with something that will work. Either I am not asking for the correct term, or don't have the right combination of words to find it, if such a thing exists.

    I have built a lot of scripts that look for known data issues caused by old bugs or incorrect solutions to those bugs implemented by former techs. Some of these scenarios are unique enough that no catch-all script will work, but for scenarios where values that should not be written have been written, is there a quick and easy script to find them?

    These tables have constraints for "Not Null", but the bug let NULL be written anyway. Is there an easy way to get a list where a "not Null" column has Null in it?

    Thanks for any links, tips, suggestions, or scripts that will help with this.

    ~ Larry

  • Here's something that might get you started:

    ;WITH NullableColumns AS (

    SELECT TableName=a.name, ColName=b.name

    FROM QA.sys.objects a

    INNER JOIN QA.sys.all_columns b

    ON a.object_id = b.object_id

    WHERE is_nullable = 0 and type = 'U'

    )

    SELECT 'SELECT ' +

    STUFF((

    SELECT ',' + ColName

    FROM NullableColumns b

    WHERE a.TableName = b.TableName

    FOR XML PATH(''))

    ,1, 1, '') +

    ' FROM ' + TableName +

    ' WHERE ' +

    STUFF((

    SELECT ' AND ISNULL(' + ColName + ','''') <> '''''

    FROM NullableColumns b

    WHERE a.TableName = b.TableName

    FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)')

    ,1, 5, '')

    FROM NullableColumns a

    GROUP BY TableName

    This generates SQL statements for every table in the database that you run it in that will select out any columns where NULL is not allowed if the value in a row is NULL, '' or ' '.

    You could put the results into a temp table and iterate through that to execute each of the SQL statements.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • lm_hamilton (10/7/2012)


    Our clients use MS-SQL 2000 through 2008, and soon will have some using 2012.

    WinXP through Win7, very few on Vista. 32-bit & 64-bit.

    The issue is an old application bug allowed data that was incorrect to be written. For example, NULL where it should not be NULL, and empty strings, '' or strings with a single space ' '. I'm on the support side, so I can't control how the program does its job, but the programmers don't have any capacity to give us a quick solution, so I am researching to see if there is a way to save time finding this weird data.

    This is not an issue of database corruption because "DBCC CHECKDB (OWFIN) WITH NO_INFOMSGS, ALL_ERRORMSGS" returns no errors.

    There are some instances where old data can cause unexpected and incorrect results on some reports and it is time consuming to track down which table(s) has the problem data.

    Are you absolutely sure that fixing "some reports" won't break everything else?

    Other than writing a script customized to each table in the database, is there a quick script that can check all tables in the database for NULL, '', and ' '?

    I have tried googling for a script/how to along those lines and searching this site, but am not coming up with something that will work. Either I am not asking for the correct term, or don't have the right combination of words to find it, if such a thing exists.

    I have built a lot of scripts that look for known data issues caused by old bugs or incorrect solutions to those bugs implemented by former techs. Some of these scenarios are unique enough that no catch-all script will work, but for scenarios where values that should not be written have been written, is there a quick and easy script to find them?

    These tables have constraints for "Not Null", but the bug let NULL be written anyway. Is there an easy way to get a list where a "not Null" column has Null in it?

    That's unlikely. If a column has a not null constraint, it won't allow nulls.

    Thanks for any links, tips, suggestions, or scripts that will help with this.

    ~ Larry

    A shotgun approach like this puts shivers down my spine. You have a legacy app which uses NULL, empty string and a single space, all of which mean different things and on which the app and the database probably rely for data integrity and functionality, and you want to change all of these values to support a few reports. I'd recommend you perform a great deal more testing before proceeding with this. Find out exactly why the reports are giving unexpected results - track down the tables and columns used by them and assess the impact on other processes of the changes which you would make to those tables and columns. If you're going to make any changes then do it in a test environment.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (10/8/2012)


    That's unlikely. If a column has a not null constraint, it won't allow nulls.

    A shotgun approach like this puts shivers down my spine. You have a legacy app which uses NULL, empty string and a single space, all of which mean different things and on which the app and the database probably rely for data integrity and functionality, and you want to change all of these values to support a few reports. I'd recommend you perform a great deal more testing before proceeding with this. Find out exactly why the reports are giving unexpected results - track down the tables and columns used by them and assess the impact on other processes of the changes which you would make to those tables and columns. If you're going to make any changes then do it in a test environment.

    And you can't ALTER a column to NOT NULL if it contains NULL values already.

    My take on the last part is that it's more likely sloppiness on the part of the developers that let NULL, '' and ' ' into the column. Maybe they mean different things but probably they don't.

    But I agree with the need to test whatever you end up doing.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • The "shotgun approach" is to more easily __find__ these data issues.

    They have to be resolved one at a time.

    Some of them are flat out bad data and the whole row can be deleted. However, one must look to see for sure what needs to be done.

    Like I said these are old bugs that should have not done this in the first place and the program has been fixed to stop writing the bad data, I just need a way to find it, since something that happened years ago is causing a report on the current year to be wrong.

    These are situations that are invisible to the user, or a way to fix them is unavailable to them.

    I know that development does not have the time to give us reports or anything else to handle this. This may only be a handful of clients, but having one more tool in my toolbox to identify bad data without having to code a SQL statement for each table will be extremely helpful. This script does exactly that, THANKS!!!

    Unfortunately, our clients don't always pay attention and catch these things when they first occurred and 6 or 7 years later now it is a problem. This points out that if one client has this, then others might have this, so I am just trying to be prepared.

    Thanks for the script, that will go a long way to speed up finding issues to evaluate for fixing.

    Also, thanks for the concern to make sure I am not clueless when it comes to fixing data. I don't know how the programming language that is used allows things like this to happen, but an old version of the programming language would let it write data when that meant duplicate keys. When we upgraded to a newer version of the programming language, we discovered that we had been exploiting a bug as a feature for the life of the program to that point. That made it a fun job in support to explain that one to clients. So while the database __should__ stop such behavior, there is something about this programming language that did things like this. Thankfully, I think that is in the past. However, there are still bits of data that need cleanup.

    Since these are data issues from old bugs, I know that development won't put a priority on doing anything on their end, so it falls to support to get clients past things like this. Rather than complain that what should be is not, I choose to be pro-active and fix the issue, make the client happy and move on to the next issue.

    I am thankful that this product has a documented data model with great technician documents and I am very familiar with it.

    If it was another product I support, I would not be asking for help here. Its data model looks like it does when people first get their hands on MS-Access, "Oooh, I built a data base!" Data repeated in every table instead of relying on the power of relational databases to track it. That nightmare of a database is being migrated from Advantage to MS-SQL without fixing the issues with the data model. There is no documented data model, unless development won't share, and no technician documentation. Every new support tech to this product has to figure it out as they go. Now that is something to complain about. If I didn't need to sleep I would figure out what table changes when I do something on each screen so that anyone could support it. So with problems on that product, I make its programmers answer the questions since I don't have time to figure it out in a time frame that is useful to the client.

    Well, enough venting, it doesn't help since management has to give development approval to document the data model, or better yet, do it right, nothing I can do will change that.

    Thanks again!

    ~ Larry

  • Here's a shotgun. It has the potential to bring your server to it's knees, so don't even think about running this anywhere near a production server.

    DECLARE @sql AS VarChar(5000)

    SET @sql = ''

    SELECT @sql = @sql +

    'SELECT * FROM [' + t.TABLE_SCHEMA + '].[' + t.TABLE_NAME +'] where cast(['

    + c.COLUMN_NAME + '] as nvarchar(4000)) = '''' '

    FROM INFORMATION_SCHEMA.TABLES t, INFORMATION_SCHEMA.COLUMNS c

    where t.table_name = c.table_name

    and is_nullable = 'no'

    and collation_name is not null

    EXEC (@SQL)

  • lm_hamilton (10/8/2012)


    Thanks for the script, that will go a long way to speed up finding issues to evaluate for fixing.

    Also, thanks for the concern to make sure I am not clueless when it comes to fixing data. I don't know how the programming language that is used allows things like this to happen, but an old version of the programming language would let it write data when that meant duplicate keys. When we upgraded to a newer version of the programming language, we discovered that we had been exploiting a bug as a feature for the life of the program to that point. That made it a fun job in support to explain that one to clients. So while the database __should__ stop such behavior, there is something about this programming language that did things like this. Thankfully, I think that is in the past. However, there are still bits of data that need cleanup.

    ~ Larry

    You're welcome for the script. It was relatively close to something I had at hand so I was happy to modify it to suit your purposes.

    I disagree with your statement "I don't know how the programming language that is used allows things like this to happen" in that you seem to be blaming te programming language when we all know that computers (and programming languages) only do what they are told to do. The fault of this lies entirely with the developers that misuse it, especially that part about the code relying on an inherent bug in the language. None of these would be easy to spot during a development process without some kind of automated tools.

    I'm working on developing a set of those (this post gave me an idea for one) and it is from there that I was able to extract the script that I supplied you.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply