November 19, 2008 at 2:24 pm
Hi,
I've searched everywhere and can't find any info on this topic. Sorry for the long post.
Some background:
We have an application that accepts text data files from our customers.
In production this data is loaded to input tables via DTS scripts that are automatically generated by our application. The columns in these tables are of all data types, but most often are int, numeric, varchar or money. Once the data is stored in the input tables I have the ability to check the data for referential integrity and other potential problems prior to accepting the data into the production tables.
But, during the implementation phase of our engagement I need to validate the content and format of their data files so that I know it will be usable by our system. In other words I need to make sure that they're not sending us text data in a column that ultimately will be integer, etc.
I've recently created a data validation process that loads this raw data to validation tables. These tables are automatically created by the system and all columns are defined as Varchar(255) so that the column will accept all data no matter how bad it is (and believe me it can be bad!). 🙂
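For illustration, one of these generated validation tables might look roughly like this (the table and column names are just made-up examples, not the actual generated names):

CREATE TABLE Validation_CustomerOrders (
    OrderID     varchar(255) NULL,
    OrderDate   varchar(255) NULL,
    OrderAmount varchar(255) NULL
)
-- every column is varchar(255) so the raw text loads without any conversion errors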
Once the data is loaded to these validation tables (one for each of the customer's text files) I custom compile a stored procedure that evaluates the data in each column for pre-defined rules (such as no text data in an integer column) based upon the format of the target column in the production tables.
This all works very well.
Now I'm attempting to validate the length of the customer's raw data. For varchar target columns that's easy: I can check syscolumns.length for the target column and know that if the length is 50 then none of the data in the validation table's column can exceed 50 characters.
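For example, something along these lines (the table and column names are made up):

-- look up the defined length of the target varchar column
SELECT c.length
FROM syscolumns c
JOIN sysobjects o ON o.id = c.id
WHERE o.name = 'ProductionTable'
  AND c.name = 'CustomerName'

-- then flag any raw values in the validation table that are too long
SELECT CustomerName
FROM Validation_ProductionTable
WHERE LEN(CustomerName) > 50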
My problem comes in when looking at target columns that are int, numeric, date or money. syscolumns.length presents the number of bytes, but that's not the number of characters.
Here's my question: Does anyone know of a way to translate the column length in bytes into the number of characters for that data type? For example, I'd like to know that an int can be at most x characters long, and so on. I know that I could hardcode the min and max values for each data type (per those specified in BOL), but that seems kludgy.
Does anyone have a better idea how I could do this efficiently? Thanks for any direction you can give me.
We're running SQL Server 2000 EE on Win2K3 Enterprise.
November 19, 2008 at 8:55 pm
Quite a useless exercise.
The values 5000000000 and 2000000000 have the same number of characters, but one of them is a valid int and the other is not.
It's even more complicated with dates. The same date presented in different formats will have a different number of characters. And again, the right number of characters is no guarantee that the date is valid. 2222-22-22 is absolutely fine in terms of length, but it's not a valid date.
So such a check can't really validate the data; it can only tell you the value is not TERRIBLY wrong, and what's the use of that if it doesn't guarantee the value is right?
_____________
Code for TallyGenerator
November 19, 2008 at 8:58 pm
Chuck Hardy (11/19/2008)
I've recently created a data validation process that loads this raw data to validation tables. These tables are automatically created by the system and all columns are defined as Varchar(255) so that the column will accept all data no matter how bad it is (and believe me it can be bad!). 🙂
Your validation process will fail with my files, which contain Product Description values up to 1,500 characters long.
😉
_____________
Code for TallyGenerator
November 20, 2008 at 4:33 am
Hi Sergiy,
Thanks for your thoughts.
You're absolutely right about the length of integer values that would pass or fail based on a length check. I can check the length of the 1,500-character string you mention because I know the format and size of the data I'm expecting, so I can validate against that. (The 255 length of the input tables I mentioned is longer than any data I would expect.)
I guess I'm looking for a shortcut.
With integers I guess I could filter out any data shorter than 10 characters and then check the rest for outliers that fall outside the range -2,147,483,648 to 2,147,483,647. I'm already checking columns using ISNUMERIC, but there isn't a corresponding ISINT function.
You're right about dates; however, I'm already checking date columns with ISDATE, so the length problem is not an issue.
Thanks,
Chuck
November 20, 2008 at 5:21 am
Chuck Hardy (11/20/2008)
With integers I guess I could filter out any data shorter than 10 characters and then check the rest for outliers that fall outside the range -2,147,483,648 to 2,147,483,647. I'm already checking columns using ISNUMERIC, but there isn't a corresponding ISINT function.
To compare with the max int value you need to convert to int first.
If it's not an int, the conversion will fail and return an error before you can do the validation.
And there is a kind of ISINT function:
NOT LIKE '%[^0-9]%'.
In combination with a length limitation it could be sufficient.
But it would reject negative int values, which must be taken into consideration.
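Something like this (the table and column names are just examples):

SELECT RawValue
FROM ValidationTable
WHERE RawValue LIKE '%[^0-9]%'                          -- contains a non-digit character (so it also rejects a minus sign)
   OR LEN(RawValue) > 10                                -- more than 10 digits can never fit in int
   OR (LEN(RawValue) = 10 AND RawValue > '2147483647')  -- 10-digit strings above max int (string compare works for equal-length digit strings)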
You're right about dates; however, I'm already checking date columns with ISDATE, so the length problem is not an issue.
ISDATE has the opposite problem.
22/11/2008 is a perfectly valid date in the country where I live, but some of our SQL Servers, which have the US locale installed by default, will return ISDATE = 0 for this string.
In our environment, where we process files generated by software made in the USA, Germany, the UK, Australia, etc., ISDATE has proven to be quite useless.
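A quick way to see the locale dependence (using SET DATEFORMAT to mimic the different defaults):

SET DATEFORMAT mdy    -- US-style default
SELECT ISDATE('22/11/2008')    -- returns 0
SET DATEFORMAT dmy
SELECT ISDATE('22/11/2008')    -- returns 1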
_____________
Code for TallyGenerator
November 20, 2008 at 7:32 am
Hi Sergiy,
Hey, thanks for the NOT LIKE suggestion - I wouldn't have thought of that! In our specific case I shouldn't ever get a negative integer value so that could work.
What I was doing with the integer values was evaluating the column as NUMERIC (which will permit values larger than INT allows) and then doing a min/max check. To find non-integer values I flag records where
(Convert(Int, (Convert(Numeric(15,6), COLUMN_NAME) * 1000000)) % 1000000) > 0
as problems.
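Put together against one of the validation tables it looks roughly like this (the names are made up, this assumes the column has already passed the ISNUMERIC check in an earlier pass, and the inner Convert(Int, ...) can still overflow for values above roughly 2,147 once multiplied by 1,000,000):

SELECT OrderAmount
FROM Validation_CustomerOrders
WHERE (Convert(Int, (Convert(Numeric(15,6), OrderAmount) * 1000000)) % 1000000) > 0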
The issue you mention with ISDATE is an interesting one. While all our customers are in the US, that doesn't mean we won't have one outside the US one day. I'm not sure how to get around that at the moment.
I really appreciate your help! 😀
Chuck