February 26, 2019 at 7:38 pm
I have a data profiling type of question:
I'm interested in a query to determine what percentage of a varchar column contains valid numeric values.
SQL Server 2012.
i.e. the varchar column may hold a value like 123.4, a single blank, or a value like 'NA' ..
From a data profiling perspective, this seems like it would be a fairly common requirement.
If anyone uses anything similar and is willing to share?
February 27, 2019 at 1:40 am
Personally, I don't put numbers in varchar if I want them to be numbers. I put numbers in numbers columns. That way, no confusion. However, what you're looking for is this function: ISNUMERIC. You should be able to pretty easily write a query that gets a count of the table and a count of the table where the VARCHAR columns return true from that function. Compare the two to arrive at a percentage.
Piece of advice, dates go in date data types, numbers go in numbers, strings in strings. Don't mix them up for any reason or you pay a very high cost later.
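A minimal sketch of that approach (dbo.SomeTable and SomeColumn are placeholder names, not from the original post):

```sql
-- Compare a filtered count to the total count to get a percentage.
-- COUNT(CASE ...) ignores the NULLs produced for non-matching rows;
-- NULLIF guards against a divide-by-zero on an empty table.
SELECT 100.0 * COUNT(CASE WHEN ISNUMERIC(st.SomeColumn) = 1 THEN 1 END)
       / NULLIF(COUNT(*), 0) AS NumericPercentage
FROM dbo.SomeTable AS st;
```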
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
February 27, 2019 at 7:31 am
Grant Fritchey - Wednesday, February 27, 2019 1:40 AM
Personally, I don't put numbers in varchar if I want them to be numbers. I put numbers in numbers columns. That way, no confusion. However, what you're looking for is this function: ISNUMERIC. You should be able to pretty easily write a query that gets a count of the table and a count of the table where the VARCHAR columns return true from that function. Compare the two to arrive at a percentage.
Piece of advice, dates go in date data types, numbers go in numbers, strings in strings. Don't mix them up for any reason or you pay a very high cost later.
Careful now... things like 8E2, a single comma, multiple commas (for example, ',,,,' will return 1 from ISNUMERIC()), and a raft of other stuff are considered valid numerics because of potential conversions to float (scientific notation), the Money data type, and more.
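A few quick probes illustrate the point (a sketch; these results match ISNUMERIC's documented behavior, but they're worth re-checking on your own instance):

```sql
-- Values ISNUMERIC() accepts that most people would not call numbers:
SELECT ISNUMERIC('8E2')   AS Sci,      -- 1: valid float (scientific) notation
       ISNUMERIC(',')     AS Comma,    -- 1: convertible to money
       ISNUMERIC('$')     AS Currency, -- 1: a lone currency symbol
       ISNUMERIC('.')     AS Dot,      -- 1: convertible to money
       ISNUMERIC('123.4') AS Plain;    -- 1: the value actually wanted
```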
I DO, however, agree that things like "NA" or "N/A" are formatting junk for reports, and someone has made the serious mistake of combining the presentation layer with the data layer if they've done such a thing. That really needs to be fixed, because it's causing a world of hurt right now and will only get worse as time wears on and people and code become more and more dependent on this mistake being present. It's also killing performance and wasting disk space and memory; if it's a big table, then it's wasting a lot. The indexes are probably taking a beating from page splits due to "ExpAnsive" updates, which cause unnecessary fragmentation, reduced page density (wasting memory and disk space again), and a shedload of totally unnecessary log file activity that further impacts performance.
People will also say they already have too many dependencies on the data to fix it now. Ladies and Gentlemen, THAT's the best reason to fix it, because THAT means you have those same problems all over the place!
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2019 at 7:36 am
allan.ford17 - Tuesday, February 26, 2019 7:38 PM
I have a data profiling type of question:
I'm interested in a query to determine what percentage of a varchar column contains valid numeric values.
SQL Server 2012.
i.e. the varchar column may hold a value like 123.4, a single blank, or a value like 'NA' ..
From a data profiling perspective, this seems like it would be a fairly common requirement.
If anyone uses anything similar and is willing to share?
What, in particular, do you consider to be "numeric"??? You have the obvious thing of "123.4" but is "1" valid (something with no decimal point)? Is a single period valid? Is a number like ".4" valid or should it be "0.4"?
As for a "data profiling perspective", see my note in response to what Grant wrote above. Your "data profile" is broken and it really needs to be fixed instead of wasting your time with trying to separate the wheat from the chaff. The ONLY exception to this "rule" is if it's data that you're receiving from a third party and you're trying to "profile the data" to fix it in the manner I identified in that previous response.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2019 at 10:41 am
That's one of the reasons SQL allows NULL "values" in columns (NULL is technically not a value, it's the absence of a value; for day-to-day purposes, you can think of it as "unknown").
So, define the column as numeric and store non-numeric values as NULL. You can add a separate flag column for that column if you want to be able to distinguish blank from 'NA', etc..
For example:
value decimal(9, 3) NULL
value_flag tinyint NOT NULL /*0=value is numeric; 1=value was blank; 2=value was 'NA'; etc.*/
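For example, loading that design from a varchar staging column might look like the following (a sketch only; dbo.Staging, SomeColumn, and dbo.Measurement are hypothetical names, and TRY_CAST requires SQL Server 2012 or later):

```sql
-- Target table using a real numeric type plus a flag column.
CREATE TABLE dbo.Measurement
(
    value      decimal(9, 3) NULL,
    value_flag tinyint NOT NULL  -- 0 = numeric; 1 = blank; 2 = 'NA'
);

-- TRY_CAST yields NULL for anything that can't convert,
-- while the CASE preserves why the source value wasn't numeric.
INSERT INTO dbo.Measurement (value, value_flag)
SELECT TRY_CAST(s.SomeColumn AS decimal(9, 3)),
       CASE WHEN LTRIM(RTRIM(s.SomeColumn)) = '' THEN 1
            WHEN s.SomeColumn = 'NA'             THEN 2
            ELSE 0
       END
FROM dbo.Staging AS s;
```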
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 27, 2019 at 1:17 pm
Here's an example of how to get the percentage of data that is not numeric. It can also help with the conversion to proper numeric columns that others have suggested, which I fully support.
--Create a table with 1000 rows of random numeric strings as sample data
SELECT TOP (1000)
       CAST(RAND(CHECKSUM(NEWID())) * 1000 AS varchar(25)) AS SomeColumn
INTO #SampleTable
FROM sys.all_columns x
CROSS JOIN sys.all_columns y;
--Convert some of the numeric values into non-numeric ones
UPDATE #SampleTable
SET SomeColumn = 'NA'
WHERE SomeColumn LIKE '1%';
--TRY_CAST returns NULL for any value that isn't valid for the conversion,
--and COUNT ignores NULLs. The rest is a basic percentage formula.
SELECT COUNT(TRY_CAST(SomeColumn AS float)) * 100. / COUNT(*) AS NumericPercentage
FROM #SampleTable;
GO
DROP TABLE #SampleTable;
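One nuance worth flagging with the TRY_CAST approach (a small follow-on sketch): the target type effectively defines what counts as "numeric". float happily accepts scientific notation, while decimal does not:

```sql
-- The same string, two different definitions of "numeric":
SELECT TRY_CAST('8E2' AS float)          AS AsFloat,   -- 800
       TRY_CAST('8E2' AS decimal(9, 3)) AS AsDecimal; -- NULL: decimal rejects scientific notation
```

So if values like 8E2 should be treated as junk rather than numbers, casting to decimal is the stricter test.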
February 27, 2019 at 2:23 pm
Sometimes you just have to deal with entity-attribute-value tables that were created over a decade ago with dozens of filtered indexes... 🙁
I've used something similar to the following in the past as a general wag. Note that the pattern counts values containing no digits at all, so what it really returns is the non-numeric percentage:
SELECT CAST(COUNT(CASE WHEN st.SomeColumn NOT LIKE '%[0-9]%' THEN 1. END)
            / CAST(COUNT(*) AS float) * 100.00 AS decimal(10, 3)) AS NonNumericPercentage
FROM #SampleTable AS st;
February 27, 2019 at 3:12 pm
Lots of excellent advice and answers .. thank you all ..
Appreciate the warnings from Jeff about the ISNUMERIC function.
For what I wanted .. rough data profiling .. the ISNUMERIC function should be good enough ..
I used:
SELECT CAST(COUNT(CASE WHEN ISNUMERIC(st.[Some Column Name]) = 1 THEN 1. END)
            / CAST(COUNT(*) AS float) * 100.00 AS decimal(10, 3)) AS NumericPercentage
FROM SomeTable AS st;
Yes- I have opportunity to redesign table (not my design) and will carefully choose data types !
The source of the data is a CSV and the CSV landing table has all columns as varchar(50) ..
February 27, 2019 at 4:25 pm
allan.ford17 - Wednesday, February 27, 2019 3:11 PM
Lots of excellent advice and answers .. thank you all ..
Appreciate the warnings from Jeff about the ISNUMERIC function.
For what I wanted .. rough data profiling .. the ISNUMERIC function should be good enough ..
I used:
SELECT CAST(COUNT(CASE WHEN ISNUMERIC(st.[Some Column Name]) = 1 THEN 1. END)
            / CAST(COUNT(*) AS float) * 100.00 AS decimal(10, 3)) AS NumericPercentage
FROM SomeTable AS st;
Yes - I have the opportunity to redesign the table (not my design) and will carefully choose data types!
The source of the data is a CSV, and the CSV landing table has all columns as varchar(50) ..
Heh.. THAT is the exception I was talking about: a landing/staging table for input preprocessing. You have the right idea in making the changes to the final destination table. Thanks for the feedback.
--Jeff Moden
Change is inevitable... Change for the better is not.