November 25, 2003 at 6:39 am
I have to data sent to me in a social security number field. I have to fix the data. I've used replcee to remove the dashes and change the ' ' to '' however users sometime enter alpha characters. The db field takes the characters because its varchar. I do not want to reject a record. So I'm cleaning up the data. If they add a character to a specific spot it would be easy but it can be any of the numbers. Is there a way to search for and remove alpha characters?
November 25, 2003 at 7:02 am
If you're using SQL Server 2000, this is easy with a User Defined Function (UDF), e.g.:
CREATE FUNCTION dbo.f_RemoveChars(@Input varchar(1000))
RETURNS varchar(1000) AS
BEGIN
DECLARE @pos smallint
SET @Pos = PATINDEX('%[^0-9]%',@Input)
WHILE @Pos > 0 BEGIN
SET @Input = STUFF(@Input,@pos,1,'')
SET @Pos = PATINDEX('%[^0-9]%',@Input)
END
RETURN @Input END
--Jonathan
Edited by - Jonathan on 11/25/2003 07:02:36 AM
--Jonathan
November 25, 2003 at 12:10 pm
Wow, that was great. Worked awesome
November 26, 2003 at 8:51 am
Can someone point me to a good reference describing (dumbed down) bitwise operators such as ^ and giving examples of their use?
thanks
k2
November 26, 2003 at 9:03 am
Don't know if you've seen it, but SQL Books online has good sections on each of the bitwise operators AND (&), NOT (~), OR (|) and exclusive OR (^), which are technical but give examples by way of explanation.
Also, in the example code given by Jonathan, the ^ operator is used not as a bitwise operator but as a wildcard character telling PATINDEX to exclude characters in the range 0 - 9. Just in case you were confused by that particular usage!
If you're looking for further articles on bitwise operators, it might be worth starting a new thread.
Cheers,
mia
Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.
Cheers,
mia
Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.
November 26, 2003 at 5:48 pm
You can also check with ISNUMERIC inside a loop (loop through the characters in the string). If not numeric, then replace with the charaters you want.
Another method is checking the ASCII value of the characters. The ASCII table you can easily get from Internet.
Regards,
kokyan
November 27, 2003 at 2:04 pm
For long strings with more than a small percentage of characters that must be stripped, the overhead of the PATINDEX and STUFF functions will degrade performance.
For these cases, a substantial improvement can be had simply by looping through every character of the string and concatenating the desired characters onto a new string.
An additional, albeit smaller, gain can be made by replacing the explicit loop with an auxiliary numbers table and an SQL set-based (select) statement as shown below.
-- One-time numbers table creation and initialization
create table Num(n int)
declare @i int
set @i = 0
while @i <= 12
begin
insert Num values(@i)
set @i = @i PLUS 1
end
--[where PLUS represents the plus sign]
-- Using the table to strip non-digit characters from a social security number:
declare @ss1 varchar(15)
declare @ss2 varchar(15)
set @ss1 = '111-22x3333'
set @ss2 = ''
select @ss2 = @ss2 PLUS case when substring(@ss1, n, 1) between '0' and '9' then substring(@ss1, n, 1) else '' end
from Num
where n <= len(@ss1)
-- [where PLUS represents the plus sign]
select 'Original SSN' = @ss1, 'Clean SSN' = @ss2
Note that an auxiliary numbers table can be effectively used in a variety of situations.
December 5, 2003 at 10:05 am
quote:
For long strings with more than a small percentage of characters that must be stripped, the overhead of the PATINDEX and STUFF functions will degrade performance.For these cases, a substantial improvement can be had simply by looping through every character of the string and concatenating the desired characters onto a new string.
Did you test this? I did, and you're wrong.
--Jonathan
--Jonathan
December 8, 2003 at 9:25 am
Yes, and here are my test cases. Note that if you reverse the ordering of the second case the results are reversed - revealing the obvious ... that character distribution matters. Bottom line - test algorithms against representative data samples.
declare @s1 varchar(500), @s2 varchar(500),
@d datetime,
@pos int, @i int
declare @loop int, @max_loops int,
@tl1 int, @tl2 int,
@ts1 int, @ts2 int
select@tl1 = 0, @tl2 = 0,
@ts1 = 0, @ts2 = 0,
@max_loops = 1000
set @loop = 0
-- Average each test over @max_loops runs
while @loop < @max_loops
begin
--
-- Test 1: Repeated Address data of length 480 (Digits comprise 35% of string)
--
-- Strip non-digits using concatenation
select@s1 = replicate('123 Drury Lane Apt10 Alba, CA 83952-4830', 12),
@d = getdate(),
@i = 1,
@s2 = ''
while @i <= len(@s1)
begin
if substring(@s1, @i, 1) between '0' and '9'
begin
set @s2 = @s2 PLUS substring(@s1, @i, 1)
end
set @i = @i PLUS 1
end
select @tl1 = @tl1 PLUS datediff(ms, @d, getdate())
-- Strip non-digits using STUFF/PATINDEX
select@s1 = replicate('123 Drury Lane Apt10 Alba, CA 83952-4830', 12),
@d = getdate(),
@Pos = PATINDEX('%[^0-9]%',@s1)
WHILE @Pos > 0
BEGIN
SET @s1 = STUFF(@s1,@pos,1,'')
SET @Pos = PATINDEX('%[^0-9]%',@s1)
END
select @ts1 = @ts1 + datediff(ms, @d, getdate())
if @s1 <> @s2 print 'Test 1 Error'
--
-- Test 2: 240 digits followed by 240 non-digits (Digits comprise 50% of string)
--
-- Strip non-digits using concatenation
set @s1 = replicate('1', 240) PLUS replicate('a', 240)
select @d = getdate()
set @i = 1
set @s2 = ''
while @i <= len(@s1)
begin
if substring(@s1, @i, 1) between '0' and '9'
begin
set @s2 = @s2 PLUS substring(@s1, @i, 1)
end
set @i = @i PLUS 1
end
select @tl2 = @tl2 PLUS datediff(ms, @d, getdate())
-- Strip non-digits using STUFF/PATINDEX
set @s1 = replicate('1', 240) + replicate('a', 240)
select @d = getdate()
SET @Pos = PATINDEX('%[^0-9]%',@s1)
WHILE @Pos > 0
BEGIN
SET @s1 = STUFF(@s1,@pos,1,'')
SET @Pos = PATINDEX('%[^0-9]%',@s1)
END
select @ts2 = @ts2 PLUS datediff(ms, @d, getdate())
if @s1 <> @s2 print 'Test 2 Error'
set @loop = @loop PLUS 1
end
select @tl1 / @max_loops as test1_concat,
@ts1/@max_loops as test1_stuff,
@tl2 / @max_loops as test2_concat,
@ts2 / @max_loops as test2_stuff
test1_concat test1_stuff test2_concat test2_stuff
------------ ----------- ------------ -----------
7 12 8 21
December 8, 2003 at 12:15 pm
quote:
Yes, and here are my test cases.
Interesting; I stand corrected with data like this. Thank you. I tested on an actual table where there weren't as many embedded characters and so didn't reach your qualification percentage.
I found that it's PATINDEX, not STUFF, that is relatively slow. It may do something akin to the way your code evaluates each character, except once each loop rather than once each character of the string. As STUFF is not the culprit, perhaps something like this would be better in cases like you tested:
CREATE FUNCTION dbo.f_RemoveChars(@Input varchar(1000))
RETURNS varchar(1000) AS BEGIN
DECLARE @Pos smallint
SET @Pos = 1
WHILE @Pos <= LEN(@Input) BEGIN
IF SUBSTRING(@Input, @Pos, 1) BETWEEN '0' AND '9'
SET @Input = STUFF(@Input, @Pos, 1, '')
SET @Pos = @Pos + 1
END
RETURN @Input END
--Jonathan
--Jonathan
December 8, 2003 at 12:42 pm
Jonathan,
I agree about using 'STUFF' - in fact, when I went back to my code to copy it into my message, I realized that I had used 'STUFF', rather than concatenatation, but re-wrote it to confirm my original claim.
June 11, 2006 at 3:10 pm
Sorry I'm so late on this one (2 1/2 years aint too bad )... just ran across it...
ISNUMERIC is not a good thing to check for "All Digits" because it allows periods, +/- signs, dollar signs, currency type idicators, spaces, hard spaces, tabs, line feeds, and a wad of other things.
Use something like the following to find all values in a column that have some other than 0-9 in it...
SELECT * FROM sometable WHERE somecolumn NOT LIKE '%[^0-9]%'
You'll be surprised at how fast the double "NOT" works (^ in the rational expression stands for "NOT")
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply