how to strip out all non alpha numeric characters?

Question

how to strip out all non alpha numeric characters?

BaldingLoopMan

SSCertifiable

Points: 6378
More actions
March 23, 2010 at 8:59 am

#216909

--drop table #addr
create table #addr
( id int identity (1,1)
, address1 varchar(200)
)
insert into #addr
select '532 40th ave. n.e. apt)104'
union all
select '1000 sector way st. apt)111-ac))'
union all
select '3333 summer-point place'
union all
select 'p.o.box 22000a-330042'
union all
select '5566 first stett n. lot.204-a apt)2-c'
--should return only alpha numeric chars.
'53240thaveneapt104'
'1000sectorwaystapt111ac'
'3333summerpointplace'
'pobox22000a330042'
'5566firststettnlot204aapt2c'

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply

Lowell SSC Guru Points: 323755 More actions · Answer 1

i think this will do what you are after; this leaves 0-9,A-Z,a-z and strips out everything else...commas,punctuation, hi ascii, etc.

select dbo.StripNonAlphaNumeric(address1) from #addr

--results

53240thaveneapt104

1000sectorwaystapt111ac

3333summerpointplace

pobox22000a330042

5566firststettnlot204aapt2c

select ascii('a'),ascii('z'),ascii('A'),ascii('Z')

GO

CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))

RETURNS VARCHAR(8000)

BEGIN

DECLARE @CleanedText VARCHAR(8000)

;WITH tally (N) as

(SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)

FROM Master.dbo.SysColumns sc1

CROSS JOIN Master.dbo.SysColumns sc2)

SELECT @CleanedText = ISNULL(@CleanedText,'') +

CASE

--ascii numbers are 48(for '0') thru 57 (for '9')

WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57

THEN SUBSTRING(@OriginalText,Tally.N,1)

--ascii upper case letters A-Z is 65 thru 90

WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90

THEN SUBSTRING(@OriginalText,Tally.N,1)

--ascii lower case letters a-z is 97 thru 122

WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122

THEN SUBSTRING(@OriginalText,Tally.N,1)

ELSE '' END

FROM tally WHERE Tally.N <= LEN(@OriginalText)

RETURN @CleanedText

END

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

BaldingLoopMan SSCertifiable Points: 6378 More actions · Answer 2

That's exactly what i was looking for. Thanks Lowel. I knew there was a way via ascii but hadn't done it before. This ones going in the tool box.

Again thank you!

LutzM SSC Guru Points: 107049 More actions · Answer 3

Maybe you could make the CASE statement of Lowells nice function a little shorter:

CASE

WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'

THEN SUBSTRING(@OriginalText,Tally.N,1)

ELSE '' END

It seems like it'll return the same value.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

matzer SSCrazy Eights Points: 9727 More actions · Answer 4

lmu92 (3/23/2010)
Maybe you could make the CASE statement of Lowells nice function a little shorter:
CASE
WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
It seems like it'll return the same value.

It seems, but it really depends on the collation being used. Check out the following script:

DECLARE @OriginalText VARCHAR(MAX) = '0;:ëêà^AaZz'

SELECT

CASE

WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-Z]'

THEN SUBSTRING(@OriginalText,Tally.N,1)

ELSE '' END,

CASE

WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AS,Tally.N,1)) like '[0-Z]'

THEN SUBSTRING(@OriginalText,Tally.N,1)

ELSE '' END,

CASE

WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AI,Tally.N,1)) like '[0-Z]'

THEN SUBSTRING(@OriginalText,Tally.N,1)

ELSE '' END

FROM

(

SELECT Number N FROM master..spt_values WHERE Type = 'P'

) Tally

where

Tally.N <= LEN(@OriginalText)

If you really want to remove any character but 0-9, a-z, A-Z compare the Ascii-codes. Also note that in a binary collation ':' and ';' fall between [0-Z].

Peter

LutzM SSC Guru Points: 107049 More actions · Answer 5

Thanx for clarification, Peter!

I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement... :blush:

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

matzer SSCrazy Eights Points: 9727 More actions · Answer 6

lmu92 (3/23/2010)
Thanx for clarification, Peter!
I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement... :blush:

Well, you were close. I think this one will do the job too

CASE

WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-9a-zA-Z]'

THEN SUBSTRING(@OriginalText,Tally.N,1)

ELSE '' END,

BTW, where did you get [0-Z] from. I was surprised to see that ':' or ';' do not fall within that range when using a collation like Latin1_General_CI_Ax. I scanned BOL, but I can not find anything about this pattern.

Peter

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 7

You may also want to file away this version, which uses an inline table valued function. It performs significantly faster if you have to clean up a large number of rows in a table with one pass. I modified the function above to get rid of anything except 0-9 and tested it against an existing function that does the same thing.

CREATE FUNCTION dbo.itvf_NumbersOnly

(

@Input Varchar(max)

)

RETURNS TABLE

AS

RETURN

(

WITH data as

(select N,substring(@input,N,1) as element

from dbo.tally

where substring(@input,N,1) between '0' and '9' -- alter to include only characters needed

and N <= len(@input)

)

Select(Select ''+element

FROM data

ORDER BY N

FOR XML PATH('')

) as [NumberString]

)

GO

Also, I should note that CLR is said to be significantly faster at string manipulation, but we don't do CLR here yet, so I can't test. There was a huge thread on the subject here. There may be a variation in performance depending on the size of the strings to be cleaned and the number of characters to be replaced.

---------------

-- code to test

---------------

create table #test (rowID int identity primary key, data varchar(max))

insert into #test

select top 100000 newID()

from dbo.tally

select top 10 * from #test

set statistics io on;

set statistics time on;

select rowid,Data,NumberString

into #test2

from #test

cross apply dbo.itvf_NumbersOnly(data)

set statistics time off;

set statistics io off;

print '----------------------------------------------------------------'

set statistics io on;

set statistics time on;

select rowid,data,dbo.StripNonNumeric(data) as NumberString

into #test3

from #test

set statistics time off;

set statistics io off;

select top 10 * from #test2 order by rowid

select top 10 * from #test3 order by rowid

drop table #test

drop table #test2

drop table #test3

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

matzer SSCrazy Eights Points: 9727 More actions · Answer 8

...substring(@input,N,1) between '0' and '9' ...

Still the same mistake. This will match characters like '³' and '²'.

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 9

I see what you mean, but my point wasn't the selection/omission criteria in the where clause. In fact, I learned something from the earlier posts and made a mental note to be aware of potential differences in collation.

The point is the slow performance of a user-defined function compared to an inline table-valued function. At 100,000 rows I was seeing a difference of seconds versus minutes. Take that for what its worth. 🙂

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

matzer SSCrazy Eights Points: 9727 More actions · Answer 10

The Dixie Flatline (3/23/2010)
I see what you mean, but my point wasn't the selection/omission criteria in the where clause.

But that's what the thread was about. The same issue was raised in a couple of recent threads so I thought I should mention it.

In fact, I learned something from the earlier posts and made a mental note to be aware of potential differences in collation.

I am learning everyday here, both from the answers as the questions. This is a great site.

The point is the slow performance of a user-defined function compared to an inline table-valued function. At 100,000 rows I was seeing a difference of seconds versus minutes. Take that for what its worth. 🙂

Agreed.

LutzM SSC Guru Points: 107049 More actions · Answer 11

@peter-2:

BTW, where did you get [0-Z] from. I was surprised to see that ':' or ';' do not fall within that range when using a collation like Latin1_General_CI_Ax. I scanned BOL, but I can not find anything about this pattern.

Actually, it's "developed" using the old fashioned trial and error method 😉

I used Lowells code and modified it to have a case clause for numbers as well as characters. Then I just tried to make it even more simple, doing the "contertest" with CHAR(..) for some (but not all) numbers Lowells code excluded.

So, there's no "reliable source"...:-)

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

matzer SSCrazy Eights Points: 9727 More actions · Answer 12

Interesting, again I learned something new today. Thanks for the reply.

Peter

Paul White SSC Guru Points: 150468 More actions · Answer 13

Collation is so important when coding a routine that does a large number of comparisons or sorts, and not just because of the potential for errors as shown in this thread. Performance can be absolutely shocking (up to 40 times worse) when using Windows collations, even with non-Unicode data. Binary collations are fastest of all, but the SQL_* collations are not very far behind at all.

See https://connect.microsoft.com/SQLServer/feedback/details/512459/replace-function-extremely-slow-with-non-binary-windows-collation for details and a reproduction script.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 14

Paul, I have to ask just when is Windows collation used?

I would assume that since all the work is being done inside SQL that some default SQL collation sequence would be used, and you say that performance is not at all bad with SQL collation. Doesn't each server, and DB have a default collation setting?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills