March 23, 2010 at 8:59 am
--drop table #addr
create table #addr
( id int identity (1,1)
, address1 varchar(200)
)
insert into #addr
select '532 40th ave. n.e. apt)104'
union all
select '1000 sector way st. apt)111-ac))'
union all
select '3333 summer-point place'
union all
select 'p.o.box 22000a-330042'
union all
select '5566 first stett n. lot.204-a apt)2-c'
--should return only alpha numeric chars.
'53240thaveneapt104'
'1000sectorwaystapt111ac'
'3333summerpointplace'
'pobox22000a330042'
'5566firststettnlot204aapt2c'
March 23, 2010 at 10:03 am
i think this will do what you are after; this leaves 0-9,A-Z,a-z and strips out everything else...commas,punctuation, hi ascii, etc.
select dbo.StripNonAlphaNumeric(address1) from #addr
--results
53240thaveneapt104
1000sectorwaystapt111ac
3333summerpointplace
pobox22000a330042
5566firststettnlot204aapt2c
select ascii('a'),ascii('z'),ascii('A'),ascii('Z')
GO
CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000)
;WITH tally (N) as
(SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
--ascii numbers are 48(for '0') thru 57 (for '9')
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii upper case letters A-Z is 65 thru 90
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90
THEN SUBSTRING(@OriginalText,Tally.N,1)
--ascii lower case letters a-z is 97 thru 122
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
FROM tally WHERE Tally.N <= LEN(@OriginalText)
RETURN @CleanedText
END
Lowell
March 23, 2010 at 11:02 am
That's exactly what i was looking for. Thanks Lowel. I knew there was a way via ascii but hadn't done it before. This ones going in the tool box.
Again thank you!
March 23, 2010 at 12:52 pm
Maybe you could make the CASE statement of Lowells nice function a little shorter:
CASE
WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
It seems like it'll return the same value.
March 23, 2010 at 1:37 pm
lmu92 (3/23/2010)
Maybe you could make the CASE statement of Lowells nice function a little shorter:
CASE
WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
It seems like it'll return the same value.
It seems, but it really depends on the collation being used. Check out the following script:
DECLARE @OriginalText VARCHAR(MAX) = '0;:ëêà^AaZz'
SELECT
CASE
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END,
CASE
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AS,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END,
CASE
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AI,Tally.N,1)) like '[0-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END
FROM
(
SELECT Number N FROM master..spt_values WHERE Type = 'P'
) Tally
where
Tally.N <= LEN(@OriginalText)
If you really want to remove any character but 0-9, a-z, A-Z compare the Ascii-codes. Also note that in a binary collation ':' and ';' fall between [0-Z].
Peter
March 23, 2010 at 1:46 pm
Thanx for clarification, Peter!
I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement... :blush:
March 23, 2010 at 2:18 pm
lmu92 (3/23/2010)
Thanx for clarification, Peter!I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement... :blush:
Well, you were close. I think this one will do the job too
CASE
WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-9a-zA-Z]'
THEN SUBSTRING(@OriginalText,Tally.N,1)
ELSE '' END,
BTW, where did you get [0-Z] from. I was surprised to see that ':' or ';' do not fall within that range when using a collation like Latin1_General_CI_Ax. I scanned BOL, but I can not find anything about this pattern.
Peter
March 23, 2010 at 2:59 pm
You may also want to file away this version, which uses an inline table valued function. It performs significantly faster if you have to clean up a large number of rows in a table with one pass. I modified the function above to get rid of anything except 0-9 and tested it against an existing function that does the same thing.
CREATE FUNCTION dbo.itvf_NumbersOnly
(
@Input Varchar(max)
)
RETURNS TABLE
AS
RETURN
(
WITH data as
(select N,substring(@input,N,1) as element
from dbo.tally
where substring(@input,N,1) between '0' and '9' -- alter to include only characters needed
and N <= len(@input)
)
Select(Select ''+element
FROM data
ORDER BY N
FOR XML PATH('')
) as [NumberString]
)
GO
Also, I should note that CLR is said to be significantly faster at string manipulation, but we don't do CLR here yet, so I can't test. There was a huge thread on the subject here. There may be a variation in performance depending on the size of the strings to be cleaned and the number of characters to be replaced.
---------------
-- code to test
---------------
create table #test (rowID int identity primary key, data varchar(max))
insert into #test
select top 100000 newID()
from dbo.tally
select top 10 * from #test
set statistics io on;
set statistics time on;
select rowid,Data,NumberString
into #test2
from #test
cross apply dbo.itvf_NumbersOnly(data)
set statistics time off;
set statistics io off;
print '----------------------------------------------------------------'
set statistics io on;
set statistics time on;
select rowid,data,dbo.StripNonNumeric(data) as NumberString
into #test3
from #test
set statistics time off;
set statistics io off;
select top 10 * from #test2 order by rowid
select top 10 * from #test3 order by rowid
drop table #test
drop table #test2
drop table #test3
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 23, 2010 at 3:22 pm
...substring(@input,N,1) between '0' and '9' ...
Still the same mistake. This will match characters like '³' and '²'.
March 23, 2010 at 4:53 pm
I see what you mean, but my point wasn't the selection/omission criteria in the where clause. In fact, I learned something from the earlier posts and made a mental note to be aware of potential differences in collation.
The point is the slow performance of a user-defined function compared to an inline table-valued function. At 100,000 rows I was seeing a difference of seconds versus minutes. Take that for what its worth. 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 23, 2010 at 5:16 pm
The Dixie Flatline (3/23/2010)
I see what you mean, but my point wasn't the selection/omission criteria in the where clause.
But that's what the thread was about. The same issue was raised in a couple of recent threads so I thought I should mention it.
In fact, I learned something from the earlier posts and made a mental note to be aware of potential differences in collation.
I am learning everyday here, both from the answers as the questions. This is a great site.
The point is the slow performance of a user-defined function compared to an inline table-valued function. At 100,000 rows I was seeing a difference of seconds versus minutes. Take that for what its worth. 🙂
Agreed.
March 23, 2010 at 5:38 pm
BTW, where did you get [0-Z] from. I was surprised to see that ':' or ';' do not fall within that range when using a collation like Latin1_General_CI_Ax. I scanned BOL, but I can not find anything about this pattern.
Actually, it's "developed" using the old fashioned trial and error method 😉
I used Lowells code and modified it to have a case clause for numbers as well as characters. Then I just tried to make it even more simple, doing the "contertest" with CHAR(..) for some (but not all) numbers Lowells code excluded.
So, there's no "reliable source"...:-)
March 23, 2010 at 5:45 pm
Interesting, again I learned something new today. Thanks for the reply.
Peter
March 24, 2010 at 5:17 am
Collation is so important when coding a routine that does a large number of comparisons or sorts, and not just because of the potential for errors as shown in this thread. Performance can be absolutely shocking (up to 40 times worse) when using Windows collations, even with non-Unicode data. Binary collations are fastest of all, but the SQL_* collations are not very far behind at all.
See https://connect.microsoft.com/SQLServer/feedback/details/512459/replace-function-extremely-slow-with-non-binary-windows-collation for details and a reproduction script.
March 24, 2010 at 6:06 am
Paul, I have to ask just when is Windows collation used?
I would assume that since all the work is being done inside SQL that some default SQL collation sequence would be used, and you say that performance is not at all bad with SQL collation. Doesn't each server, and DB have a default collation setting?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 15 posts - 1 through 15 (of 16 total)
You must be logged in to reply to this topic. Login to reply