Remove non alpha chars from a column

Question

Remove non alpha chars from a column

Viewing 13 posts - 16 through 28 (of 28 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 1

CELKO (9/30/2014)
I use nested REPLACE() functions. Some day we might have the ANSI/ISO Standard TRANSLATE()

Hi Joe,

I agree that using nested replace functions are the berries for performance especially when replacing multiple adjacent spaces with just one. Would that be practical in this case? You'd only be keeping 52 out of 256 characters (ASCII characters) and that's if it's not Unicode. I haven't actually tested it but would seem that the 254 nested replaces required would be a bit of a performance problem even for nested replaces.

Does your alpha-only function have that many replaces or is there another trick to it?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

bgalway Ten Centuries Points: 1158 More actions · Answer 2

Not sure if this query extract is helpful but it helped me in a recent situation where I was trying to deal with non-alpha characters:

WHERE ([Bill Freq] IS NOT NULL) AND ([Bill Freq] NOT LIKE '^[A-Za-z0-9]+$') AND ([Bill Freq] NOT LIKE '')

ScottPletcher SSC Guru Points: 101224 More actions · Answer 3

The original version I wrote of this was in 2006 for Experts Exchange, and it was to strip nonnumeric chars (although of course the basic idea is the same):

http://www.experts-exchange.com/Database/MS-SQL-Server/Q_21957163.html

Of course unfortunately you'll now need an EE membership -- or free trial -- to see it, grrrr.

Edit: And of course it's certainly possible someone else wrote similar code before that. If they did, I'm unaware of it.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 4

bgalway (9/30/2014)
Not sure if this query extract is helpful but it helped me in a recent situation where I was trying to deal with non-alpha characters:
WHERE ([Bill Freq] IS NOT NULL) AND ([Bill Freq] NOT LIKE '^[A-Za-z0-9]+$') AND ([Bill Freq] NOT LIKE '')

The idea isn't to find things... it's to fix things by changing the actual data. Also, your test full null is likely not necessary because, unless you've changed the server defaults or have the appropriate SET statement somewhere in your code, NULLs will naturally be excluded because they cannot be compared even to each other.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Alan Burstein SSC Guru Points: 61152 More actions · Answer 5

Some day we might have the ANSI/ISO Standard TRANSLATE()

In the meantime you can use the one that I created.

I am working on an article I plan to submit to SSC and was reluctant to share this without a little more testing but, perhaps, this would be a good opportunity to get some feedback/suggestions.

One of them uses this simple getnums function:

CREATE FUNCTION dbo.getnums(@rows int)

RETURNS TABLE WITH SCHEMABINDING AS

RETURN

(

WITH

L1 AS (SELECT n=n FROM (values (1),(1)) t(n)),

L2 AS (SELECT n=1 FROM L1 a CROSS APPLY L1 b),

L3 AS (SELECT n=1 FROM L2 a CROSS APPLY L2 b),

L4 AS (SELECT n=1 FROM L3 a CROSS APPLY L3 b),

L5 AS (SELECT n=1 FROM L4 a CROSS APPLY L4 b),

iTally AS

(

SELECT n = row_number() over (order by (select null))

FROM L5 a CROSS APPLY L5 b

)

SELECT TOP (@rows) n

FROM itally

ORDER BY n

);

GO

Here are the first three versions (yes, we all know how bad loops are, I'll explain why I created one with a loop in a moment...)

;-- (1) dbo.loopTranslate (scalar using a loop)

CREATE FUNCTION dbo.loopTranslate

(@string varchar(8000),

@replace varchar(100),

@with varchar(100)

)

/*

-- Use

DECLARE @string varchar(100) = '##!!xyxxyyzz!!##';

SELECT [@string]= @string, newstring = dbo.loopTranslate(@string,'xyz#!','abc')

*/

RETURNS varchar(8000)

AS

BEGIN

DECLARE @i int = 1;

WHILE @i <= len(@replace)

BEGIN

SELECT@string = replace(@string,substring(@replace,@i,1),substring(@with,@i,1)),

@i = @i+1

END

RETURN @string

END;

GO

-- (2) dbo.svfTranslate (scalar using a tally table)

CREATE FUNCTION dbo.svfTranslate

(@string varchar(8000),

@replace varchar(100),

@with varchar(100)

)

/*

-- Use

DECLARE @string varchar(100) = '##!!xyxxyyzz!!##';

SELECT [@string]= @string, newstring = dbo.svfTranslate(@string,'xyz#!','abc')

*/

RETURNS varchar(8000) AS

BEGIN

SELECT @string = replace(@string,substring(@replace,n,1),substring(@with,n,1))

FROM dbo.getnums(len(@replace));

RETURN @string;

END;

GO

-- (3) itvfTranslate (inline tvf using recursive cte)

CREATE FUNCTION dbo.itvfTranslate

(@string varchar(8000),

@replace varchar(100),

@with varchar(100)

)

/*

-- Use

DECLARE @string varchar(100) = '##!!xyxxyyzz!!##';

SELECT [@string]= @string, newstring = x

FROM dbo.itvfTranslate(@string,'xyz#!','abc')

*/

RETURNS table AS

RETURN

(

WITH a AS

(

SELECT n=1, x=replace(@string,substring(@replace,1,1),substring(@with,1,1))

UNION ALL

SELECT n+1, x=replace(x,substring(@replace,n+1,1),substring(@with,n+1,1))

FROM a

WHERE n < len(@replace)

)

SELECT x

FROM a

WHERE n = len(@replace)

);

GO

--Note: I also did a couple variations using the "Quirky Update" method but they performed badly...

...and here's one using a scalar recursive function (note the commented section for examples of use):

-- (4) rTranslate (recursive scalar)

CREATE FUNCTION dbo.rTranslate

(@s varchar(1000)(8000),

@r varchar(32),

@w varchar(32),

@i int=1

)

/*

Created by: Alan Burstein

Created on: 9/2/2014

Usage Examples:

-- (1) basic replace and remove (both functions)

-- remove $ and #, Replace a with A, c with C, and b with x...

DECLARE @string varchar(20)='###$$$aaabbbccc$$$###',

@pre varchar(5)='acb#$',

@post varchar(5)='ACx';

SELECToriginal = @string,

Translated = dbo.rTranslate(@string,@pre,@post,1);

GO

-- (2) format phone numbers

-- (a) format phone (atomic value)

DECLARE @string varchar(20)='(425)555-1212',

@pre varchar(5)=')(', @post varchar(5)='-';

SELECToriginal = @string,

Translated = dbo.rTranslate(@string,@pre,@post,1);

-- (b) format phone numbers(from table)

WITH phoneNbrs(n,pn) AS

(SELECT 1, '(425)555-1212' UNION ALL

SELECT 2, '425.555.1212' UNION ALL

SELECT 3, '425_555_1212' UNION ALL

SELECT 4, '(425)555.1212' )

SELECT n, pn AS before, [after] = dbo.rTranslate(pn,x,y,1)

FROM phoneNbrs

CROSS APPLY (VALUES('.)_('+char(32),'---')) t(x,y);

-- (c) hide phone numbers

WITH phoneNbrs(n,pn) AS

(SELECT 1, '(425) 555-1212' UNION ALL

SELECT 2, '425.555.1212' UNION ALL

SELECT 3, '425-555-1212' UNION ALL

SELECT 4, '4255551212' )

SELECT n, pn AS before, [after] = dbo.rTranslate(pn,x,y,1)

FROM phoneNbrs

CROSS APPLY (VALUES('()-.0123456789','()-.**********')) t(x,y)

GO

-- (3) Replace accent characters with normal characters,

DECLARE @string varchar(100)='Thë Quìck Greeñ Fox jumpëd over thë lázy dogs back!',

@special1 varchar(32)='áâãäæèïéìëíîçåñòóôöõàøúüûùýÁÃÄ',

@special2 varchar(32)='ÆÈÏÉÌËÍÎÅÑÒÓÔÖÕÀØÚÜÛÙÝ!',

@normal1 varchar(32)='aaaaaeieieiicanoooooaouuuuyAAAAA',

@normal2 varchar(32)='EIEIEIIANOOOOOAOUUUUY.';

SELECT@string AS original,

newstring = dbo.rTranslate(dbo.rTranslate(@string,@special1,@normal1,1),@special2,@normal2,1);

-- note how I deal with the recursion limit, note the query plan

GO

-- (4) using each to find a pattern in a set of strings...

-- looking for the string/pattern combination that will produce ABCABC

DECLARE @strings TABLE(string varchar(20) not null);

DECLARE @patterns TABLE(p_id int identity primary key,

pattern varchar(10) not null,

p_replace varchar(10) not null);

DECLARE @SearchFor varchar(10)='ABCABC';

INSERT @strings VALUES('123abcABC1'),('222defACC'),('123ABCmno');

INSERT @patterns VALUES('123','E'),('mno','XYZ'),('abc123','ABC'),('ABC','XYZ');

WITH pattern_search AS

(

SELECT string, pattern, p_replace, newstring = dbo.rTranslate(string,pattern,p_replace,1)

FROM @strings s

CROSS APPLY @patterns p

)

SELECT *, [matched] = case newstring when @SearchFor then 'yes' else 'no' end

FROM pattern_search;

*/

RETURNS varchar(8000)

AS

BEGIN

RETURN

(CASE

WHEN @i <= len(@r)

THEN dbo.rTranslate(replace(@s,substring(@r,@i,1),substring(@w,@i,1)),@r,@w,@i+1)

ELSE @s

END

)

END

GO

The downside to this guy, of course, is the 32 char limit but this can be circumvented using nesting; note example #3 in my code

dbo.rTranslate(dbo.rTranslate(@string,@special1,@normal1,1),@special2,@normal2,1);...

The test harness I've been using:

IF OBJECT_ID('tempdb..#vals') IS NOT NULL DROP TABLE #vals;

SELECT val = replicate(cast(newid() as char(36)),200)

INTO #vals

FROM dbo.getnums(5000);

GO

--SELECT * FROM #vals

SET NOCOUNT ON;

dbcc freeproccache

dbcc dropcleanbuffers

DECLARE @r varchar(20)='ABCDEF190',

@w varchar(20)='00000000';

DECLARE @x varchar(8000);

SET STATISTICS TIME ON;

--SET STATISTICS IO ON;

PRINT char(13)+'svf loop version:'

SELECT @x = dbo.loopTranslate(val, @r, @w)

FROM #vals;

PRINT char(13)+'svf tally version:'

SELECT @x = dbo.svfTranslate(val, @r, @w)

FROM #vals;

PRINT char(13)+'itvf recursive cte version:'

SELECT @x = x

FROM #vals

CROSS APPLY dbo.itvfTranslate(val, @r, @w);

PRINT char(13)+'recursive version:'

SELECT @x = dbo.rTranslate(val, @r, @w, 1)

FROM #vals;

-- add a replace() function for testing

SET STATISTICS TIME OFF;

--SET STATISTICS IO OFF;

GO

and the results:

svf loop version:

SQL Server Execution Times:

CPU time = 4742 ms, elapsed time = 4792 ms.

svf tally version:

SQL Server Execution Times:

CPU time = 4665 ms, elapsed time = 4735 ms.

itvf recursive cte version:

SQL Server Execution Times:

CPU time = 5460 ms, elapsed time = 5561 ms.

recursive version:

SQL Server Execution Times:

CPU time = 812 ms 4665, elapsed time = 871 ms 4680.

I included the loop version to show how badly the other other 2 performed. I have tested the recursive version with various string lengths and row counts and the results are the same: ~~the recursive scalar version is 5-10 times faster. The code is simple as is the query plan:~~ never mind, they're all basically the same - the recursive scalar version however is the cleanest as is the query plan.

Edit: I realized that the recursive version was taking an input @string input of varchar(1000) vs. varchar(8000) for the other three. Nonetheless, here's a few examples of how to create a translate() function in SQL Server 2000 through SQL Server 2014. For anyone interested.

Final edit: just read through the thread and saw Jeff's comment...

"Note that we cannot use "SET STATISTICS" to measure performance here because there's a scalar function involved. "SET STATISTICS" greatly skews the performance results when scalar functions are present making the scalar function look a whole lot worse than it actually is. Please see the following article for more on that little testing nuance..."

I'll need to re-do my test.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 6

Ah... my apologies. I had to withdraw my post that was in this spot. At first, it looked like your CROSS APPLY "GetNums" was twice as fast as the traditional CROSS JOIN method. Much to my embarrassment, I found that I had phat-phingered the number "2000000" instead of "1000000" on the traditional method.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

DonlSimpson SSCertifiable Points: 7043 More actions · Answer 7

Maybe I'm missing something here, but this is simple and fast:

declare @t varchar(5000) = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'

declare @s varchar(5000) = ''

select @s = @s + substring(@t, i, 1)

from dbo.fnTalleyTable(len(isnull(@t, ''))) tt

where substring(@t, i, 1) like '[a-z]'

select @s

Don

Don Simpson

I'm not sure about Heisenberg.

Dwain Camps SSC Guru Points: 86978 More actions · Answer 8

DonlSimpson (10/2/2014)
Maybe I'm missing something here, but this is simple and fast:
declare @t varchar(5000) = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'
declare @s varchar(5000) = ''
select @s = @s + substring(@t, i, 1)
from dbo.fnTalleyTable(len(isnull(@t, ''))) tt
where substring(@t, i, 1) like '[a-z]'
select @s
Don

Not recommended.

Multi-Row Variable Assignment and ORDER BY

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Luis Cazares SSC Guru Points: 183706 More actions · Answer 9

The problem with your code, Don, is that it only handles scalar values.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 10

dwain.c (10/2/2014)
DonlSimpson (10/2/2014)
Maybe I'm missing something here, but this is simple and fast:
declare @t varchar(5000) = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'
declare @s varchar(5000) = ''
select @s = @s + substring(@t, i, 1)
from dbo.fnTalleyTable(len(isnull(@t, ''))) tt
where substring(@t, i, 1) like '[a-z]'
select @s
Don
Not recommended.
Multi-Row Variable Assignment and ORDER BY

Hmmm... Itzik used concatenation of a column of data in his example went haywire. Does anyone have an example of this type of code going crazy?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

DonlSimpson SSCertifiable Points: 7043 More actions · Answer 11

All right then, how about replacing the explicit string concatenation with "for xml path"?

declare @originalText varchar(7500), @includeCharacters varchar(20), @cleanText varchar(7500)

select @originalText = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'

, @includeCharacters = '[a-z]', @cleanText = ''

select @cleanText =

(

select substring(@originalText, i, 1)

from dbo.fnTalleyTable(len(isnull(@originalText, ''))) tt

where substring(@originalText, i, 1) like @includeCharacters

for xml path ('')

)

select @cleanText

Don

Don Simpson

I'm not sure about Heisenberg.

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 12

Heh... here I go ago... necro'ing the same wicked old post.

@DonlSimpson ... Try the following and see why I prefer not to use XML for this type of thing.

declare @originalText varchar(7500), @includeCharacters varchar(20), @cleanText varchar(7500)

select @originalText = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." <<&whoopsi&>> This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'
, @includeCharacters = '[a-z<>&]', @cleanText = ''

select @cleanText =
   (
   select substring(@originalText, i, 1)
   from dbo.fnTalleyTable(len(isnull(@originalText, ''))) tt
   where substring(@originalText, i, 1) like @includeCharacters
      for xml path ('')
   )

select @cleanText

Of course, it the code were for alpha-numeric only, that problem wouldn't show up.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 13

Shifting gears a bit,.. Lordy, I miss Dwain. :unsure:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)