Removing non-alphabetic characters from a column

Question

Post reply

Removing non-alphabetic characters from a column

sj999

SSC Veteran

Points: 257
More actions
September 4, 2012 at 4:17 am

#261031

I have a single column table C
column is fname varchar(255)
I want to remove dirty data ie anything that is
not alphabetic char.
I cannot do this with a replace as I dont know what the
character is in advance.
I need to do this with looping ideally using common table expressions
Any idea?

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Ant-Green SSC Guru Points: 113509 More actions · Answer 1

You could use a function, built with PATINDEX

Create Function [dbo].[RemoveAlphaCharacters](@Temp VarChar(1000))

Returns VarChar(1000)

AS

Begin

While PatIndex('%[a-z]%', @Temp) > 0

Set @Temp = Stuff(@Temp, PatIndex('%[a-z]%', @Temp), 1, '')

Return @Temp

End

Using value sjhrh1049jsjf303

select dbo.removealphacharacters ('sjhrh1049jsjf303')

Returns

1049303

sj999 SSC Veteran Points: 257 More actions · Answer 2

How do I do this for all the values in my column fname?

Ant-Green SSC Guru Points: 113509 More actions · Answer 3

pass in your column name instead of the string

select dbo.removealphacharacters (fname) from SomeTableWhichContainsAColumnCalledFname

Jeff Moden SSC Guru Points: 1004432 More actions · Answer 4

sj999 (9/4/2012)
I need to do this with looping ideally using common table expressions
Any idea?

Yes... you don't need typical looping for this. You need a high speed psuedo-cursor in the form of a CROSS JOIN with a Tally Table or a cteTally. This will also allow an all-in-one query that will allow you to build the function as an iTVF or "Inline Table Value Function" for a real boost in performance.

First, see the article at the following link for what a Tally Table is and how it can be used to replace certain types of loops like this one.

http://www.sqlservercentral.com/articles/T-SQL/62867/

Next, build a permanent Tally Table. Like this...

--===================================================================

-- Create a Tally table from 1 to 11000

--===================================================================

--===== Create and populate the Tally table on the fly.

SELECT TOP 11000

IDENTITY(INT,1,1) AS N

INTO dbo.Tally

FROM Master.sys.ALL_Columns ac1

CROSS JOIN Master.sys.ALL_Columns ac2

;

--===== Add a CLUSTERED Primary Key to maximize performance

ALTER TABLE dbo.Tally

ADD CONSTRAINT PK_Tally_N

PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

;

--===== Allow the general public to use it

GRANT SELECT ON dbo.Tally TO PUBLIC

;

GO

Then, create the following general purpose string cleaner (Limited to VARCHAR(8000))...

CREATE FUNCTION dbo.CleanString

(@pString VARCHAR(8000),@pKeepPattern VARCHAR(8000))

RETURNS TABLE

AS

RETURN

SELECT CleanedString =

(

SELECT SUBSTRING(@pString, t.N, 1)

FROM dbo.Tally t

WHERE t.n <= LEN(@pString)

AND SUBSTRING(@pString, t.N, 1) LIKE @pKeepPattern COLLATE Latin1_General_BIN

ORDER BY t.n

FOR XML PATH('')

)

;

After that, everything gets easy...

SELECT yt.OriginalString,c.CleanedString

FROM dbo.YourTable yt

CROSS APPLY dbo.CleanString(yt.OriginalString,'[^A-Za-z]')

;

That code will keep everything except the letters of the alphabet.

If you want to keep only the numeric digets of 0 through 9, the call would be similar using a different LIKE filter.

SELECT yt.OriginalString,c.CleanedString

FROM dbo.YourTable yt

CROSS APPLY dbo.CleanString(yt.OriginalString,'[0-9]')

;

If you prefer to not use a Tally Table (it's really good for rapid and simple development), you can build the high speed no-reads equivalent of a Tally Table using a rendition of Itzek Ben-Gans cCTEs (Cascading CTEs) in the function like this. It's a bit more complicated but it produces virtually zero reads and is about as fast as using a permanent Tally Table.

CREATE FUNCTION dbo.CleanString

(@pString VARCHAR(8000),@pKeepPattern VARCHAR(8000))

RETURNS TABLE

AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...

-- enough to cover VARCHAR(8000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

)

--===== Clean the string according to the "keep" pattern.

SELECT CleanedString =

(

SELECT SUBSTRING(@pString, t.N, 1)

FROM cteTally t

WHERE t.n <= LEN(@pString)

AND SUBSTRING(@pString, t.N, 1) LIKE @pKeepPattern COLLATE Latin1_General_BIN

ORDER BY t.n

FOR XML PATH('')

)

;

Whether you use the Tally Table version or the cteTally version, both are in a high performance iTVF. As you've seen, it requires a slightly different usage than a slower scalar function but it's worth it.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)