Stripping out all non-numerical characters

Question

Post reply

Stripping out all non-numerical characters

Sachin 80451

SSCommitted

Points: 1642
More actions
December 18, 2012 at 4:22 am

#268349

SELECT accountid,New_Column = REPLACE(telephone1,SUBSTRING(telephone1,PATINDEX('[^0-9]',telephone1),1),'')
INTO #YourNewResults
FROM #TelephoneTable
I have the script above which I thought would remove all non-numerical characters from the field but this not the case?
I have some fields with numbers+characters and some with characters only regardless of the combination I want all characters removed.
I have seen there are a lot of functions is this the best way? If so where would the function be applied in the script?

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

s_osborne2 SSCertifiable Points: 6726 More actions · Answer 1

Hi,

The best way in my opinion would be to use Regex and some CLR functions to replace the characters. Read this article:

http://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/[/url]

Once you've got the functions installed then it's as simple as:

SELECT dbo.RegExReplace('ABC123ABC','[A-Z]','')

MCSE: Data Platform
MCSE: Business Intelligence
Follow me on Twitter: @WazzTheBadger
LinkedIn Profile: Simon Osborne

Jeff Moden SSC Guru Points: 1004341 More actions · Answer 2

~~@Sachin,~~

~~I agree that the CLR Regex method should be the fastest although I've not seen any performance testing on it.~~

~~The questions I have for you is...~~

~~1. Can you use CLR or do you need a 100% T-SQL solution?~~

~~2. Is the maximum length of the data <= 8K bytes?~~

I no longer believe this. In the last 2 weeks, I've seen CLR Regex get its doors blown off by more than one example of some good, ol' fashioned T-SQL. CLR is still the fastest for splitting a string and it might be the fastest for cleaning a string, but CLR Regex probably won't be. Regex itself seems to have been over promised and under delivered. It appears that you pay in performance what you thought you gained in flexibility.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Sachin 80451 SSCommitted Points: 1642 More actions · Answer 3

1. I would prefer t-SQL as CLR is way beyond my understanding

2. At the moment I'm not concerned about performance

GSquared SSC Guru Points: 260824 More actions · Answer 4

Here's one way to do it in T-SQL:

DECLARE @String VARCHAR(8000) = 'ABC12D34E56';

WITH Seeds(Seed)

AS (SELECT *

FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),

Numbers(Number)

AS (SELECT TOP (8000)

ROW_NUMBER() OVER (ORDER BY S1.Seed)

FROM Seeds AS S1

CROSS JOIN Seeds AS S2

CROSS JOIN Seeds AS S3

CROSS JOIN Seeds AS S4)

SELECT (

SELECT SUBSTRING(@String, Number, 1)

FROM Numbers

WHERE Number <= LEN(@String)

AND SUBSTRING(@String, Number, 1) LIKE '[0-9]'

ORDER BY Number

FOR XML PATH(''),

TYPE).value('.[1]', 'VARCHAR(8000)');

Alternatively, if you need to apply this to a table instead of a variable:

IF OBJECT_ID(N'tempdb..#T') IS NOT NULL

DROP TABLE #T;

CREATE TABLE #T

(ID INT IDENTITY

PRIMARY KEY,

Col1 VARCHAR(8000));

INSERT INTO #T

(Col1)

VALUES ('123A'),

('B1C2D3');

WITH Seeds(Seed)

AS (SELECT *

FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),

Numbers(Number)

AS (SELECT TOP (8000)

ROW_NUMBER() OVER (ORDER BY S1.Seed)

FROM Seeds AS S1

CROSS JOIN Seeds AS S2

CROSS JOIN Seeds AS S3

CROSS JOIN Seeds AS S4)

SELECT *

FROM #T AS T

CROSS APPLY (SELECT (

SELECT SUBSTRING(Col1, Number, 1)

FROM Numbers

WHERE Number <= LEN(Col1)

AND SUBSTRING(Col1, Number, 1) LIKE '[0-9]'

ORDER BY Number

FOR XML PATH(''),

TYPE).value('.[1]', 'VARCHAR(8000)') AS Stripped) AS Parser;

In either case, it uses a "Runtime Numbers Table" to parse the string into individual characters, then strips out anything other than digits 0-9, then uses For XML to put it all back together. (FOR XML PATH with a zero-length root indicated by (''), and no column name for the query, will concatenate strings together nicely. It's a documented trick that comes in very handy.)

Does that help?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Dwain Camps SSC Guru Points: 86948 More actions · Answer 5

This can also be done using the PatternSplitCM function described in the fourth article in my signature line (Splitting Strings Based on Patterns).

IF OBJECT_ID(N'tempdb..#T') IS NOT NULL

DROP TABLE #T;

CREATE TABLE #T

(ID INT IDENTITY

PRIMARY KEY,

Col1 VARCHAR(8000));

INSERT INTO #T

(Col1)

VALUES ('123A'),

('B1C2D3');

;WITH CTE AS (

SELECT *

FROM #T a

CROSS APPLY PatternSplitCM(a.Col1, '[0-9]')

WHERE [Matched] = 1)

SELECT ID, Col1=MAX(Col1), Col2=(

SELECT '' + Item

FROM CTE b

WHERE a.ID = b.ID

ORDER BY ItemNumber

FOR XML PATH(''))

FROM CTE a

GROUP BY ID

IF OBJECT_ID(N'tempdb..#T') IS NOT NULL

DROP TABLE #T;

In fact the inspiration for that article (described therein) was a forum-posted question that was quite similar to this one.

One caveat though. Since you're working in SQL 2005, you'll need to replace the numbers table with a Ben-Gan style Tally table something like this one (from the PatternSplitQU FUNCTION also in that article):

WITH Nbrs_3(n) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1)

,Nbrs_2 (n) AS (SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2)

,Nbrs_1 (n) AS (SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2)

,Nbrs_0 (n) AS (SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2)

,Tally (n) AS (SELECT ROW_NUMBER() OVER (ORDER BY n) As n FROM Nbrs_0)

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86948 More actions · Answer 6

GSquared (12/18/2012)
Here's one way to do it in T-SQL:
DECLARE @String VARCHAR(8000) = 'ABC12D34E56';
WITH Seeds(Seed)
AS (SELECT *
FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),
Numbers(Number)
AS (SELECT TOP (8000)
ROW_NUMBER() OVER (ORDER BY S1.Seed)
FROM Seeds AS S1
CROSS JOIN Seeds AS S2
CROSS JOIN Seeds AS S3
CROSS JOIN Seeds AS S4)
SELECT (
SELECT SUBSTRING(@String, Number, 1)
FROM Numbers
WHERE Number <= LEN(@String)
AND SUBSTRING(@String, Number, 1) LIKE '[0-9]'
ORDER BY Number
FOR XML PATH(''),
TYPE).value('.[1]', 'VARCHAR(8000)');
Alternatively, if you need to apply this to a table instead of a variable:
IF OBJECT_ID(N'tempdb..#T') IS NOT NULL
DROP TABLE #T;
CREATE TABLE #T
(ID INT IDENTITY
PRIMARY KEY,
Col1 VARCHAR(8000));
INSERT INTO #T
(Col1)
VALUES ('123A'),
('B1C2D3');
WITH Seeds(Seed)
AS (SELECT *
FROM ( VALUES ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1), ( 1) ) AS V (C)),
Numbers(Number)
AS (SELECT TOP (8000)
ROW_NUMBER() OVER (ORDER BY S1.Seed)
FROM Seeds AS S1
CROSS JOIN Seeds AS S2
CROSS JOIN Seeds AS S3
CROSS JOIN Seeds AS S4)
SELECT *
FROM #T AS T
CROSS APPLY (SELECT (
SELECT SUBSTRING(Col1, Number, 1)
FROM Numbers
WHERE Number <= LEN(Col1)
AND SUBSTRING(Col1, Number, 1) LIKE '[0-9]'
ORDER BY Number
FOR XML PATH(''),
TYPE).value('.[1]', 'VARCHAR(8000)') AS Stripped) AS Parser;
In either case, it uses a "Runtime Numbers Table" to parse the string into individual characters, then strips out anything other than digits 0-9, then uses For XML to put it all back together. (FOR XML PATH with a zero-length root indicated by (''), and no column name for the query, will concatenate strings together nicely. It's a documented trick that comes in very handy.)
Does that help?

Gus - I think the OP may have a problem with your Tally table (Seeds CTE) for the same reason I noted above.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jeff Moden SSC Guru Points: 1004341 More actions · Answer 7

Jeff Moden (12/18/2012)
@Sachin,
I agree that the CLR Regex method should be the fastest although I've not seen any performance testing on it.
The questions I have for you is...
1. Can you use CLR or do you need a 100% T-SQL solution?
2. Is the maximum length of the data <= 8K bytes?

Actually, I have to take that back. CLR would be the fastest method but only if you DON"T use Regex. I've recently seen many performance tests and either some dedicated CLR or some dedicated T-SQL will usually smoke RegEx according to the tests I've seen.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004341 More actions · Answer 8

Sachin 80451 (12/18/2012)
1. I would prefer t-SQL as CLR is way beyond my understanding
2. At the moment I'm not concerned about performance

I whole heartedly agree with #1 above. I never agree with #2 above because they always come back about the performance problem that a given solution is having. The only thing you should be more concerned about than performance is accuracy and the two should be virtually tied as being the most important. If you think not, spend an hour or two looking at the thousands of performance issues people are asking about on this forum alone.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Dwain Camps SSC Guru Points: 86948 More actions · Answer 9

Jeff Moden (12/25/2012)
Jeff Moden (12/18/2012)
@Sachin,
I agree that the CLR Regex method should be the fastest although I've not seen any performance testing on it.
The questions I have for you is...
1. Can you use CLR or do you need a 100% T-SQL solution?
2. Is the maximum length of the data <= 8K bytes?
Actually, I have to take that back. CLR would be the fastest method but only if you DON"T use Regex. I've recently seen many performance tests and either some dedicated CLR or some dedicated T-SQL will usually smoke RegEx according to the tests I've seen.

I've got your back Jeff.

Here's one example: http://www.sqlservercentral.com/Forums/Topic1390297-3122-5.aspx

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Sachin 80451 SSCommitted Points: 1642 More actions · Answer 10

I have no experience in CRL, so would like T-SQL for now please?

the answer to the second question is yes.

Your Name Here SSCrazy Points: 2344 More actions · Answer 11

@GSquared:

Sorry to be thick-headed (read: slow-witted) but I have to ask: why did you SELECT TOP (8000) ROW_NUMBER... when it [seems] to work with TOP (10). Since we're using only the numbers 0-9, I would have figured using a generic "number generator" like SELECTING ROW_NUMBER() ... from a "known" source that would have at least 10 records (like sys.objects), you could skinny down the code and maybe processing as well:

DECLARE @String VARCHAR(8000) = 'ABC12D34E56';

SELECT (

SELECT SUBSTRING(@String, n.[Number], 1)

FROM (

SELECT TOP (10) ROW_NUMBER()

OVER (ORDER BY [object_id]) AS [Number]

FROM sys.objects) n

WHERE n.[Number] <= LEN(@String)

AND SUBSTRING(@String, n.[Number], 1) LIKE '[0-9]'

ORDER BY n.[Number]

FOR XML PATH(''), TYPE).[value]('.[1]', 'VARCHAR(8000)');

Factoring in inflation, health care, taxes and cost of living, my two cents isn't worth much so I apologize if I missed something. I'll blame the lack of coffee for any oversight on my part.

Cheers,

Ken

Jeff Moden SSC Guru Points: 1004341 More actions · Answer 12

Your Name Here (1/4/2013)
@GSquared:
Sorry to be thick-headed (read: slow-witted) but I have to ask: why did you SELECT TOP (8000) ROW_NUMBER... when it [seems] to work with TOP (10). Since we're using only the numbers 0-9, I would have figured using a generic "number generator" like SELECTING ROW_NUMBER() ... from a "known" source that would have at least 10 records (like sys.objects), you could skinny down the code and maybe processing as well:
DECLARE @String VARCHAR(8000) = 'ABC12D34E56';
SELECT (
SELECT SUBSTRING(@String, n.[Number], 1)
FROM (
SELECT TOP (10) ROW_NUMBER()
OVER (ORDER BY [object_id]) AS [Number]
FROM sys.objects) n
WHERE n.[Number] <= LEN(@String)
AND SUBSTRING(@String, n.[Number], 1) LIKE '[0-9]'
ORDER BY n.[Number]
FOR XML PATH(''), TYPE).[value]('.[1]', 'VARCHAR(8000)');
Factoring in inflation, health care, taxes and cost of living, my two cents isn't worth much so I apologize if I missed something. I'll blame the lack of coffee for any oversight on my part.
Cheers,
Ken

Try it on a much larger string and see why the 8000 becomes important.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Your Name Here SSCrazy Points: 2344 More actions · Answer 13

Your Name Here

SSCrazy

Points: 2344

January 4, 2013 at 11:45 am

#1573891

:exclamationmark:

I get it now. Thanks!