TSQL RegEx find and replace without CLR

Question

Post reply

TSQL RegEx find and replace without CLR

Shaun-884394

Ten Centuries

Points: 1096
More actions
May 11, 2012 at 1:25 am

#255808

Hi,
I am having a table with a column which is having data like # followed by some number for Eg. #123.
I want to strip out the # and the number from the column using TSQL without using clr regex replace.
Any help here would be highly appreciated.
Script for generating test data is given below.
DECLARE @BadData TABLE(DataID INT, Data VARCHAR(50), ExpectedData VARCHAR(50))
INSERT INTO @BadData(DataID,Data,ExpectedData)
SELECT 1,'#123 data','data' UNION
SELECT 2,'#456data','data' UNION
SELECT 3,'data #789','data' UNION
SELECT 4,'data#012','data' UNION
SELECT 5,'data#345 some more data','data some more data' UNION
SELECT 6,'data #678some more data','data some more data' UNION
SELECT 7,'data #901 some more data','data some more data' UNION
SELECT 8,'data#234some more data','datasome more data'
SELECT * FROM @BadData
The column 'Data' has bad values and the column 'ExpectedData' has the expected values.
I will soon post the queries i have tried so far but they didn't work, hence posting it to the forum.

Viewing 14 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

Ant-Green SSC Guru Points: 113567 More actions · Answer 1

there was an article posted this week in relation to RegEx for DBA's, it can be found here[/url]

Shaun-884394 Ten Centuries Points: 1096 More actions · Answer 2

Thanks for the link, but i was looking for a non-clr solution.

Shaun-884394 Ten Centuries Points: 1096 More actions · Answer 3

I have tried the below query but it does not cover all the cases, especially when the pattern is like #[number][string]

The output of query is attached to this post.

SELECT

DataID

,Data

,LEN(Data) AS [Length of Data]

,CHARINDEX('#',Data) AS [Position of #]

,CHARINDEX(' ',Data) AS [Position of space]

,CHARINDEX(' ',Data,CHARINDEX('#',Data)) AS [Position of space after #]

,CASE

WHEN (CHARINDEX(' ',Data,CHARINDEX('#',Data)) > 0)

THEN SUBSTRING(Data,CHARINDEX('#',Data),CHARINDEX(' ',Data,CHARINDEX('#',Data))-CHARINDEX('#',Data))

ELSE SUBSTRING(Data,CHARINDEX('#',Data),LEN(Data)-CHARINDEX('#',Data))

END AS [The word containing #]

,CASE

WHEN (CHARINDEX(' ',Data,CHARINDEX('#',Data)) > 0)

THEN LTRIM(RTRIM(REPLACE(Data,SUBSTRING(Data,CHARINDEX('#',Data),CHARINDEX(' ',Data,CHARINDEX('#',Data))-CHARINDEX('#',Data)),'')))

ELSE LTRIM(RTRIM(REPLACE(Data,SUBSTRING(Data,CHARINDEX('#',Data),LEN(Data)-CHARINDEX('#',Data)),'')))

END AS [Output]

,ExpectedData

FROM

@BadData

Sean Lange SSC Guru Points: 286609 More actions · Answer 4

How about nested replace. You said all you want is to remove "#" and any numbers.

SELECT rtrim(ltrim(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(REPLACE(Data, '#', ''), '1', ''), '2', ''), '3', ''), '4', ''), '5', ''), '6', ''), '7', ''), '8', ''), '9', ''), '0', '')))

, ExpectedData

FROM @BadData

The nested replace approach is easy to read and understand, it also has the advantage that is super fast. Then just trim left and right. This at least works on your sample data.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Lynn Pettis SSC Guru Points: 442470 More actions · Answer 5

Have to ask, is it always a # and three digits or is that just the way you configured your sample data?

Lynn Pettis SSC Guru Points: 442470 More actions · Answer 6

Here is a shot at it:

DECLARE @BadData TABLE(DataID INT, Data VARCHAR(50), ExpectedData VARCHAR(50));

INSERT INTO @BadData(DataID,Data,ExpectedData)

SELECT 1,'#123 data','data' UNION

SELECT 2,'#456data','data' UNION

SELECT 3,'data #789','data' UNION

SELECT 4,'data#012','data' UNION

SELECT 5,'data#345 some more data','data some more data' UNION

SELECT 6,'data #678some more data','data some more data' UNION

SELECT 7,'data #901 some more data','data some more data' UNION

SELECT 8,'data#234some more data','datasome more data' UNION

SELECT 9,'data#2some more data','datasome more data';

--SELECT * FROM @BadData;

SELECT

SUBSTRING(Data,1,PATINDEX('%#[0-9]%', Data) - 1) +

REVERSE(SUBSTRING(REVERSE(SUBSTRING(Data,PATINDEX('%#[0-9]%', Data) + 1,DATALENGTH(Data) - PATINDEX('%#[0-9]%', Data))),1,PATINDEX('%[0-9]%',REVERSE(SUBSTRING(Data,PATINDEX('%#[0-9]%', Data) + 1,DATALENGTH(Data) - PATINDEX('%#[0-9]%', Data)))) - 1)),

Data,ExpectedData

FROM

@BadData;

Sean Lange SSC Guru Points: 286609 More actions · Answer 7

Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 8

Shaun-884394 (5/11/2012)
Thanks for the link, but i was looking for a non-clr solution.

Any particular reason why not? Let me guess...political reasons in your organization?

RegEx is one of the primary use-cases for using SQLCLR in the database and there are tons of good examples of UDFs and Procedures already written. RegEx can be used in assemblies marked as SAFE as well meaning security is far less of a concern when using SQLCLR this way.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Lynn Pettis SSC Guru Points: 442470 More actions · Answer 9

Sean Lange (5/11/2012)
Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.

Didn't see it. Looks good actually as long as there aren't other numbers in the string that need to be kept. Spec did say # followed by numbers.

Sean Lange SSC Guru Points: 286609 More actions · Answer 10

Lynn Pettis (5/11/2012)
Sean Lange (5/11/2012)
Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.
Didn't see it. Looks good actually as long as there aren't other numbers in the string that need to be kept. Spec did say # followed by numbers.

Good point about the spec, guess it worked on the sample data so I figured it was good enough. 😉

If there are other numbers you need to keep and CLR is off the table then go with Lynn's approach. Of course CLR is still the easiest way to handle this.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

Lynn Pettis SSC Guru Points: 442470 More actions · Answer 11

Sean Lange (5/11/2012)
Lynn Pettis (5/11/2012)
Sean Lange (5/11/2012)
Lynn, why not the replace I posted above? It is super fast because it doesn't have to do all sorts of string manipulation.
Didn't see it. Looks good actually as long as there aren't other numbers in the string that need to be kept. Spec did say # followed by numbers.
Good point about the spec, guess it worked on the sample data so I figured it was good enough. 😉
If there are other numbers you need to keep and CLR is off the table then go with Lynn's approach. Of course CLR is still the easiest way to handle this.

Actually, mine will fail if there are extra numbers. I just thought of that.

Lynn Pettis SSC Guru Points: 442470 More actions · Answer 12

This should work:

DECLARE @BadData TABLE(DataID INT, Data VARCHAR(50), ExpectedData VARCHAR(50));

INSERT INTO @BadData(DataID,Data,ExpectedData)

SELECT 1,'#123 data','data' UNION

SELECT 2,'#456data','data' UNION

SELECT 3,'data #789','data' UNION

SELECT 4,'data#012','data' UNION

SELECT 5,'data#345 some more data','data some more data' UNION

SELECT 6,'data #678some more data','data some more data' UNION

SELECT 7,'data #901 some more data','data some more data' UNION

SELECT 8,'data#234some more data','datasome more data' UNION

SELECT 9,'data#2some more data','datasome more data' UNION

SELECT 10, 'data#2 4 some more data','data 4 some more data';

--SELECT * FROM @BadData;

SELECT

LEFT(Data, PATINDEX('%#[0-9]%', Data) - 1)

+ RIGHT(Data,CASE WHEN PATINDEX('%[^0-9]%',SUBSTRING(Data, PATINDEX('%#[0-9]%', Data) + 1, DATALENGTH(Data))) = 0 THEN 0 ELSE DATALENGTH(Data) - (PATINDEX('%#[0-9]%', Data) + PATINDEX('%[^0-9]%',SUBSTRING(Data, PATINDEX('%#[0-9]%', Data) + 1, DATALENGTH(Data)))) + 1 END)

,Data,ExpectedData

FROM

@BadData;

StanVanDam SSC Journeyman Points: 81 More actions · Answer 13

Hi Lynn,

What if the source data contained the same pattern that is repeated many times? Example:

SELECT 1,'001N data 002N003N004N','data' UNION

SELECT 2,'005N005Ndata007N','data' UNION

SELECT 3,'data 008N009N010N','data ' UNION

SELECT 4,'data011Nsome012Nmore013Ndata014N','datasomemoredata' UNION

SELECT 5,'data015N some 016N017Nmore data','data some more data' UNION

SELECT 6,'data 050Nsome more 051N052Ndata','data some more data' UNION

SELECT 7,'data 090N some more data091N092N','data some more data' UNION

SELECT 8,'data093Nsome more 094N095Ndata','datasome more data' UNION

SELECT 9,'data096N some more097N098N data','data some more data' UNION

SELECT 10, 'data099N 4 some100N 5more 101Ndata','data 4 some 5more data';

Note that in my example, the string I want to remove is always 3 characters long, datatype Text, in the format of [0-9][0-9][0-9]N, ie. 001N, 002N, ... 999N, which repeats up to 999 times in each string, and no numbers repeat, they just increment upwards by 1. In addition, eliminating all but 1 space in between all words would also be helpful too. Thanks!

EDIT: I've solved the problem myself by creating a function as follows:

CREATE FUNCTION [dbo].[fnCleanN]

(

@BadString nvarchar(MAX)

)

RETURNS nvarchar(MAX)

AS

BEGIN

DECLARE @npos INTEGER

SELECT @npos = PATINDEX('%[0-9][0-9][0-9]N%', @BadString)

WHILE @npos > 0

BEGIN

SELECT @BadString = STUFF(@BadString, @npos, 4, '')

SELECT @npos = PATINDEX('%[0-9][0-9][0-9]N%', @BadString)

END

RETURN @BadString

END