Removing Non-Alphabetical Characters

Question

Removing Non-Alphabetical Characters

Neil Burton

SSC-Insane

Points: 23001
More actions
October 9, 2014 at 3:26 am

#297311

I have a set of particularly cr@p last name data. The ops team use the last name field to add little indicators to give them information about the person in question. This is very useful for them and those out in the field who need this but it makes doing anything with this data very tricky. Going to the business and asking them to enter data properly is unfortunately not an option.
What I'm looking for is something I can use to strip out these characters and give me just the names...But, there are a lot of names in there with special characters that are supposed to be there i.e. Mr O'Hara, Lady Double-Barrelled etc., so I need to keep those. To make things even trickier there are Mrs O'Boyle##'s and Lord Snooty-Smyth##'s in there and I need to strip the ## or whatever garbage has been appended from that but keep the ' and - that are legitimately there. I've included below what I've tried, along with the examples of combinations I've found so far. I'm pretty sure this is not going to be straightforward.
create table #tmp_lastnames
(
LastName varchar(50)
,Problem varchar(100)
)
insert into #tmp_lastnames
select 'JONES', 'Normal data'union all
select 'Smith##', 'The most simple example'union all
select 'Fearnley-Whittingstall','Works ok, includes both names and hyphen' union all
select 'Cholmondeley-Warner##', 'Includes Cholmondeley but truncates at the hyphen'union all
select 'O''Driscoll','Works ok'union all
select 'O''Briain##','Works ok'union all
select 'Porter\O''Dowd##','Includes Porter but truncates at the backslash'union all
select 'O''Mara ##','Works ok'union all
select 'O''Hara -##','Works ok'union all
select 'O''Gara^^#D','Doesn''t work at all, the D at the end makes it fall over' union all
select 'Cox##''','Works ok'
select
LastName
,CleanName =
case
when patindex('%[^a-z]%',reverse(lastname)) = 1
then left(lastname,patindex('%[^a-z^'']%',lastname)-1)
else lastname
end
,Problem
from #tmp_lastnames
drop table #tmp_lastnames
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply

Koen Verbeeck SSC Guru Points: 259105 More actions · Answer 1

I think regular expressions might be the "easiest" option.

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

Neil Burton SSC-Insane Points: 23001 More actions · Answer 2

That's a phrase I've seen a lot around here but I wouldn't know one if I was served it for lunch 🙂 Are they something that we can do in SQL?

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Koen Verbeeck SSC Guru Points: 259105 More actions · Answer 3

BWFC (10/9/2014)
That's a phrase I've seen a lot around here but I wouldn't know one if I was served it for lunch 🙂 Are they something that we can do in SQL?

Not out of the box.

You can implement them through CLR. If you have MDS installed, there are some Regex functions available in the MDS database[/url].

Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP

Eirikur Eiriksson SSC Guru Points: 182912 More actions · Answer 4

Here is a quick rCTE based solution, probably not the most efficient but easy for an SQL developer to understand I would presume;-)

😎

USE tempdb;

GO

SET NOCOUNT ON;

GO

;WITH BASE_DATA(LN_ID,LastName,Problem) AS

(SELECT * FROM (VALUES

(1, 'JONES', 'Normal data')

,(2, 'Smith##', 'The most simple example')

,(3, 'Fearnley-Whittingstall','Works ok, includes both names and hyphen')

,(4, 'Cholmondeley-Warner##', 'Includes Cholmondeley but truncates at the hyphen')

,(5, 'O''Driscoll','Works ok')

,(6, 'O''Briain##','Works ok')

,(7, 'Porter\O''Dowd##','Includes Porter but truncates at the backslash')

,(8, 'O''Mara ##','Works ok')

,(9, 'O''Hara -##','Works ok')

,(10,'O''Gara^^#D','Doesn''t work at all, the D at the end makes it fall over')

,(11,'Cox##''','Works ok')) AS X(LN_ID,LastName,Problem)

)

,UNWANTED_CHARS(CHAR_VAL,CHAR_REPL) AS

(SELECT * FROM (VALUES

('#','')

,('\',' ')

,('/','')

,('''''','''')

,(' -','')

,('^','')

,('^^#',' ')

) AS X(CHAR_VAL,CHAR_REPL)

)

,CLEAN_SET AS

(

SELECT

1 AS LEVEL_NO

,BD.LN_ID

,CAST(BD.LastName AS VARCHAR(50)) AS LastName

,BD.Problem

FROM BASE_DATA BD

UNION ALL

SELECT

BD.LEVEL_NO + 1

,BD.LN_ID

,CASE

WHEN CHARINDEX(UC.CHAR_VAL,BD.LastName) > 0 THEN CAST(REPLACE(BD.LastName,UC.CHAR_VAL,UC.CHAR_REPL) AS VARCHAR(50))

ELSE BD.LastName

END AS LastName

,BD.Problem

FROM CLEAN_SET BD

OUTER APPLY UNWANTED_CHARS UC

WHERE CHARINDEX(UC.CHAR_VAL,BD.LastName) > 0

)

,FINAL_SET AS

(

SELECT

CT.LEVEL_NO

,CT.LN_ID

,ROW_NUMBER() OVER

(

PARTITION BY CT.LN_ID

ORDER BY CT.LEVEL_NO DESC

) AS CT_RID

,CT.LastName

,CT.Problem

FROM CLEAN_SET CT

)

SELECT

FS.LN_ID

,FS.LastName

,FS.Problem

FROM FINAL_SET FS

WHERE FS.CT_RID = 1

Results

LN_ID LastName Problem

------ ----------------------- --------------------------------------------------------

1 JONES Normal data

2 Smith The most simple example

3 Fearnley-Whittingstall Works ok, includes both names and hyphen

4 Cholmondeley-Warner Includes Cholmondeley but truncates at the hyphen

5 O'Driscoll Works ok

6 O'Briain Works ok

7 Porter O'Dowd Includes Porter but truncates at the backslash

8 O'Mara Works ok

9 O'Hara Works ok

10 O'GaraD Doesn't work at all, the D at the end makes it fall over

11 Cox' Works ok

Neil Burton SSC-Insane Points: 23001 More actions · Answer 5

Thanks Eirikur and Koen.

I'm going to pursue the CTE approach for the time being I think. The solution Eirikur proposed is close but still returns incorrect characters. There's also no way of knowing what kind of rubbish I'll find in there, so hard-coding the characters I'm looking for is risky. With regards to the regex option I can't assume that I've got MDS installed, neither am I likely to be able to install it. I'm merely a BI guy and we have very little say in what does and does not get installed :angry:

I'm not too concerned with efficiency for now. This is for a one-off load of a small set of existing data so if it's a bit slow I can live with it. When it comes to the day to day loading I'll be much more able to control what I have to deal with.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Eirikur Eiriksson SSC Guru Points: 182912 More actions · Answer 6

Quick though, you can play around with the order of values in the UNWANTED_CHARS, add some if missing and so on, should get you pretty close. As a general approach, add a key to UNWANTED_CHARS, enter the longest bad/search values first and add order by in the rCTE.

😎

Luis Cazares SSC Guru Points: 183706 More actions · Answer 7

I just wanted to play around a little bit and this might be a good option to clean the data using an inline table-valued function.

The function:

CREATE FUNCTION dbo.CleanNames( @Name varchar(8000))

RETURNS TABLE WITH SCHEMABINDING

AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

-- enough to cover NVARCHAR(4000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@Name),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

)

SELECT replace(

replace(

ltrim(

replace(

CAST((SELECT CASE WHEN SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 1) LIKE '[A-Za-z]'

OR SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 2) LIKE '[''-][A-Za-z]'

THEN SUBSTRING( @Name, N, 1) ELSE ' ' END

FROM cteTally

FOR XML PATH('')) AS varchar(8000)),

' ',' ')

),

' ',' '),

' ',' ')CleanName;

GO

How it can be used:

SELECT

LastName

,cn.CleanName

,Problem

FROM #tmp_lastnames

CROSS APPLY dbo.CleanNames( lastname) cn

The results:

LastName CleanName Problem

----------------------- ----------------------- ----------------------------------------------------------

JONES JONES Normal data

Smith## Smith The most simple example

Fearnley-Whittingstall Fearnley-Whittingstall Works ok, includes both names and hyphen

Cholmondeley-Warner## Cholmondeley-Warner Includes Cholmondeley but truncates at the hyphen

O'Driscoll O'Driscoll Works ok

O'Briain## O'Briain Works ok

Porter\O'Dowd## Porter O'Dowd Includes Porter but truncates at the backslash

O'Mara ## O'Mara Works ok

O'Hara -## O'Hara Works ok

O'Gara^^#D O'Gara D Doesn't work at all, the D at the end makes it fall over

Cox##' Cox Works ok

I didn't test the performance, but it shouldn't be too bad. 😛

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Alan Burstein SSC Guru Points: 61141 More actions · Answer 8

I think you were pretty close... You could do this:

WITH removebadstuff(old,new) AS

(

SELECT LastName, replace(replace(replace(replace(replace(lastname,'#',''),'^',''),'/',''),'\',''),' -','')

FROM #tmp_lastnames

)

SELECTold, -- old here for display only, not needed

new =

case patindex('%[a-z][^a-z]',new)

when 0 then new

else substring(new,1,patindex('%[a-z][^a-z]',new))

end

FROM removebadstuff;

Which returns:

oldnew

-------------------------------------------------- ------------------------------------

JONESJONES

Smith## Smith

Fearnley-WhittingstallFearnley-Whittingstall

Cholmondeley-Warner##Cholmondeley-Warner

O'DriscollO'Driscoll

O'Briain##O'Briain

Porter\O'Dowd##Porter O'Dowd

O'Mara ##O'Mara

O'Hara -##O'Hara

O'Gara^^#DO'GaraD

Cox##'Cox

Edit: the alignment of the results pasted are jacked up. I tried to fix but made it worse so I gave up. You get my point though 😛

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Erin Ramsay SSCrazy Eights Points: 9728 More actions · Answer 9

To avoid results like O'GaraD you're going to have to employ logic such as (possibly):

Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.

Luis Cazares SSC Guru Points: 183706 More actions · Answer 10

Erin Ramsay (10/9/2014)
To avoid results like O'GaraD you're going to have to employ logic such as (possibly):
Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.

My code handles that issue without a problem.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Neil Burton SSC-Insane Points: 23001 More actions · Answer 11

Luis Cazares (10/9/2014)
Erin Ramsay (10/9/2014)
To avoid results like O'GaraD you're going to have to employ logic such as (possibly):
Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.
My code handles that issue without a problem.

Unfortunately Luis your code does return 'O'Gara D'. I also get an XML blank space reference after each name and in O'Gara's case between the final A and the rogue D. I think this is down to a difference in settings\collation(?) though.

The other problem that everybody's had is that Porter\O'Dowd needs to keep the \. This is because the ops side often enter aliases like that rather than USING THE ALIAS BOX LIKE THEY'RE SUPPOSED TO!!!! [/rant]

Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid.

I'm really not sure what you mean by that Erin. If any characters before OR after the invalid character are invalid, where are the valid characters?

I starting to think that a one size fits all approach might not be the way to go.

Thanks for all your efforts on this everybody.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Luis Cazares SSC Guru Points: 183706 More actions · Answer 12

I see what went wrong. The site converted the xml space tag into a space.:w00t:

Other than that, could you explain the output that you need for "O'Gara^^#D"?

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Neil Burton SSC-Insane Points: 23001 More actions · Answer 13

It should just be O'Gara, there's quite a few names with Name #D in the list as it turns out.

The good news is it appears it's not a big issue to take out the special characters. The data warehouse I'm basing my project on has all the special codes in there. Rightly or wrongly I'm taking that as a sign to ignore it and concentrate on the rest of the load procs etc.

I'm going to carry on working on this as a side-side-project though. It's the kind of thing that will almost certainly be useful at some stage.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Luis Cazares SSC Guru Points: 183706 More actions · Answer 14

Maybe this version will help you.

CREATE FUNCTION dbo.CleanNames( @Name varchar(8000))

RETURNS TABLE WITH SCHEMABINDING

AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

-- enough to cover NVARCHAR(4000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@Name),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

)

SELECT replace(

replace(

ltrim(

(SELECT CASE WHEN SUBSTRING( @Name COLLATE Latin1_General_Bin, N - 1, 2) LIKE '[^#][A-Za-z]'

OR (SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 1) LIKE '[A-Za-z]' AND N = 1)

OR SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 2) LIKE '[''-\][A-Za-z]'

THEN SUBSTRING( @Name, N, 1) ELSE ' ' END

FROM cteTally

FOR XML PATH(''),TYPE).value('.', 'varchar(8000)')

),

' ',' '),

' ',' ') CleanName;

GO

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2