October 9, 2014 at 3:26 am
I have a set of particularly cr@p last name data. The ops team use the last name field to add little indicators to give them information about the person in question. This is very useful for them and those out in the field who need this but it makes doing anything with this data very tricky. Going to the business and asking them to enter data properly is unfortunately not an option.
What I'm looking for is something I can use to strip out these characters and give me just the names...But, there are a lot of names in there with special characters that are supposed to be there i.e. Mr O'Hara, Lady Double-Barrelled etc., so I need to keep those. To make things even trickier there are Mrs O'Boyle##'s and Lord Snooty-Smyth##'s in there and I need to strip the ## or whatever garbage has been appended from that but keep the ' and - that are legitimately there. I've included below what I've tried, along with the examples of combinations I've found so far. I'm pretty sure this is not going to be straightforward.
create table #tmp_lastnames
(
LastName varchar(50)
,Problem varchar(100)
)
insert into #tmp_lastnames
select 'JONES', 'Normal data'union all
select 'Smith##', 'The most simple example'union all
select 'Fearnley-Whittingstall','Works ok, includes both names and hyphen' union all
select 'Cholmondeley-Warner##', 'Includes Cholmondeley but truncates at the hyphen'union all
select 'O''Driscoll','Works ok'union all
select 'O''Briain##','Works ok'union all
select 'Porter\O''Dowd##','Includes Porter but truncates at the backslash'union all
select 'O''Mara ##','Works ok'union all
select 'O''Hara -##','Works ok'union all
select 'O''Gara^^#D','Doesn''t work at all, the D at the end makes it fall over' union all
select 'Cox##''','Works ok'
select
LastName
,CleanName =
case
when patindex('%[^a-z]%',reverse(lastname)) = 1
then left(lastname,patindex('%[^a-z^'']%',lastname)-1)
else lastname
end
,Problem
from #tmp_lastnames
drop table #tmp_lastnames
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537
October 9, 2014 at 4:19 am
I think regular expressions might be the "easiest" option.
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
October 9, 2014 at 4:26 am
That's a phrase I've seen a lot around here but I wouldn't know one if I was served it for lunch 🙂 Are they something that we can do in SQL?
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537
October 9, 2014 at 4:50 am
BWFC (10/9/2014)
That's a phrase I've seen a lot around here but I wouldn't know one if I was served it for lunch 🙂 Are they something that we can do in SQL?
Not out of the box.
You can implement them through CLR. If you have MDS installed, there are some Regex functions available in the MDS database[/url].
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
October 9, 2014 at 6:29 am
Here is a quick rCTE based solution, probably not the most efficient but easy for an SQL developer to understand I would presume;-)
😎
USE tempdb;
GO
SET NOCOUNT ON;
GO
;WITH BASE_DATA(LN_ID,LastName,Problem) AS
(SELECT * FROM (VALUES
(1, 'JONES', 'Normal data')
,(2, 'Smith##', 'The most simple example')
,(3, 'Fearnley-Whittingstall','Works ok, includes both names and hyphen')
,(4, 'Cholmondeley-Warner##', 'Includes Cholmondeley but truncates at the hyphen')
,(5, 'O''Driscoll','Works ok')
,(6, 'O''Briain##','Works ok')
,(7, 'Porter\O''Dowd##','Includes Porter but truncates at the backslash')
,(8, 'O''Mara ##','Works ok')
,(9, 'O''Hara -##','Works ok')
,(10,'O''Gara^^#D','Doesn''t work at all, the D at the end makes it fall over')
,(11,'Cox##''','Works ok')) AS X(LN_ID,LastName,Problem)
)
,UNWANTED_CHARS(CHAR_VAL,CHAR_REPL) AS
(SELECT * FROM (VALUES
('#','')
,('\',' ')
,('/','')
,('''''','''')
,(' -','')
,('^','')
,('^^#',' ')
) AS X(CHAR_VAL,CHAR_REPL)
)
,CLEAN_SET AS
(
SELECT
1 AS LEVEL_NO
,BD.LN_ID
,CAST(BD.LastName AS VARCHAR(50)) AS LastName
,BD.Problem
FROM BASE_DATA BD
UNION ALL
SELECT
BD.LEVEL_NO + 1
,BD.LN_ID
,CASE
WHEN CHARINDEX(UC.CHAR_VAL,BD.LastName) > 0 THEN CAST(REPLACE(BD.LastName,UC.CHAR_VAL,UC.CHAR_REPL) AS VARCHAR(50))
ELSE BD.LastName
END AS LastName
,BD.Problem
FROM CLEAN_SET BD
OUTER APPLY UNWANTED_CHARS UC
WHERE CHARINDEX(UC.CHAR_VAL,BD.LastName) > 0
)
,FINAL_SET AS
(
SELECT
CT.LEVEL_NO
,CT.LN_ID
,ROW_NUMBER() OVER
(
PARTITION BY CT.LN_ID
ORDER BY CT.LEVEL_NO DESC
) AS CT_RID
,CT.LastName
,CT.Problem
FROM CLEAN_SET CT
)
SELECT
FS.LN_ID
,FS.LastName
,FS.Problem
FROM FINAL_SET FS
WHERE FS.CT_RID = 1
Results
LN_ID LastName Problem
------ ----------------------- --------------------------------------------------------
1 JONES Normal data
2 Smith The most simple example
3 Fearnley-Whittingstall Works ok, includes both names and hyphen
4 Cholmondeley-Warner Includes Cholmondeley but truncates at the hyphen
5 O'Driscoll Works ok
6 O'Briain Works ok
7 Porter O'Dowd Includes Porter but truncates at the backslash
8 O'Mara Works ok
9 O'Hara Works ok
10 O'GaraD Doesn't work at all, the D at the end makes it fall over
11 Cox' Works ok
October 9, 2014 at 6:48 am
Thanks Eirikur and Koen.
I'm going to pursue the CTE approach for the time being I think. The solution Eirikur proposed is close but still returns incorrect characters. There's also no way of knowing what kind of rubbish I'll find in there, so hard-coding the characters I'm looking for is risky. With regards to the regex option I can't assume that I've got MDS installed, neither am I likely to be able to install it. I'm merely a BI guy and we have very little say in what does and does not get installed :angry:
I'm not too concerned with efficiency for now. This is for a one-off load of a small set of existing data so if it's a bit slow I can live with it. When it comes to the day to day loading I'll be much more able to control what I have to deal with.
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537
October 9, 2014 at 6:55 am
Quick though, you can play around with the order of values in the UNWANTED_CHARS, add some if missing and so on, should get you pretty close. As a general approach, add a key to UNWANTED_CHARS, enter the longest bad/search values first and add order by in the rCTE.
😎
October 9, 2014 at 8:52 am
I just wanted to play around a little bit and this might be a good option to clean the data using an inline table-valued function.
The function:
CREATE FUNCTION dbo.CleanNames( @Name varchar(8000))
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@Name),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
SELECT replace(
replace(
replace(
replace(
replace(
replace(
replace(
ltrim(
replace(
CAST((SELECT CASE WHEN SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 1) LIKE '[A-Za-z]'
OR SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 2) LIKE '[''-][A-Za-z]'
THEN SUBSTRING( @Name, N, 1) ELSE ' ' END
FROM cteTally
FOR XML PATH('')) AS varchar(8000)),
' ',' ')
),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ')CleanName;
GO
How it can be used:
SELECT
LastName
,cn.CleanName
,Problem
FROM #tmp_lastnames
CROSS APPLY dbo.CleanNames( lastname) cn
The results:
LastName CleanName Problem
----------------------- ----------------------- ----------------------------------------------------------
JONES JONES Normal data
Smith## Smith The most simple example
Fearnley-Whittingstall Fearnley-Whittingstall Works ok, includes both names and hyphen
Cholmondeley-Warner## Cholmondeley-Warner Includes Cholmondeley but truncates at the hyphen
O'Driscoll O'Driscoll Works ok
O'Briain## O'Briain Works ok
Porter\O'Dowd## Porter O'Dowd Includes Porter but truncates at the backslash
O'Mara ## O'Mara Works ok
O'Hara -## O'Hara Works ok
O'Gara^^#D O'Gara D Doesn't work at all, the D at the end makes it fall over
Cox##' Cox Works ok
I didn't test the performance, but it shouldn't be too bad. 😛
October 9, 2014 at 10:00 am
I think you were pretty close... You could do this:
WITH removebadstuff(old,new) AS
(
SELECT LastName, replace(replace(replace(replace(replace(lastname,'#',''),'^',''),'/',''),'\',''),' -','')
FROM #tmp_lastnames
)
SELECTold, -- old here for display only, not needed
new =
case patindex('%[a-z][^a-z]',new)
when 0 then new
else substring(new,1,patindex('%[a-z][^a-z]',new))
end
FROM removebadstuff;
Which returns:
oldnew
-------------------------------------------------- ------------------------------------
JONESJONES
Smith## Smith
Fearnley-WhittingstallFearnley-Whittingstall
Cholmondeley-Warner##Cholmondeley-Warner
O'DriscollO'Driscoll
O'Briain##O'Briain
Porter\O'Dowd##Porter O'Dowd
O'Mara ##O'Mara
O'Hara -##O'Hara
O'Gara^^#DO'GaraD
Cox##'Cox
Edit: the alignment of the results pasted are jacked up. I tried to fix but made it worse so I gave up. You get my point though 😛
-- Itzik Ben-Gan 2001
October 9, 2014 at 2:41 pm
To avoid results like O'GaraD you're going to have to employ logic such as (possibly):
Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.
October 9, 2014 at 2:55 pm
Erin Ramsay (10/9/2014)
To avoid results like O'GaraD you're going to have to employ logic such as (possibly):Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.
My code handles that issue without a problem.
October 10, 2014 at 1:51 am
Luis Cazares (10/9/2014)
Erin Ramsay (10/9/2014)
To avoid results like O'GaraD you're going to have to employ logic such as (possibly):Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.
My code handles that issue without a problem.
Unfortunately Luis your code does return 'O'Gara D'. I also get an XML blank space reference after each name and in O'Gara's case between the final A and the rogue D. I think this is down to a difference in settings\collation(?) though.
The other problem that everybody's had is that Porter\O'Dowd needs to keep the \. This is because the ops side often enter aliases like that rather than USING THE ALIAS BOX LIKE THEY'RE SUPPOSED TO!!!! [/rant]
Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid.
I'm really not sure what you mean by that Erin. If any characters before OR after the invalid character are invalid, where are the valid characters?
I starting to think that a one size fits all approach might not be the way to go.
Thanks for all your efforts on this everybody.
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537
October 10, 2014 at 10:42 am
I see what went wrong. The site converted the xml space tag into a space.:w00t:
Other than that, could you explain the output that you need for "O'Gara^^#D"?
October 10, 2014 at 1:03 pm
It should just be O'Gara, there's quite a few names with Name #D in the list as it turns out.
The good news is it appears it's not a big issue to take out the special characters. The data warehouse I'm basing my project on has all the special codes in there. Rightly or wrongly I'm taking that as a sign to ignore it and concentrate on the rest of the load procs etc.
I'm going to carry on working on this as a side-side-project though. It's the kind of thing that will almost certainly be useful at some stage.
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537
October 10, 2014 at 4:38 pm
Maybe this version will help you.
CREATE FUNCTION dbo.CleanNames( @Name varchar(8000))
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@Name),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
SELECT replace(
replace(
replace(
replace(
replace(
replace(
replace(
ltrim(
(SELECT CASE WHEN SUBSTRING( @Name COLLATE Latin1_General_Bin, N - 1, 2) LIKE '[^#][A-Za-z]'
OR (SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 1) LIKE '[A-Za-z]' AND N = 1)
OR SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 2) LIKE '[''-\][A-Za-z]'
THEN SUBSTRING( @Name, N, 1) ELSE ' ' END
FROM cteTally
FOR XML PATH(''),TYPE).value('.', 'varchar(8000)')
),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' '),
' ',' ') CleanName;
GO
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply