Removing Non-Alphabetical Characters

  • I have a set of particularly cr@p last name data. The ops team use the last name field to add little indicators to give them information about the person in question. This is very useful for them and those out in the field who need this but it makes doing anything with this data very tricky. Going to the business and asking them to enter data properly is unfortunately not an option.

    What I'm looking for is something I can use to strip out these characters and give me just the names...But, there are a lot of names in there with special characters that are supposed to be there i.e. Mr O'Hara, Lady Double-Barrelled etc., so I need to keep those. To make things even trickier there are Mrs O'Boyle##'s and Lord Snooty-Smyth##'s in there and I need to strip the ## or whatever garbage has been appended from that but keep the ' and - that are legitimately there. I've included below what I've tried, along with the examples of combinations I've found so far. I'm pretty sure this is not going to be straightforward.

    create table #tmp_lastnames

    (

    LastName varchar(50)

    ,Problem varchar(100)

    )

    insert into #tmp_lastnames

    select 'JONES', 'Normal data'union all

    select 'Smith##', 'The most simple example'union all

    select 'Fearnley-Whittingstall','Works ok, includes both names and hyphen' union all

    select 'Cholmondeley-Warner##', 'Includes Cholmondeley but truncates at the hyphen'union all

    select 'O''Driscoll','Works ok'union all

    select 'O''Briain##','Works ok'union all

    select 'Porter\O''Dowd##','Includes Porter but truncates at the backslash'union all

    select 'O''Mara ##','Works ok'union all

    select 'O''Hara -##','Works ok'union all

    select 'O''Gara^^#D','Doesn''t work at all, the D at the end makes it fall over' union all

    select 'Cox##''','Works ok'

    select

    LastName

    ,CleanName =

    case

    when patindex('%[^a-z]%',reverse(lastname)) = 1

    then left(lastname,patindex('%[^a-z^'']%',lastname)-1)

    else lastname

    end

    ,Problem

    from #tmp_lastnames

    drop table #tmp_lastnames


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • I think regular expressions might be the "easiest" option.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • That's a phrase I've seen a lot around here but I wouldn't know one if I was served it for lunch 🙂 Are they something that we can do in SQL?


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • BWFC (10/9/2014)


    That's a phrase I've seen a lot around here but I wouldn't know one if I was served it for lunch 🙂 Are they something that we can do in SQL?

    Not out of the box.

    You can implement them through CLR. If you have MDS installed, there are some Regex functions available in the MDS database[/url].

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Here is a quick rCTE based solution, probably not the most efficient but easy for an SQL developer to understand I would presume;-)

    😎

    USE tempdb;

    GO

    SET NOCOUNT ON;

    GO

    ;WITH BASE_DATA(LN_ID,LastName,Problem) AS

    (SELECT * FROM (VALUES

    (1, 'JONES', 'Normal data')

    ,(2, 'Smith##', 'The most simple example')

    ,(3, 'Fearnley-Whittingstall','Works ok, includes both names and hyphen')

    ,(4, 'Cholmondeley-Warner##', 'Includes Cholmondeley but truncates at the hyphen')

    ,(5, 'O''Driscoll','Works ok')

    ,(6, 'O''Briain##','Works ok')

    ,(7, 'Porter\O''Dowd##','Includes Porter but truncates at the backslash')

    ,(8, 'O''Mara ##','Works ok')

    ,(9, 'O''Hara -##','Works ok')

    ,(10,'O''Gara^^#D','Doesn''t work at all, the D at the end makes it fall over')

    ,(11,'Cox##''','Works ok')) AS X(LN_ID,LastName,Problem)

    )

    ,UNWANTED_CHARS(CHAR_VAL,CHAR_REPL) AS

    (SELECT * FROM (VALUES

    ('#','')

    ,('\',' ')

    ,('/','')

    ,('''''','''')

    ,(' -','')

    ,('^','')

    ,('^^#',' ')

    ) AS X(CHAR_VAL,CHAR_REPL)

    )

    ,CLEAN_SET AS

    (

    SELECT

    1 AS LEVEL_NO

    ,BD.LN_ID

    ,CAST(BD.LastName AS VARCHAR(50)) AS LastName

    ,BD.Problem

    FROM BASE_DATA BD

    UNION ALL

    SELECT

    BD.LEVEL_NO + 1

    ,BD.LN_ID

    ,CASE

    WHEN CHARINDEX(UC.CHAR_VAL,BD.LastName) > 0 THEN CAST(REPLACE(BD.LastName,UC.CHAR_VAL,UC.CHAR_REPL) AS VARCHAR(50))

    ELSE BD.LastName

    END AS LastName

    ,BD.Problem

    FROM CLEAN_SET BD

    OUTER APPLY UNWANTED_CHARS UC

    WHERE CHARINDEX(UC.CHAR_VAL,BD.LastName) > 0

    )

    ,FINAL_SET AS

    (

    SELECT

    CT.LEVEL_NO

    ,CT.LN_ID

    ,ROW_NUMBER() OVER

    (

    PARTITION BY CT.LN_ID

    ORDER BY CT.LEVEL_NO DESC

    ) AS CT_RID

    ,CT.LastName

    ,CT.Problem

    FROM CLEAN_SET CT

    )

    SELECT

    FS.LN_ID

    ,FS.LastName

    ,FS.Problem

    FROM FINAL_SET FS

    WHERE FS.CT_RID = 1

    Results

    LN_ID LastName Problem

    ------ ----------------------- --------------------------------------------------------

    1 JONES Normal data

    2 Smith The most simple example

    3 Fearnley-Whittingstall Works ok, includes both names and hyphen

    4 Cholmondeley-Warner Includes Cholmondeley but truncates at the hyphen

    5 O'Driscoll Works ok

    6 O'Briain Works ok

    7 Porter O'Dowd Includes Porter but truncates at the backslash

    8 O'Mara Works ok

    9 O'Hara Works ok

    10 O'GaraD Doesn't work at all, the D at the end makes it fall over

    11 Cox' Works ok

  • Thanks Eirikur and Koen.

    I'm going to pursue the CTE approach for the time being I think. The solution Eirikur proposed is close but still returns incorrect characters. There's also no way of knowing what kind of rubbish I'll find in there, so hard-coding the characters I'm looking for is risky. With regards to the regex option I can't assume that I've got MDS installed, neither am I likely to be able to install it. I'm merely a BI guy and we have very little say in what does and does not get installed :angry:

    I'm not too concerned with efficiency for now. This is for a one-off load of a small set of existing data so if it's a bit slow I can live with it. When it comes to the day to day loading I'll be much more able to control what I have to deal with.


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • Quick though, you can play around with the order of values in the UNWANTED_CHARS, add some if missing and so on, should get you pretty close. As a general approach, add a key to UNWANTED_CHARS, enter the longest bad/search values first and add order by in the rCTE.

    😎

  • I just wanted to play around a little bit and this might be a good option to clean the data using an inline table-valued function.

    The function:

    CREATE FUNCTION dbo.CleanNames( @Name varchar(8000))

    RETURNS TABLE WITH SCHEMABINDING

    AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@Name),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    SELECT replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    ltrim(

    replace(

    CAST((SELECT CASE WHEN SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 1) LIKE '[A-Za-z]'

    OR SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 2) LIKE '[''-][A-Za-z]'

    THEN SUBSTRING( @Name, N, 1) ELSE ' ' END

    FROM cteTally

    FOR XML PATH('')) AS varchar(8000)),

    ' ',' ')

    ),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')CleanName;

    GO

    How it can be used:

    SELECT

    LastName

    ,cn.CleanName

    ,Problem

    FROM #tmp_lastnames

    CROSS APPLY dbo.CleanNames( lastname) cn

    The results:

    LastName CleanName Problem

    ----------------------- ----------------------- ----------------------------------------------------------

    JONES JONES Normal data

    Smith## Smith The most simple example

    Fearnley-Whittingstall Fearnley-Whittingstall Works ok, includes both names and hyphen

    Cholmondeley-Warner## Cholmondeley-Warner Includes Cholmondeley but truncates at the hyphen

    O'Driscoll O'Driscoll Works ok

    O'Briain## O'Briain Works ok

    Porter\O'Dowd## Porter O'Dowd Includes Porter but truncates at the backslash

    O'Mara ## O'Mara Works ok

    O'Hara -## O'Hara Works ok

    O'Gara^^#D O'Gara D Doesn't work at all, the D at the end makes it fall over

    Cox##' Cox Works ok

    I didn't test the performance, but it shouldn't be too bad. 😛

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I think you were pretty close... You could do this:

    WITH removebadstuff(old,new) AS

    (

    SELECT LastName, replace(replace(replace(replace(replace(lastname,'#',''),'^',''),'/',''),'\',''),' -','')

    FROM #tmp_lastnames

    )

    SELECTold, -- old here for display only, not needed

    new =

    case patindex('%[a-z][^a-z]',new)

    when 0 then new

    else substring(new,1,patindex('%[a-z][^a-z]',new))

    end

    FROM removebadstuff;

    Which returns:

    oldnew

    -------------------------------------------------- ------------------------------------

    JONESJONES

    Smith## Smith

    Fearnley-WhittingstallFearnley-Whittingstall

    Cholmondeley-Warner##Cholmondeley-Warner

    O'DriscollO'Driscoll

    O'Briain##O'Briain

    Porter\O'Dowd##Porter O'Dowd

    O'Mara ##O'Mara

    O'Hara -##O'Hara

    O'Gara^^#DO'GaraD

    Cox##'Cox

    Edit: the alignment of the results pasted are jacked up. I tried to fix but made it worse so I gave up. You get my point though 😛

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • To avoid results like O'GaraD you're going to have to employ logic such as (possibly):

    Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.

  • Erin Ramsay (10/9/2014)


    To avoid results like O'GaraD you're going to have to employ logic such as (possibly):

    Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.

    My code handles that issue without a problem.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (10/9/2014)


    Erin Ramsay (10/9/2014)


    To avoid results like O'GaraD you're going to have to employ logic such as (possibly):

    Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid. So possibly find the position of the first valid character and substring to the last valid character. Too busy to play with it right now.

    My code handles that issue without a problem.

    Unfortunately Luis your code does return 'O'Gara D'. I also get an XML blank space reference after each name and in O'Gara's case between the final A and the rogue D. I think this is down to a difference in settings\collation(?) though.

    The other problem that everybody's had is that Porter\O'Dowd needs to keep the \. This is because the ops side often enter aliases like that rather than USING THE ALIAS BOX LIKE THEY'RE SUPPOSED TO!!!! [/rant]

    Any characters that occur BEORE or AFTER an invalid character are also explicitly invalid.

    I'm really not sure what you mean by that Erin. If any characters before OR after the invalid character are invalid, where are the valid characters?

    I starting to think that a one size fits all approach might not be the way to go.

    Thanks for all your efforts on this everybody.


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • I see what went wrong. The site converted the xml space tag into a space.:w00t:

    Other than that, could you explain the output that you need for "O'Gara^^#D"?

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • It should just be O'Gara, there's quite a few names with Name #D in the list as it turns out.

    The good news is it appears it's not a big issue to take out the special characters. The data warehouse I'm basing my project on has all the special codes in there. Rightly or wrongly I'm taking that as a sign to ignore it and concentrate on the rest of the load procs etc.

    I'm going to carry on working on this as a side-side-project though. It's the kind of thing that will almost certainly be useful at some stage.


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • Maybe this version will help you.

    CREATE FUNCTION dbo.CleanNames( @Name varchar(8000))

    RETURNS TABLE WITH SCHEMABINDING

    AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@Name),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    SELECT replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    ltrim(

    (SELECT CASE WHEN SUBSTRING( @Name COLLATE Latin1_General_Bin, N - 1, 2) LIKE '[^#][A-Za-z]'

    OR (SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 1) LIKE '[A-Za-z]' AND N = 1)

    OR SUBSTRING( @Name COLLATE Latin1_General_Bin, N, 2) LIKE '[''-\][A-Za-z]'

    THEN SUBSTRING( @Name, N, 1) ELSE ' ' END

    FROM cteTally

    FOR XML PATH(''),TYPE).value('.', 'varchar(8000)')

    ),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ') CleanName;

    GO

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply