Removing non-alphabetic characters from a column

  • I have a single column table C

    column is fname varchar(255)

    I want to remove dirty data ie anything that is

    not alphabetic char.

    I cannot do this with a replace as I dont know what the

    character is in advance.

    I need to do this with looping ideally using common table expressions

    Any idea?

  • You could use a function, built with PATINDEX

    Create Function [dbo].[RemoveAlphaCharacters](@Temp VarChar(1000))

    Returns VarChar(1000)

    AS

    Begin

    While PatIndex('%[a-z]%', @Temp) > 0

    Set @Temp = Stuff(@Temp, PatIndex('%[a-z]%', @Temp), 1, '')

    Return @Temp

    End

    Using value sjhrh1049jsjf303

    select dbo.removealphacharacters ('sjhrh1049jsjf303')

    Returns

    1049303

  • How do I do this for all the values in my column fname?

  • pass in your column name instead of the string

    select dbo.removealphacharacters (fname) from SomeTableWhichContainsAColumnCalledFname

  • sj999 (9/4/2012)


    I need to do this with looping ideally using common table expressions

    Any idea?

    Yes... you don't need typical looping for this. You need a high speed psuedo-cursor in the form of a CROSS JOIN with a Tally Table or a cteTally. This will also allow an all-in-one query that will allow you to build the function as an iTVF or "Inline Table Value Function" for a real boost in performance.

    First, see the article at the following link for what a Tally Table is and how it can be used to replace certain types of loops like this one.

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    Next, build a permanent Tally Table. Like this...

    --===================================================================

    -- Create a Tally table from 1 to 11000

    --===================================================================

    --===== Create and populate the Tally table on the fly.

    SELECT TOP 11000

    IDENTITY(INT,1,1) AS N

    INTO dbo.Tally

    FROM Master.sys.ALL_Columns ac1

    CROSS JOIN Master.sys.ALL_Columns ac2

    ;

    --===== Add a CLUSTERED Primary Key to maximize performance

    ALTER TABLE dbo.Tally

    ADD CONSTRAINT PK_Tally_N

    PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    ;

    --===== Allow the general public to use it

    GRANT SELECT ON dbo.Tally TO PUBLIC

    ;

    GO

    Then, create the following general purpose string cleaner (Limited to VARCHAR(8000))...

    CREATE FUNCTION dbo.CleanString

    (@pString VARCHAR(8000),@pKeepPattern VARCHAR(8000))

    RETURNS TABLE

    AS

    RETURN

    SELECT CleanedString =

    (

    SELECT SUBSTRING(@pString, t.N, 1)

    FROM dbo.Tally t

    WHERE t.n <= LEN(@pString)

    AND SUBSTRING(@pString, t.N, 1) LIKE @pKeepPattern COLLATE Latin1_General_BIN

    ORDER BY t.n

    FOR XML PATH('')

    )

    ;

    After that, everything gets easy...

    SELECT yt.OriginalString,c.CleanedString

    FROM dbo.YourTable yt

    CROSS APPLY dbo.CleanString(yt.OriginalString,'[^A-Za-z]')

    ;

    That code will keep everything except the letters of the alphabet.

    If you want to keep only the numeric digets of 0 through 9, the call would be similar using a different LIKE filter.

    SELECT yt.OriginalString,c.CleanedString

    FROM dbo.YourTable yt

    CROSS APPLY dbo.CleanString(yt.OriginalString,'[0-9]')

    ;

    If you prefer to not use a Tally Table (it's really good for rapid and simple development), you can build the high speed no-reads equivalent of a Tally Table using a rendition of Itzek Ben-Gans cCTEs (Cascading CTEs) in the function like this. It's a bit more complicated but it produces virtually zero reads and is about as fast as using a permanent Tally Table.

    CREATE FUNCTION dbo.CleanString

    (@pString VARCHAR(8000),@pKeepPattern VARCHAR(8000))

    RETURNS TABLE

    AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...

    -- enough to cover VARCHAR(8000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    --===== Clean the string according to the "keep" pattern.

    SELECT CleanedString =

    (

    SELECT SUBSTRING(@pString, t.N, 1)

    FROM cteTally t

    WHERE t.n <= LEN(@pString)

    AND SUBSTRING(@pString, t.N, 1) LIKE @pKeepPattern COLLATE Latin1_General_BIN

    ORDER BY t.n

    FOR XML PATH('')

    )

    ;

    Whether you use the Tally Table version or the cteTally version, both are in a high performance iTVF. As you've seen, it requires a slightly different usage than a slower scalar function but it's worth it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply