TSQL Fastest way to strip non-numeric characters from a string

  • Hi,

    I am looking for the fastest way to strip non-numeric characters from a string.

    I have a user database that has a column (USER_TELNO) in which the user can drop a telephone number (for example '+31 (0)12-123 456'). An extra computed column (FORMATTED_TELNO) should contain the formatted telephone number (31012123456 in the example)

    Note: the column FORMATTED_TELNO must be indexed, so the UDF in the computed column has WITH SCHEMABINDING.... I think this implicates that a CLR call won't work....

    Thanks!

  • Raymond van Laake (7/15/2015)


    Hi,

    I am looking for the fastest way to strip non-numeric characters from a string.

    I have a user database that has a column (USER_TELNO) in which the user can drop a telephone number (for example '+31 (0)12-123 456'). An extra computed column (FORMATTED_TELNO) should contain the formatted telephone number (31012123456 in the example)

    Note: the column FORMATTED_TELNO must be indexed, so the UDF in the computed column has WITH SCHEMABINDING.... I think this implicates that a CLR call won't work....

    Thanks!

    Have a look at this thread from last year

    😎

  • Yes, I saw older posts, was just wondering what fastest would be.

    I eventually came up with this based on older posts:

    ALTER FUNCTION [dbo].[__fx_test] (@countryNr INT, @telNr NCHAR(30))

    RETURNS CHAR(30)

    WITH SCHEMABINDING

    AS

    BEGIN

    DECLARE @fTelNr VARCHAR(30) = ''

    SELECT @fTelNr = @fTelNr + SUBSTRING(@telNr,N,1)

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)) AS x(N)

    WHERE N<=LEN(@telNr)

    AND SUBSTRING(@telNr,N,1) LIKE ('[0-9]')

    RETURN @fTelNr

    END

  • Raymond van Laake (7/15/2015)


    Yes, I saw older posts, was just wondering what fastest would be.

    I eventually came up with this based on older posts:

    ALTER FUNCTION [dbo].[__fx_test] (@countryNr INT, @telNr NCHAR(30))

    RETURNS CHAR(30)

    WITH SCHEMABINDING

    AS

    BEGIN

    DECLARE @fTelNr VARCHAR(30) = ''

    SELECT @fTelNr = @fTelNr + SUBSTRING(@telNr,N,1)

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)) AS x(N)

    WHERE N<=LEN(@telNr)

    AND SUBSTRING(@telNr,N,1) LIKE ('[0-9]')

    RETURN @fTelNr

    END

    Since you're looking for the fastest way, might I suggest a couple of small improvements:

    1. You have this coded as a scalar function. An inline table-valued function will almost always perform better (usually significantly). This would require changing your query calling the function to CROSS APPLY to this function, instead of calling it in the column list (see my blog post[/url] showing how an inline function works better).

    2. Even though it is unlikely, since the query doesn't have an ORDER BY clause, the numbers could possibly be returned out of order. No ORDER BY clause = no guarantee of order.

    3. String concatenation in Windows is notoriously slow. Better would be to:

    4. Using FOR XML PATH(''), TYPE is usually the quickest way to create strings.

    Based upon all of that, I suggest:

    CREATE FUNCTION dbo.[__fx_test] (@countryNr INT, @telNr NCHAR(30))

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    SELECT (SELECT SUBSTRING(@telNr,N,1)

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)) AS x(N)

    WHERE N<=LEN(@telNr)

    AND SUBSTRING(@telNr,N,1) LIKE ('[0-9]')

    ORDER BY N

    FOR XML PATH(''), TYPE).value('.','NCHAR(30)')

    GO

    You can read about how this works in my SSC article[/url]

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Wow... thanks for the improvements!

    Raymond

  • I'm glad that you can use it.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Wayne,

    Why NCHAR(30) instead of VARCHAR(30) ? For a phone number, it makes no sense to waste the extra byte per digit when you only need to account for 0 thru 9, and you certainly need not waste the space on all the additional characters out to 30 for each and every record.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Just piling on a little, here is a test harness with three different versions

    😎

    Echoing Wain's reply, don't recommend using the variable type concatenation as the order is not guarantied (but it is faster than XML concatenation )

    USE tempdb;

    GO

    SET NOCOUNT ON;

    IF OBJECT_ID(N'dbo.TEST_CLEAN_NUM_FUNC') IS NOT NULL DROP TABLE dbo.TEST_CLEAN_NUM_FUNC;

    CREATE TABLE dbo.TEST_CLEAN_NUM_FUNC

    (

    TCNF_ID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DBO_TEST_CLEAN_NUM_FUNC_TCFN_ID PRIMARY KEY CLUSTERED

    ,TCNF_NUMSTR VARCHAR(30) NOT NULL

    )

    ;

    GO

    IF OBJECT_ID(N'dbo._iTVFN_tEEst') IS NOT NULL DROP FUNCTION dbo._iTVFN_tEEst;

    GO

    CREATE FUNCTION [dbo]._iTVFN_tEEst (@countryNr INT, @telNr NCHAR(30))

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    SELECT

    (

    SELECT SUBSTRING(@telNr,N,1)

    FROM (

    SELECT TOP(LEN(@telNr)) N

    FROM (VALUES (1) ,(2) ,(3) ,(4) ,(5) ,(6) ,(7) ,(8) ,(9) ,(10)

    ,(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)

    ,(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)

    ) AS Y(N)

    ) AS X(N)

    WHERE ((ASCII(SUBSTRING(@telNr,X.N,1)) - 48) & 0x7FFF) < 10

    FOR XML PATH(''),TYPE).value('.','NCHAR(30)'

    ) AS STR_OUT

    ;

    GO

    IF OBJECT_ID(N'dbo.__fx_tEEst') IS NOT NULL DROP FUNCTION dbo.__fx_tEEst;

    GO

    CREATE FUNCTION [dbo].__fx_tEEst (@countryNr INT, @telNr NCHAR(30))

    RETURNS CHAR(30)

    WITH SCHEMABINDING

    AS

    BEGIN

    DECLARE @fTelNr VARCHAR(30) = ''

    SELECT @fTelNr = @fTelNr + SUBSTRING(@telNr,N,1)

    FROM (

    SELECT TOP(LEN(@telNr)) N

    FROM (VALUES (1) ,(2) ,(3) ,(4) ,(5) ,(6) ,(7) ,(8) ,(9) ,(10)

    ,(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)

    ,(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)

    ) AS Y(N)

    ) AS X(N)

    WHERE ((ASCII(SUBSTRING(@telNr,X.N,1)) - 48) & 0x7FFF) < 10

    RETURN @fTelNr

    END

    GO

    IF OBJECT_ID(N'dbo.__fx_test') IS NOT NULL DROP FUNCTION dbo.__fx_test;

    GO

    CREATE FUNCTION [dbo].[__fx_test] (@countryNr INT, @telNr NCHAR(30))

    RETURNS CHAR(30)

    WITH SCHEMABINDING

    AS

    BEGIN

    DECLARE @fTelNr VARCHAR(30) = ''

    SELECT @fTelNr = @fTelNr + SUBSTRING(@telNr,N,1)

    FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)) AS x(N)

    WHERE N<=LEN(@telNr)

    AND SUBSTRING(@telNr,N,1) LIKE ('[0-9]')

    RETURN @fTelNr

    END

    GO

    DECLARE @SAMPLE_NUMS TABLE (SN_NUM VARCHAR(8))

    DECLARE @SAMPLE_DIGIT TABLE (SD_DIGIT VARCHAR(2))

    INSERT INTO @SAMPLE_NUMS (SN_NUM)

    SELECT

    CONVERT(VARCHAR(8),ABS(SAC.object_id) % 10000000,0)

    FROM sys.all_columns SAC

    INSERT INTO @SAMPLE_DIGIT(SD_DIGIT)

    VALUES

    ('(')

    ,(')')

    ,('-')

    ,(' ')

    ,('-(')

    ,(')-')

    ,(',')

    ,('.')

    ,('/')

    ,('@');

    INSERT INTO dbo.TEST_CLEAN_NUM_FUNC(TCNF_NUMSTR)

    SELECT TOP 1000000

    S1.SD_DIGIT + N1.SN_NUM + S2.SD_DIGIT + N3.SN_NUM + S3.SD_DIGIT

    FROM @SAMPLE_DIGIT S1

    CROSS JOIN @SAMPLE_NUMS N1

    CROSS JOIN @SAMPLE_DIGIT S2

    CROSS JOIN @SAMPLE_NUMS N3

    CROSS JOIN @SAMPLE_DIGIT S3;

    DECLARE @CHAR_BUCKET VARCHAR(30) = '';

    DECLARE @timer TABLE (T_TEXT VARCHAR(50) NOT NULL, T_TIME DATETIME2(7) NOT NULL DEFAULT(SYSDATETIME()));

    INSERT INTO @timer(T_TEXT) VALUES('DRY RUN');

    SELECT

    @CHAR_BUCKET = TF.TCNF_NUMSTR

    FROM dbo.TEST_CLEAN_NUM_FUNC TF;

    INSERT INTO @timer(T_TEXT) VALUES('DRY RUN');

    INSERT INTO @timer(T_TEXT) VALUES('[dbo].[__fx_test]');

    SELECT

    @CHAR_BUCKET = dbo.__fx_test('123', TF.TCNF_NUMSTR)

    FROM dbo.TEST_CLEAN_NUM_FUNC TF;

    INSERT INTO @timer(T_TEXT) VALUES('[dbo].[__fx_test]');

    INSERT INTO @timer(T_TEXT) VALUES('[dbo].[__fx_tEEst]');

    SELECT

    @CHAR_BUCKET = dbo.__fx_tEEst('123', TF.TCNF_NUMSTR)

    FROM dbo.TEST_CLEAN_NUM_FUNC TF;

    INSERT INTO @timer(T_TEXT) VALUES('[dbo].[__fx_tEEst]');

    INSERT INTO @timer(T_TEXT) VALUES('dbo._iTVFN_tEEst');

    SELECT

    @CHAR_BUCKET = X.STR_OUT

    FROM dbo.TEST_CLEAN_NUM_FUNC TF

    CROSS APPLY _iTVFN_tEEst('123', TF.TCNF_NUMSTR) AS X;

    INSERT INTO @timer(T_TEXT) VALUES('dbo._iTVFN_tEEst');

    SELECT

    T.T_TEXT

    ,DATEDIFF(MICROSECOND,MIN(T.T_TIME),MAX(T.T_TIME)) AS DURATION

    FROM @timer T

    GROUP BY T.T_TEXT

    ORDER BY DURATION;

    Results

    T_TEXT DURATION

    -------------------------------------------------- -----------

    DRY RUN 147009

    [dbo].[__fx_tEEst] 11648666

    dbo._iTVFN_tEEst 16361936

    [dbo].[__fx_test] 17223985

  • For the loop it should be the same:

    DECLARE @a CHAR(30)

    DECLARE @b-2 VARCHAR(30)

    SET @a = 'hello'

    SET @b-2 = 'hello'

    SELECT LEN(@a), LEN(@b)

    results in 5 on both variables

  • Eirikur,

    I changed your (very nice) test harness to put the results into a temp table vs. using @CHAR_BUCKET. My results are:

    T_TEXT DURATION

    -------------------------------------------------- -----------

    DRY RUN 194011

    dbo._iTVFN_tEEst 2383161

    [dbo].[__fx_tEEst] 13615779

    [dbo].[__fx_test] 20967199

    Which shows the ITVF to be appreciably faster.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Raymond van Laake (7/15/2015)


    For the loop it should be the same:

    DECLARE @a CHAR(30)

    DECLARE @b-2 VARCHAR(30)

    SET @a = 'hello'

    SET @b-2 = 'hello'

    SELECT LEN(@a), LEN(@b)

    results in 5 on both variables

    That's not where the concern is. It's the extra storage for the table to hold 30 characters instead of some lesser but variable number, and doubled because of NCHAR instead of CHAR.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson (7/15/2015)


    Wayne,

    Why NCHAR(30) instead of VARCHAR(30) ? For a phone number, it makes no sense to waste the extra byte per digit when you only need to account for 0 thru 9, and you certainly need not waste the space on all the additional characters out to 30 for each and every record.

    Steve,

    You're right... I just grabbed the datatype that was being passed in, and didn't even think about it.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (7/15/2015)


    Eirikur,

    I changed your (very nice) test harness to put the results into a temp table vs. using @CHAR_BUCKET. My results are:

    T_TEXT DURATION

    -------------------------------------------------- -----------

    DRY RUN 194011

    dbo._iTVFN_tEEst 2383161

    [dbo].[__fx_tEEst] 13615779

    [dbo].[__fx_test] 20967199

    Which shows the ITVF to be appreciably faster.

    Nice touch;-), thanks Wayne!

    😎

  • What's the final code now?

  • Raymond van Laake (7/15/2015)


    What's the final code now?

    This one here is by far the fastest as Wayne's test clearly shows

    😎

    IF OBJECT_ID(N'dbo._iTVFN_tEEst') IS NOT NULL DROP FUNCTION dbo._iTVFN_tEEst;

    GO

    CREATE FUNCTION [dbo]._iTVFN_tEEst (@countryNr INT, @telNr NCHAR(30))

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    SELECT

    (

    SELECT SUBSTRING(@telNr,N,1)

    FROM (

    SELECT TOP(LEN(@telNr)) N

    FROM (VALUES (1) ,(2) ,(3) ,(4) ,(5) ,(6) ,(7) ,(8) ,(9) ,(10)

    ,(11),(12),(13),(14),(15),(16),(17),(18),(19),(20)

    ,(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)

    ) AS Y(N)

    ) AS X(N)

    WHERE ((ASCII(SUBSTRING(@telNr,X.N,1)) - 48) & 0x7FFF) < 10

    FOR XML PATH(''),TYPE).value('.','NCHAR(30)'

    ) AS STR_OUT

    ;

    GO

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply