How do I extract numbers from a text field?

  • Hi All

    I have a text field in a table that allows free form text.

    The users will often put 8-digit account codes in amongst the text in that field. Each record may contain 0, 1, 2 or 3 account codes. There may also be other numeric codes that do not conform to the account code standard.

    I also know that all account codes start with a "6".

    Please, can anyone help me with some code to extract these account codes?

    Many thanks

    K

  • What is the data type of the "field"? And, yes, it will make a difference.

    Also, what is the maximum length of the data in the "field"? It could make a difference.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff

    It is a varchar field of length 90.

    regards

    K

  • Would you please provide a (few) sample line(s) and your expected result?

    It would make it easier for some of us than talking plain theory...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • Hi Lutz

    3 example records to be filtered:

    64112358 and 65888999

    64199855 & 88859975 & 65889972

    there are 4984118 and X4481566 and 69982214

    --------

    The expected results are:

    64112358, 65888999

    64199855, 65889972

    69982214

    ---------

    Any help with the above examples would be most appreciated.

    regards

    K
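
    Outside of T-SQL, the filtering rule implied by the examples above — a token counts only if it is exactly 8 digits and starts with "6" — can be sketched in a few lines of Python (the regex is my reading of that rule, not code from the thread):

    ```python
    import re

    # Sample rows copied from the examples above.
    rows = [
        "64112358 and 65888999",
        "64199855 & 88859975 & 65889972",
        "there are 4984118 and X4481566 and 69982214",
    ]

    # Assumed rule: an account code is exactly 8 digits and starts with "6".
    # The \b word boundaries reject partial tokens such as "X4481566".
    account_code = re.compile(r"\b6\d{7}\b")

    for row in rows:
        print(", ".join(account_code.findall(row)))
    # prints:
    # 64112358, 65888999
    # 64199855, 65889972
    # 69982214
    ```

    The same three checks (length, leading "6", digits only) are what the LIKE pattern in the SQL solutions below has to express.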

  • Please find some sample code below.

    Some notes:

    If you already have a unique value (like an ID) assigned to each row, the first CTE can be omitted.

    Also, if you can guarantee that each row will have at least one matching value, you could use the 3rd CTE as your final output, omitting the last SELECT.

    Finally, I don't know how many rows you're going to process. It might help performance to store the results from cte2 in an (intermediate) temp table.

    If you don't have a string split function yet, please search this site for it. There are several versions available.

    SET NOCOUNT ON

    DECLARE @t TABLE (col varchar(90))
    INSERT INTO @t SELECT '64112358 and 65888999'
    INSERT INTO @t SELECT '64199855 & 88859975 & 65889972'
    INSERT INTO @t SELECT 'there are 4984118 and X4481566 and 69982214'
    INSERT INTO @t SELECT '699821E1' -- added by myself to get an empty row

    -- assign an ID to each string; might be obsolete if an ID is already available
    ;WITH cte1 AS
    (
        SELECT ROW_NUMBER() OVER(ORDER BY col) id, col
        FROM @t t
    ),
    -- split the string using a string split function, applied to each row of the source table
    cte2 AS
    (
        SELECT id, item
        FROM cte1
        CROSS APPLY dbo.DelimitedSplit(cte1.col, ' ')
    ),
    -- re-concatenate the string with matching numbers only;
    -- might result in NULL values if a row doesn't have a matching number in it
    cte3 AS
    (
        SELECT STUFF(
                   (
                    SELECT ', ' + item
                    FROM cte2 t2
                    WHERE t2.id = t1.id
                      AND t2.item LIKE '6' + REPLICATE('[0-9]', 7)
                    FOR XML PATH('')
                   ), 1, 2, '') result
        FROM cte2 t1
        GROUP BY id
    )
    -- final select, eliminating NULL values
    SELECT * FROM cte3 WHERE result IS NOT NULL

    /* result set
    64112358, 65888999
    64199855, 65889972
    69982214 */



    Lutz

  • lmu92 (11/14/2009)


    If you don't have a string split function yet, please search this site for it. There are several versions available.

    Heh... Lutz... why don't you just provide the one you used to test the code above with? 😉

    --Jeff Moden



  • @jeff:

    Well, I didn't post it for three reasons:

    1) The function I'm using is mostly copied from one of the articles I've referred to several dozen times in other posts, so I felt I shouldn't post code I didn't come up with :blush:,

    2) the function uses a Tally table, and I didn't feel like posting the code for that one as well since it would be "slightly off topic", and

    3) the OP now has a solution to his/her problem and I thought he/she could spend some time doing research on string split functions. Besides being fun, trying to understand all the various methods is really educational. 🙂

    Anyway, since my "split string tutor" ( 😉 ) is asking me to post what he taught me, here it is:

    CREATE FUNCTION [dbo].[DelimitedSplit]
    (
        @pString VARCHAR(max),
        @pDelimiter CHAR(1)
    )
    RETURNS TABLE
    AS
    RETURN
    SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemID,
           SUBSTRING(@pDelimiter + @pString + @pDelimiter,
                     N + 1,
                     CHARINDEX(@pDelimiter, @pDelimiter + @pString + @pDelimiter, N + 1) - N - 1
                    ) AS Item
    FROM Tally
    WHERE N < LEN(@pDelimiter + @pString + @pDelimiter)
      AND SUBSTRING(@pDelimiter + @pString + @pDelimiter, N, 1) = @pDelimiter

    The changes I made to the code you posted in your Tally table article were to turn it into an ITVF.

    Side note: the CREATE FUNCTION chapter in BOL is the one that's highlighted when you search for "create FUN" 😀
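
    For anyone trying to follow the SUBSTRING/CHARINDEX arithmetic in the function above, here's a rough Python rendering of the same Tally-based idea (my own sketch, not code from the thread): pad the string with the delimiter on both ends, treat every delimiter position as the Tally table's N, and cut from N+1 up to the next delimiter.

    ```python
    def delimited_split(p_string: str, p_delimiter: str):
        """Rough Python rendering of the Tally-based splitter:
        every delimiter in the padded string marks the start of an item."""
        padded = p_delimiter + p_string + p_delimiter
        items = []
        # enumerate() plays the role of the Tally table's N column
        # (0-based here, where the SQL version is 1-based)
        for n, ch in enumerate(padded):
            # mirror the WHERE clause: a delimiter at N, excluding the final one
            if ch == p_delimiter and n < len(padded) - 1:
                nxt = padded.index(p_delimiter, n + 1)
                # mirror SUBSTRING(padded, N + 1, CHARINDEX(...) - N - 1)
                items.append(padded[n + 1:nxt])
        return items

    print(delimited_split("64112358 and 65888999", " "))
    # → ['64112358', 'and', '65888999']
    ```

    The SQL version does exactly this, but set-based: the Tally table supplies every candidate N at once instead of looping.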



    Lutz

  • Hi Lutz, Jeff

    Thanks very much for your help.

    You have introduced me to some new concepts and techniques.

    I will give it a go, and see how it pans out.

    Thanks again.

    K

  • You're welcome, K... if you really want to learn how the Tally table works to replace WHILE loops like Lutz did above, click on the TALLY table link in his signature line.

    --Jeff Moden



  • Jeff Moden (11/14/2009)


    You're welcome, K... if you really want to learn how the Tally table works to replace WHILE loops like Lutz did above, click on the TALLY table link in his signature line.

    Hey Jeff, how does it feel to point somebody to your own article referenced in the signature of someone else? 😀

    Note: How did I do in terms of changing your string split function? Passed or failed?



    Lutz

  • @Jeff:

    Well, I didn't post it for three reasons:

    Heh... no worries, Lutz. It just struck me as odd because you normally cough up a whole solution. 🙂

    --Jeff Moden



  • ... and the split function is just fine. I've tweaked it with some optimizations over time... I'll be right back to show you...

    --Jeff Moden



  • Here are the optimizations I was talking about. There are two different types of optimizations... the first is to reduce the number of concatenations and other calculations as much as possible. You'll see that in the function code I've included.

    The second "optimization" isn't really an optimization... it's a realization of how much the use of VARCHAR(MAX) sucks the life out of code. My speculation on that was that VARCHAR(MAX) is always treated as an "out of row" datatype but it was just a speculation. Others have insisted that VARCHAR(MAX) will be "in row" if it fits. The bottom line is, whatever... VARCHAR(MAX) performs much slower than otherwise equivalent code that uses VARCHAR(8000). In English, don't use VARCHAR(MAX) if VARCHAR(8000) will do.

    To commence, here's my normal million row test table with some code that builds 6 randomly generated INTs separated by commas... the rest of the columns are mostly there just because this code is used for a lot of different tests, and they also give a little bulk to the table as if it were a real table... for all those that wish to "play along", this doesn't take very long to build (49 seconds on my 7 year old, single 1.8 GHz desktop box).

    --===== Create and populate a 1,000,000 row test table.
         -- Column "SomeID" has a range of 1 to 1,000,000 unique numbers
         -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
         -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
         -- Column "SomeCSV" contains a random 6 element CSV.
         -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
         -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
         -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
         -- Jeff Moden
     SELECT TOP 1000000
            SomeID       = IDENTITY(INT,1,1),
            SomeInt      = ABS(CHECKSUM(NEWID()))%50000+1,
            SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                         + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
            SomeCSV      = CAST(
                             CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
                           + CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
                           + CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
                           + CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
                           + CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
                           + CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))
                           AS VARCHAR(8000)),
            SomeMoney    = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
            SomeDate     = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
            SomeHex12    = RIGHT(NEWID(),12)
       INTO dbo.JBMTest
       FROM Master.dbo.SysColumns t1
      CROSS JOIN Master.dbo.SysColumns t2

    --===== Create a clustered primary key. Takes about 1 second to execute.
     ALTER TABLE dbo.JBMTest
       ADD PRIMARY KEY CLUSTERED (SomeID)

    Here's Lutz's good function... notice the concatenation that puts a delimiter on both ends of the data being split. There's also a fair bit of unit addition and subtraction on "N". That's the way I used to do it, as well. It does make things a little bit easier to understand if you have to troubleshoot the code, but it also makes it a fair bit slower over, say, 10,000 rows of data to split. Also notice that the string to be split is declared as a VARCHAR(MAX)...

    --===== Lutz's splitter function from above...
    CREATE FUNCTION [dbo].[DelimitedSplit]
    (
        @pString VARCHAR(max),
        @pDelimiter CHAR(1)
    )
    RETURNS TABLE
    AS
    RETURN
    SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemID,
           SUBSTRING(@pDelimiter + @pString + @pDelimiter,
                     N + 1,
                     CHARINDEX(@pDelimiter, @pDelimiter + @pString + @pDelimiter, N + 1) - N - 1
                    ) AS Item
    FROM Tally
    WHERE N < LEN(@pDelimiter + @pString + @pDelimiter)
      AND SUBSTRING(@pDelimiter + @pString + @pDelimiter, N, 1) = @pDelimiter

    Here's virtually the same code with the optimizations to reduce the concatenation and the unit math on "N". This works because you really only need to look at one delimiter at a time depending on where you are in the code. If you're in the SELECT list, you only need to look at the trailing comma. In the WHERE clause, you only need to look at the leading comma. The code has been optimized to take both of those notions into account, which also reduces the unit math on "N".

    CREATE FUNCTION dbo.SplitMAX
    /*****************************************************************************************
     Purpose:
     High speed "Inline" Table Value Function (ITVF) that accepts a delimited string and a
     single character delimiter and returns the element number (position) and element value
     as a table suitable for use in a FROM clause or a CROSS APPLY.

     Dependencies:
     Requires a "Tally" table with at least the same number of values as the string to be
     split. Please see the following URL for what a Tally table is, how to build one, and
     how it replaces a WHILE Loop for many things.

     Programmer's Note:
     This function uses VARCHAR(MAX) as the input string parameter which automatically makes
     this function about 3-8 times slower than the VARCHAR(7999) version even though the code
     to do the split is identical.
     --Jeff Moden
    *****************************************************************************************/
    --===== Declare the I/O for this function
    (
        @pString VARCHAR(MAX),
        @pDelimiter CHAR(1)
    )
    RETURNS TABLE
    AS
    RETURN
    (
     SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS ElementNumber,
            SUBSTRING(@pDelimiter+@pString,
                      t.N+1,
                      CHARINDEX(@pDelimiter,@pString+@pDelimiter,t.N)-t.N) AS ElementValue
       FROM dbo.Tally t
      WHERE N <= LEN(@pString)+1
        AND SUBSTRING(@pDelimiter+@pString,t.N,1) = @pDelimiter
    )

    If we now run both splitters on the same 10,000 rows of data, we'll see that those optimizations produce almost a 2 second performance gain... that's quite a bit when we're only talking 10,000 rows and CPU usage of less than 10 seconds for either method (again, on my box)....

    DECLARE @LinesToSplit INT
    SELECT @LinesToSplit = 10000

    PRINT REPLICATE('=',80)
    PRINT '========== Optimized function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ElementNumber,
           split.ElementValue
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    PRINT REPLICATE('=',80)
    PRINT '========== Original non-optimized function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ItemID,
           split.Item
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    Results from code above...

    ================================================================================
    ========== Optimized function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 7813 ms, elapsed time = 9325 ms.
    ================================================================================
    ========== Original non-optimized function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 9656 ms, elapsed time = 11506 ms.

    If the only thing that we change in the "SplitMAX" function is the datatype of the string parameter (and the name of the function to match), the code would look like this...

    CREATE FUNCTION dbo.Split8K
    /*****************************************************************************************
     Purpose:
     High speed "Inline" Table Value Function (ITVF) that accepts a delimited string and a
     single character delimiter and returns the element number (position) and element value
     as a table suitable for use in a FROM clause or a CROSS APPLY.

     Dependencies:
     Requires a "Tally" table with at least 8000 values. Please see the following URL for
     what a Tally table is, how to build one, and how it replaces a WHILE Loop for many things.

     Programmer's Note:
     This function uses VARCHAR(7999) as the input string parameter which automatically makes
     this function about 3-8 times faster than the VARCHAR(MAX) version even though the code
     to do the split is identical.
     --Jeff Moden
    *****************************************************************************************/
    --===== Declare the I/O for this function
    (
        @pString VARCHAR(7999),
        @pDelimiter CHAR(1)
    )
    RETURNS TABLE
    AS
    RETURN
    (
     SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS ElementNumber,
            SUBSTRING(@pDelimiter+@pString,
                      t.N+1,
                      CHARINDEX(@pDelimiter,@pString+@pDelimiter,t.N)-t.N) AS ElementValue
       FROM dbo.Tally t
      WHERE N <= LEN(@pString)+1
        AND SUBSTRING(@pDelimiter+@pString,t.N,1) = @pDelimiter
    )
    GO

    Now, let's run all 3 functions on the same 10,000 rows of data and see what we come up with...

    DECLARE @LinesToSplit INT
    SELECT @LinesToSplit = 10000

    PRINT REPLICATE('=',80)
    PRINT '========== Optimized VARCHAR(MAX)function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ElementNumber,
           split.ElementValue
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    PRINT REPLICATE('=',80)
    PRINT '========== Original non-optimized function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ItemID,
           split.Item
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    PRINT REPLICATE('=',80)
    PRINT '========== Optimized VARCHAR(7999)function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ElementNumber,
           split.ElementValue
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.Split8K(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    And, here are the results from that...

    ================================================================================
    ========== Optimized VARCHAR(MAX)function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 7844 ms, elapsed time = 9451 ms.
    ================================================================================
    ========== Original non-optimized function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 9359 ms, elapsed time = 11358 ms.
    ================================================================================
    ========== Optimized VARCHAR(7999)function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 1094 ms, elapsed time = 2500 ms.

    As you can see, the VARCHAR(7999) function beat the pants off of both VARCHAR(MAX) functions. The code in the SplitMAX and Split8K functions is absolutely identical except for the data type of the string parameter. Like I said, VARCHAR(MAX) sucks the life out of code, and if you can avoid using VARCHAR(MAX), it makes for regular "Martha Stewart Moments".

    Of course, you should also avoid storing delimited and freetext data that contains important data elements in the database to begin with. 😉

    --Jeff Moden



  • Almost forgot... even though the CROSS APPLY works well, if you want to shave off another 30-40% on CPU usage, then do the split directly in code instead of using a function... I've added such direct code to the end of the test code below...

    DECLARE @LinesToSplit INT
    SELECT @LinesToSplit = 10000

    PRINT REPLICATE('=',80)
    PRINT '========== Optimized VARCHAR(MAX)function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ElementNumber,
           split.ElementValue
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    PRINT REPLICATE('=',80)
    PRINT '========== Original non-optimized function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ItemID,
           split.Item
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    PRINT REPLICATE('=',80)
    PRINT '========== Optimized VARCHAR(7999)function =========='
    SET STATISTICS TIME ON
    SELECT csvsrc.SomeID,
           split.ElementNumber,
           split.ElementValue
      FROM dbo.JBMTest csvsrc
     CROSS APPLY dbo.Split8K(csvsrc.SomeCsv,',') split
     WHERE csvsrc.SomeID <= @LinesToSplit
    SET STATISTICS TIME OFF

    PRINT REPLICATE('=',80)
    PRINT '========== Optimized direct split code =========='
    SET STATISTICS TIME ON
    SELECT csvtab.SomeID,
           SUBSTRING(','+csvtab.SomeCsv,t.N+1,CHARINDEX(',',csvtab.SomeCsv+',',t.N)-t.N) AS SplitValue
      FROM dbo.Tally t
     CROSS JOIN dbo.JBMTest csvtab
     WHERE N <= LEN(csvtab.SomeCsv)+1
       AND SUBSTRING(','+csvtab.SomeCsv,t.N,1) = ','
       AND SomeID <= @LinesToSplit
     ORDER BY csvtab.SomeID, t.N
    SET STATISTICS TIME OFF

    ... and here are the results... the direct code blows everything away for this test...

    ================================================================================
    ========== Optimized VARCHAR(MAX)function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 7672 ms, elapsed time = 9402 ms.
    ================================================================================
    ========== Original non-optimized function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 9343 ms, elapsed time = 11209 ms.
    ================================================================================
    ========== Optimized VARCHAR(7999)function ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 1094 ms, elapsed time = 2554 ms.
    ================================================================================
    ========== Optimized direct split code ==========
    (60000 row(s) affected)
    SQL Server Execution Times:
       CPU time = 453 ms, elapsed time = 2041 ms.

    --Jeff Moden



Viewing 15 posts - 1 through 15 (of 21 total)
