November 12, 2009 at 10:43 pm
Hi All
I have a text field in a table that allows free-form text.
The users will often put 8-digit account codes in amongst the text in that field. Each record may contain 0, 1, 2 or 3 account codes. There may also be other numeric codes that do not conform to the account code standard.
I also know that all account codes start with a "6".
Please, can anyone help me with some code to extract these account codes?
Many thanks
K
November 12, 2009 at 11:50 pm
What is the data type of the "field"? And, yes, it will make a difference.
Also, what is the maximum length of the data in the "field"? It could make a difference.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 3:33 am
Hi Jeff
It is a varchar field of length 90.
regards
K
November 14, 2009 at 4:25 am
Would you please provide a (few) sample line(s) and your expected result?
It would make it easier for some of us than talking plain theory...
November 14, 2009 at 5:56 am
Hi Lutz
3 example records to be filtered:
64112358 and 65888999
64199855 & 88859975 & 65889972
there are 4984118 and X4481566 and 69982214
--------
The expected results are:
64112358, 65888999
64199855, 65889972
69982214
---------
Any help with the above examples would be most appreciated.
regards
K
November 14, 2009 at 8:05 am
Attached please find some sample code.
Some notes:
If you already have a unique value (like an ID) assigned to each row, the first CTE can be omitted.
Also, if you can guarantee that each row will have at least one matching value, you could use the 3rd CTE as your final output, omitting the last SELECT.
Finally, I don't know how many rows you're going to process. It might help performance to store the results from cte2 in an (intermediate) temp table (see the sketch after the sample code below).
If you don't have a string split function yet, please search this site for one. There are several versions available.
SET NOCOUNT ON

DECLARE @t TABLE (col VARCHAR(90))
INSERT INTO @t SELECT '64112358 and 65888999'
INSERT INTO @t SELECT '64199855 & 88859975 & 65889972'
INSERT INTO @t SELECT 'there are 4984118 and X4481566 and 69982214'
INSERT INTO @t SELECT '699821E1' -- added by me to get an empty row

-- assign an ID to each string; might be obsolete if an ID is already available
;WITH cte1 AS
(SELECT ROW_NUMBER() OVER(ORDER BY col) AS id, col
 FROM @t t
),
-- split the string using a string split function, applied to each row of the source table
cte2 AS
(SELECT id, Item
 FROM cte1
 CROSS APPLY dbo.DelimitedSplit(cte1.col, ' ')
),
-- re-concatenate the matching numbers only; might result in NULL values
-- if a row doesn't have a matching number in it
cte3 AS
(SELECT STUFF(
        (SELECT ', ' + Item
         FROM cte2 t2
         WHERE t2.id = t1.id
           AND t2.Item LIKE '6' + REPLICATE('[0-9]', 7)
         FOR XML PATH('')
        ), 1, 2, '') AS result
 FROM cte2 t1
 GROUP BY id
)
-- final SELECT, eliminating NULL values
SELECT * FROM cte3 WHERE result IS NOT NULL
/* result set
64112358, 65888999
64199855, 65889972
69982214*/
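In case you end up processing a lot of rows, here's an untested sketch of the temp table variant I mentioned in the notes above. It reuses the @t sample data and the DelimitedSplit function from the code above (#Split is just a name I picked):

-- materialize the split results once, instead of re-splitting in cte3
SELECT src.id, Item
INTO #Split
FROM (SELECT ROW_NUMBER() OVER(ORDER BY col) AS id, col FROM @t) src
CROSS APPLY dbo.DelimitedSplit(src.col, ' ')

-- same logic as cte3, reading from the temp table instead of re-splitting
SELECT result
FROM (SELECT STUFF(
             (SELECT ', ' + Item
              FROM #Split t2
              WHERE t2.id = t1.id
                AND t2.Item LIKE '6' + REPLICATE('[0-9]', 7)
              FOR XML PATH('')
             ), 1, 2, '') AS result
      FROM #Split t1
      GROUP BY id
     ) x
WHERE result IS NOT NULL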
November 14, 2009 at 1:01 pm
lmu92 (11/14/2009)
If you don't have a string split function yet, please search this site for one. There are several versions available.
Heh... Lutz... why don't you just provide the one you used to test the code above with? 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 2:06 pm
Well, I didn't post it for three reasons:
1) The function I'm using is mostly copied from one of the articles I've referred to several dozen times in other posts, so I felt I shouldn't post code I didn't come up with :blush:,
2) this function uses a Tally table, and I didn't really feel like posting the code to build that one since it would be "slightly off topic", and
3) the OP now has a solution to his/her problem and I thought he/she could spend some time doing research on string split functions. Besides being fun, trying to understand all their various methods is really educational. 🙂
Anyway, since my "split string tutor" ( 😉 ) is asking me to post what he taught me, here it is:
CREATE FUNCTION [dbo].[DelimitedSplit] (
    @pString    VARCHAR(MAX),
    @pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemID,
       SUBSTRING(@pDelimiter + @pString + @pDelimiter,
                 N + 1,
                 CHARINDEX(@pDelimiter, @pDelimiter + @pString + @pDelimiter, N + 1) - N - 1
                ) AS Item
FROM Tally
WHERE N < LEN(@pDelimiter + @pString + @pDelimiter)
  AND SUBSTRING(@pDelimiter + @pString + @pDelimiter, N, 1) = @pDelimiter
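A quick smoke test against one of the sample strings from above (this assumes a populated Tally table, which the function requires):

SELECT ItemID, Item
FROM dbo.DelimitedSplit('64112358 and 65888999', ' ')
/* returns:
ItemID  Item
1       64112358
2       and
3       65888999 */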
The changes I made to the code you posted in your Tally table article were to make it usable as an iTVF (inline table-valued function).
Side note: The CREATE FUNCTION chapter in BOL is the one that's highlighted when you search for "create FUN" 😀
November 14, 2009 at 3:00 pm
Hi Lutz, Jeff
Thanks very much for your help.
You have introduced me to some new concepts and techniques.
I will give it a go, and see how it pans out.
Thanks again.
K
November 14, 2009 at 3:20 pm
You're welcome, K... if you really want to learn how the Tally table works to replace WHILE loops like Lutz did above, click on the TALLY table link in his signature line.
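In the meantime, here's one common way to build the Tally table that the code in this thread depends on (the row count and names are just the ones I normally use):

--===== Create a Tally table of the numbers 1 through 11000
SELECT TOP 11000
       IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2

--===== Add a clustered primary key to keep lookups fast
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)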
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 3:42 pm
Jeff Moden (11/14/2009)
You're welcome, K... if you really want to learn how the Tally table works to replace WHILE loops like Lutz did above, click on the TALLY table link in his signature line.
Hey Jeff, how does it feel to point somebody to your own article referenced in the signature of someone else? 😀
Note: How did I do in terms of changing your string split function? Passed or failed?
November 14, 2009 at 4:42 pm
@Jeff:
Well, I didn't post it for three reasons:
Heh... no worries, Lutz. It just struck me as odd because you normally cough up a whole solution. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 4:45 pm
... and the split function is just fine. I've tweaked it with some optimizations over time... I'll be right back to show you...
--Jeff Moden
Change is inevitable... Change for the better is not.
November 15, 2009 at 7:16 am
Here are the optimizations I was talking about. There are two different types of optimizations... the first is to reduce the number of concatenations and other calculations as much as possible. You'll see that in the function code I've included.
The second "optimization" isn't really an optimization... it's a realization of how much the use of VARCHAR(MAX) sucks the life out of code. My speculation was that VARCHAR(MAX) is always treated as an "out of row" datatype, but that was just speculation. Others have insisted that VARCHAR(MAX) will be "in row" if it fits. The bottom line is, whatever... VARCHAR(MAX) performs much more slowly than otherwise identical code that uses VARCHAR(8000). In English, don't use VARCHAR(MAX) if VARCHAR(8000) will do.
To commence, here's my normal million-row test table with some code that builds 6 randomly generated INTs separated by commas... the rest of the columns are mostly there because this code is used for a lot of different tests, and they also give a little bulk to the table as if it were a real table... for all those who wish to "play along", this doesn't take very long to build (49 seconds on my 7-year-old, single 1.8 GHz desktop box).
--===== Create and populate a 1,000,000 row test table.
-- Column "RowNum" has a range of 1 to 100,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeCSV" contains a random 6 element CSV.
-- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
SomeID = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST(
CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))
AS VARCHAR(8000)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1
CROSS JOIN
Master.dbo.SysColumns t2
--===== Create a clustered primary key.
-- Takes about 1 second to execute.
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (SomeID)
Here's Lutz's good function... notice the concatenation that puts a delimiter on both ends of the data being split. There's also a fair bit of unit addition and subtraction on "N". That's the way I used to do it, as well. It does make things a little bit easier to understand if you have to troubleshoot the code, but it also makes it a fair bit slower over, say, 10,000 rows of data to split. Also notice that the string to be split is declared as a VARCHAR(MAX)...
--===== Lutz' Splitter Function from above...
CREATE FUNCTION [dbo].[DelimitedSplit] (
    @pString    VARCHAR(MAX),
    @pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemID,
       SUBSTRING(@pDelimiter + @pString + @pDelimiter,
                 N + 1,
                 CHARINDEX(@pDelimiter, @pDelimiter + @pString + @pDelimiter, N + 1) - N - 1
                ) AS Item
FROM Tally
WHERE N < LEN(@pDelimiter + @pString + @pDelimiter)
  AND SUBSTRING(@pDelimiter + @pString + @pDelimiter, N, 1) = @pDelimiter
Here's virtually the same code with the optimizations to reduce the concatenation and the unit math on "N". This works because you really only need to look at one delimiter at a time, depending on where you are in the code. If you're in the SELECT list, you only need to look at the trailing comma. In the WHERE clause, you only need to look at the leading comma. The code has been optimized to take both of those notions into account, which also reduces the unit math on "N".
CREATE FUNCTION dbo.SplitMAX
/*****************************************************************************************
Purpose:
High speed "Inline" Table Value Function (ITVF) accepts a delimited string and a single
character delimiter and returns the element number (position) and element value as a
table suitable for use in a FROM clause or a CROSS APPLY.
Dependencies:
Requires a "Tally" table with at least the same number of values as the string to be
split. Please see the following URL for what a Tally table is, how to build one, and
how it replaces a WHILE Loop for many things.
Programmer's Note:
This function uses VARCHAR(MAX) as the input string parameter which automatically makes
this function about 3-8 times slower than the VARCHAR(7999) version even though the code
to do the split is identical.
--Jeff Moden
*****************************************************************************************/
--===== Declare the I/O for this function
(
@pString VARCHAR(MAX),
@pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
(
SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS ElementNumber,
SUBSTRING(@pDelimiter+@pString,
t.N+1,
CHARINDEX(@pDelimiter,@pString+@pDelimiter,t.N)-t.N) AS ElementValue
FROM dbo.Tally t
WHERE N <= LEN(@pString)+1
AND SUBSTRING(@pDelimiter+@pString,t.N,1) = @pDelimiter
)
If we now run both splitters on the same 10,000 rows of data, we'll see that those optimizations produce almost a 2 second performance gain... that's quite a bit when we're only talking 10,000 rows and CPU usage of less than 10 seconds for either method (again, on my box)....
DECLARE @LinesToSplit INT
SELECT @LinesToSplit = 10000
PRINT REPLICATE('=',80)
PRINT '========== Optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Original non-optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ItemID,
split.Item
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
Results from code above...
================================================================================
========== Optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 7813 ms, elapsed time = 9325 ms.
================================================================================
========== Original non-optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 9656 ms, elapsed time = 11506 ms.
If the only thing that we change in the "SplitMAX" function is the datatype of the string parameter (and the name of the function to match), the code would look like this...
CREATE FUNCTION dbo.Split8K
/*****************************************************************************************
Purpose:
High speed "Inline" Table Value Function (ITVF) accepts a delimited string and a single
character delimiter and returns the element number (position) and element value as a
table suitable for use in a FROM clause or a CROSS APPLY.
Dependencies:
Requires a "Tally" table with at least 8000 values. Please see the following URL for
what a Tally table is, how to build one, and how it replaces a WHILE Loop for many things.
Programmer's Note:
This function uses VARCHAR(7999) as the input string parameter which automatically makes
this function about 3-8 times faster than the VARCHAR(MAX) version even though the code
to do the split is identical.
--Jeff Moden
*****************************************************************************************/
--===== Declare the I/O for this function
(
@pString VARCHAR(7999),
@pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
(
SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS ElementNumber,
SUBSTRING(@pDelimiter+@pString,
t.N+1,
CHARINDEX(@pDelimiter,@pString+@pDelimiter,t.N)-t.N) AS ElementValue
FROM dbo.Tally t
WHERE N <= LEN(@pString)+1
AND SUBSTRING(@pDelimiter+@pString,t.N,1) = @pDelimiter
)
GO
Now, let's run all 3 functions on the same 10,000 rows of data and see what we come up with...
DECLARE @LinesToSplit INT
SELECT @LinesToSplit = 10000
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(MAX)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Original non-optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ItemID,
split.Item
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(7999)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.Split8K(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
And, here are the results from that...
================================================================================
========== Optimized VARCHAR(MAX)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 7844 ms, elapsed time = 9451 ms.
================================================================================
========== Original non-optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 9359 ms, elapsed time = 11358 ms.
================================================================================
========== Optimized VARCHAR(7999)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 1094 ms, elapsed time = 2500 ms.
As you can see, the VARCHAR(7999) function beat the pants off of both VARCHAR(MAX) functions. The code in the SplitMAX and Split8K functions is absolutely identical except for the data type of the string parameter. Like I said, VARCHAR(MAX) sucks the life out of code, and if you can avoid using it, it makes for regular "Martha Stewart Moments".
Of course, you should also avoid storing delimited and freetext data that contains important data elements in the database to begin with. 😉
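If you have any say in the schema, a child table makes the account codes first-class data instead of text that has to be mined (the table and column names here are just placeholders):

CREATE TABLE dbo.NoteAccountCode
(
    NoteID      INT     NOT NULL, -- points back to the row holding the free-form text
    AccountCode CHAR(8) NOT NULL,
    CONSTRAINT PK_NoteAccountCode PRIMARY KEY CLUSTERED (NoteID, AccountCode),
    CONSTRAINT CK_NoteAccountCode_Format
        CHECK (AccountCode LIKE '6' + REPLICATE('[0-9]',7))
)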
--Jeff Moden
Change is inevitable... Change for the better is not.
November 15, 2009 at 7:52 am
Almost forgot... even though the CROSS APPLY works well, if you want to shave another 30-40% off the CPU usage, do the split directly in code instead of using a function... I've added such direct code to the end of the test code below...
DECLARE @LinesToSplit INT
SELECT @LinesToSplit = 10000
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(MAX)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Original non-optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ItemID,
split.Item
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(7999)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.Split8K(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Optimized direct split code =========='
SET STATISTICS TIME ON
SELECT csvtab.SomeID,
SUBSTRING(','+csvtab.SomeCsv,t.N+1,CHARINDEX(',',csvtab.SomeCsv+',',t.N)-t.N) AS SplitValue
FROM dbo.Tally t
CROSS JOIN dbo.JBMTest csvtab
WHERE N <= LEN(csvtab.SomeCsv)+1
AND SUBSTRING(','+csvtab.SomeCsv,t.N,1) = ','
AND SomeID <= @LinesToSplit
ORDER BY csvtab.SomeID, t.N
SET STATISTICS TIME OFF
... and here are the results... the direct code blows everything away for this test...
================================================================================
========== Optimized VARCHAR(MAX)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 7672 ms, elapsed time = 9402 ms.
================================================================================
========== Original non-optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 9343 ms, elapsed time = 11209 ms.
================================================================================
========== Optimized VARCHAR(7999)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 1094 ms, elapsed time = 2554 ms.
================================================================================
========== Optimized direct split code ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 453 ms, elapsed time = 2041 ms.
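And, to bring this back around to the original question... the same pattern filter from Lutz's solution drops right into any of these splitters. For example, with the Split8K function (just a sketch... substitute your own table and column names):

SELECT t.SomeID,
       STUFF((SELECT ', ' + s.ElementValue
              FROM dbo.Split8K(t.SomeText, ' ') s
              WHERE s.ElementValue LIKE '6' + REPLICATE('[0-9]', 7)
              FOR XML PATH('')
             ), 1, 2, '') AS AccountCodes
FROM dbo.YourTable t
-- rows with no account codes come back as NULL... filter them in an outer SELECT if need be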
--Jeff Moden
Change is inevitable... Change for the better is not.