November 12, 2009 at 10:43 pm
Hi All
I have a text field in a table that allows free-form text.
The users will often put 8-digit account codes in amongst the text in that field. Each record may contain 0, 1, 2 or 3 account codes. There may also be other numeric codes that do not conform to the account code standard.
I also know that all account codes start with a "6".
Please, can anyone help me with some code to extract these account codes?
Many thanks
K
November 12, 2009 at 11:50 pm
What is the data type of the "field"? And, yes, it will make a difference.
Also, what is the maximum length of the data in the "field"? It could make a difference.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 3:33 am
Hi Jeff
It is a varchar field of length 90.
regards
K
November 14, 2009 at 4:25 am
Would you please provide a (few) sample line(s) and your expected result?
It would make it easier for some of us than talking plain theory...
November 14, 2009 at 5:56 am
Hi Lutz
3 example records to be filtered:
64112358 and 65888999
64199855 & 88859975 & 65889972
there are 4984118 and X4481566 and 69982214
--------
The expected results are:
64112358, 65888999
64199855, 65889972
69982214
---------
Any help with the above examples would be most appreciated.
regards
K
November 14, 2009 at 8:05 am
Attached please find some sample code.
Some notes:
If you already have a unique value (like an ID) assigned to each row, the first CTE can be omitted.
Also, if you can guarantee that each row will have at least one matching value, you could use the 3rd CTE as your final output, omitting the last SELECT.
Finally, I don't know how many rows you're going to process. It might help performance to store the results from cte2 in an (intermediate) temp table (see the sketch after the sample code below).
If you don't have a string split function yet, please search this site for one. There are several versions available.
SET NOCOUNT ON

DECLARE @t TABLE (col VARCHAR(90))
INSERT INTO @t SELECT '64112358 and 65888999'
INSERT INTO @t SELECT '64199855 & 88859975 & 65889972'
INSERT INTO @t SELECT 'there are 4984118 and X4481566 and 69982214'
INSERT INTO @t SELECT '699821E1' -- added by me to get an empty row

-- assign an ID to each string; might be obsolete if an ID is already available
;WITH cte1 AS
(SELECT ROW_NUMBER() OVER(ORDER BY col) AS id, col
 FROM @t t
),
-- split the string using a string split function, applied to each row of the source table
cte2 AS
(SELECT id, Item
 FROM cte1
 CROSS APPLY dbo.DelimitedSplit(cte1.col, ' ')
),
-- re-concatenate the matching numbers only; might result in NULL values
-- if a row doesn't have a matching number in it
cte3 AS
(SELECT STUFF(
        (SELECT ', ' + Item
         FROM cte2 t2
         WHERE t2.id = t1.id
           AND t2.Item LIKE '6' + REPLICATE('[0-9]', 7)
         FOR XML PATH('')
        ), 1, 2, '') AS result
 FROM cte2 t1
 GROUP BY id
)
-- final SELECT, eliminating NULL values
SELECT * FROM cte3 WHERE result IS NOT NULL
/* result set
64112358, 65888999
64199855, 65889972
69982214*/
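In case you end up processing a lot of rows, here's an untested sketch of the temp table variant I mentioned in the notes above. It reuses the @t sample data and the DelimitedSplit function from the code above (#Split is just a name I picked):

-- materialize the split results once, instead of re-splitting in cte3
SELECT src.id, Item
INTO #Split
FROM (SELECT ROW_NUMBER() OVER(ORDER BY col) AS id, col FROM @t) src
CROSS APPLY dbo.DelimitedSplit(src.col, ' ')

-- same logic as cte3, reading from the temp table instead of re-splitting
SELECT result
FROM (SELECT STUFF(
             (SELECT ', ' + Item
              FROM #Split t2
              WHERE t2.id = t1.id
                AND t2.Item LIKE '6' + REPLICATE('[0-9]', 7)
              FOR XML PATH('')
             ), 1, 2, '') AS result
      FROM #Split t1
      GROUP BY id
     ) x
WHERE result IS NOT NULL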
November 14, 2009 at 1:01 pm
lmu92 (11/14/2009)
If you don't have a string split function yet, please search this site for one. There are several versions available.
Heh... Lutz... why don't you just provide the one you used to test the code above with? 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 2:06 pm
Well, I didn't post it for three reasons:
1) The function I'm using is mostly copied from one of the articles I've referred to several dozen times in other posts, so I felt I shouldn't post code I didn't come up with :blush:,
2) this function uses a Tally table, and I didn't really feel like posting the code to build that one since it would be "slightly off topic", and
3) the OP now has a solution to his/her problem and I thought he/she could spend some time doing research on string split functions. Besides being fun, trying to understand all their various methods is really educational. 🙂
Anyway, since my "split string tutor" ( 😉 ) is asking me to post what he taught me, here it is:
CREATE FUNCTION [dbo].[DelimitedSplit] (
    @pString    VARCHAR(MAX),
    @pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemID,
       SUBSTRING(@pDelimiter + @pString + @pDelimiter,
                 N + 1,
                 CHARINDEX(@pDelimiter, @pDelimiter + @pString + @pDelimiter, N + 1) - N - 1
                ) AS Item
FROM Tally
WHERE N < LEN(@pDelimiter + @pString + @pDelimiter)
  AND SUBSTRING(@pDelimiter + @pString + @pDelimiter, N, 1) = @pDelimiter
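A quick smoke test against one of the sample strings from above (this assumes a populated Tally table, which the function requires):

SELECT ItemID, Item
FROM dbo.DelimitedSplit('64112358 and 65888999', ' ')
/* returns:
ItemID  Item
1       64112358
2       and
3       65888999 */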
The changes I made to the code you posted in your Tally table article were to make it usable as an iTVF (inline table-valued function).
Side note: The CREATE FUNCTION chapter in BOL is the one that's highlighted when you search for "create FUN" 😀
November 14, 2009 at 3:00 pm
Hi Lutz, Jeff
Thanks very much for your help.
You have introduced me to some new concepts and techniques.
I will give it a go, and see how it pans out.
Thanks again.
K
November 14, 2009 at 3:20 pm
You're welcome, K... if you really want to learn how the Tally table works to replace WHILE loops like Lutz did above, click on the TALLY table link in his signature line.
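In the meantime, here's one common way to build the Tally table that the code in this thread depends on (the row count and names are just the ones I normally use):

--===== Create a Tally table of the numbers 1 through 11000
SELECT TOP 11000
       IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2

--===== Add a clustered primary key to keep lookups fast
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)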
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 3:42 pm
Jeff Moden (11/14/2009)
You're welcome, K... if you really want to learn how the Tally table works to replace WHILE loops like Lutz did above, click on the TALLY table link in his signature line.
Hey Jeff, how does it feel to point somebody to your own article referenced in the signature of someone else? 😀
Note: How did I do in terms of changing your string split function? Passed or failed?
November 14, 2009 at 4:42 pm
@Jeff:
Well, I didn't post it for three reasons:
Heh... no worries, Lutz. It just struck me as odd because you normally cough up a whole solution. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2009 at 4:45 pm
... and the split function is just fine. I've tweaked it with some optimizations over time... I'll be right back to show you...
--Jeff Moden
Change is inevitable... Change for the better is not.
November 15, 2009 at 7:16 am
Here are the optimizations I was talking about. There are two different types of optimizations... the first is to reduce the number of concatenations and other calculations as much as possible. You'll see that in the function code I've included.
The second "optimization" isn't really an optimization... it's a realization of how much the use of VARCHAR(MAX) sucks the life out of code. My speculation was that VARCHAR(MAX) is always treated as an "out of row" datatype, but that was just speculation. Others have insisted that VARCHAR(MAX) will be "in row" if it fits. The bottom line is, whatever... VARCHAR(MAX) performs much more slowly than otherwise identical code that uses VARCHAR(8000). In English, don't use VARCHAR(MAX) if VARCHAR(8000) will do.
To commence, here's my normal million-row test table with some code that builds 6 randomly generated INTs separated by commas... the rest of the columns are mostly there because this code is used for a lot of different tests, and they also give a little bulk to the table as if it were a real table... for all those who wish to "play along", this doesn't take very long to build (49 seconds on my 7-year-old, single 1.8 GHz desktop box).
--===== Create and populate a 1,000,000 row test table.
-- Column "RowNum" has a range of 1 to 100,000 unique numbers
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeCSV" contains a random 6 element CSV.
-- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 1000000
SomeID = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST(
CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))+','
+ CAST(ABS(CHECKSUM(NEWID()))%2000000000 AS VARCHAR(10))
AS VARCHAR(8000)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1
CROSS JOIN
Master.dbo.SysColumns t2
--===== Create a clustered primary key.
-- Takes about 1 second to execute.
ALTER TABLE dbo.JBMTest
ADD PRIMARY KEY CLUSTERED (SomeID)
Here's Lutz's good function... notice the concatenation that puts a delimiter on both ends of the data being split. There's also a fair bit of unit addition and subtraction on "N". That's the way I used to do it, as well. It does make things a little bit easier to understand if you have to troubleshoot the code, but it also makes it a fair bit slower over, say, 10,000 rows of data to split. Also notice that the string to be split is declared as a VARCHAR(MAX)...
--===== Lutz' Splitter Function from above...
CREATE FUNCTION [dbo].[DelimitedSplit] (
    @pString    VARCHAR(MAX),
    @pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
SELECT ROW_NUMBER() OVER (ORDER BY N) AS ItemID,
       SUBSTRING(@pDelimiter + @pString + @pDelimiter,
                 N + 1,
                 CHARINDEX(@pDelimiter, @pDelimiter + @pString + @pDelimiter, N + 1) - N - 1
                ) AS Item
FROM Tally
WHERE N < LEN(@pDelimiter + @pString + @pDelimiter)
  AND SUBSTRING(@pDelimiter + @pString + @pDelimiter, N, 1) = @pDelimiter
Here's virtually the same code with the optimizations to reduce the concatenation and the unit math on "N". This works because you really only need to look at one delimiter at a time, depending on where you are in the code. If you're in the SELECT list, you only need to look at the trailing comma. In the WHERE clause, you only need to look at the leading comma. The code has been optimized to take both of those notions into account, which also reduces the unit math on "N".
CREATE FUNCTION dbo.SplitMAX
/*****************************************************************************************
Purpose:
High speed "Inline" Table Value Function (ITVF) accepts a delimited string and a single
character delimiter and returns the element number (position) and element value as a
table suitable for use in a FROM clause or a CROSS APPLY.
Dependencies:
Requires a "Tally" table with at least the same number of values as the string to be
split. Please see the following URL for what a Tally table is, how to build one, and
how it replaces a WHILE Loop for many things.
Programmer's Note:
This function uses VARCHAR(MAX) as the input string parameter which automatically makes
this function about 3-8 times slower than the VARCHAR(7999) version even though the code
to do the split is identical.
--Jeff Moden
*****************************************************************************************/
--===== Declare the I/O for this function
(
@pString VARCHAR(MAX),
@pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
(
SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS ElementNumber,
SUBSTRING(@pDelimiter+@pString,
t.N+1,
CHARINDEX(@pDelimiter,@pString+@pDelimiter,t.N)-t.N) AS ElementValue
FROM dbo.Tally t
WHERE N <= LEN(@pString)+1
AND SUBSTRING(@pDelimiter+@pString,t.N,1) = @pDelimiter
)
If we now run both splitters on the same 10,000 rows of data, we'll see that those optimizations produce almost a 2 second performance gain... that's quite a bit when we're only talking 10,000 rows and CPU usage of less than 10 seconds for either method (again, on my box)....
DECLARE @LinesToSplit INT
SELECT @LinesToSplit = 10000
PRINT REPLICATE('=',80)
PRINT '========== Optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Original non-optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ItemID,
split.Item
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
Results from code above...
================================================================================
========== Optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 7813 ms, elapsed time = 9325 ms.
================================================================================
========== Original non-optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 9656 ms, elapsed time = 11506 ms.
If the only thing that we change in the "SplitMAX" function is the datatype of the string parameter (and the name of the function to match), the code would look like this...
CREATE FUNCTION dbo.Split8K
/*****************************************************************************************
Purpose:
High speed "Inline" Table Value Function (ITVF) accepts a delimited string and a single
character delimiter and returns the element number (position) and element value as a
table suitable for use in a FROM clause or a CROSS APPLY.
Dependencies:
Requires a "Tally" table with at least 8000 values. Please see the following URL for
what a Tally table is, how to build one, and how it replaces a WHILE Loop for many things.
Programmer's Note:
This function uses VARCHAR(7999) as the input string parameter which automatically makes
this function about 3-8 times faster than the VARCHAR(MAX) version even though the code
to do the split is identical.
--Jeff Moden
*****************************************************************************************/
--===== Declare the I/O for this function
(
@pString VARCHAR(7999),
@pDelimiter CHAR(1)
)
RETURNS TABLE
AS
RETURN
(
SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS ElementNumber,
SUBSTRING(@pDelimiter+@pString,
t.N+1,
CHARINDEX(@pDelimiter,@pString+@pDelimiter,t.N)-t.N) AS ElementValue
FROM dbo.Tally t
WHERE N <= LEN(@pString)+1
AND SUBSTRING(@pDelimiter+@pString,t.N,1) = @pDelimiter
)
GO
Now, let's run all 3 functions on the same 10,000 rows of data and see what we come up with...
DECLARE @LinesToSplit INT
SELECT @LinesToSplit = 10000
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(MAX)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Original non-optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ItemID,
split.Item
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(7999)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.Split8K(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
And, here are the results from that...
================================================================================
========== Optimized VARCHAR(MAX)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 7844 ms, elapsed time = 9451 ms.
================================================================================
========== Original non-optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 9359 ms, elapsed time = 11358 ms.
================================================================================
========== Optimized VARCHAR(7999)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 1094 ms, elapsed time = 2500 ms.
As you can see, the VARCHAR(7999) function beat the pants off of both VARCHAR(MAX) functions. The code in the SplitMAX and Split8K functions is absolutely identical except for the data type of the string parameter. Like I said, VARCHAR(MAX) sucks the life out of code, and if you can avoid using it, it makes for regular "Martha Stewart Moments".
Of course, you should also avoid storing delimited and freetext data that contains important data elements in the database to begin with. 😉
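If you have any say in the schema, a child table makes the account codes first-class data instead of text that has to be mined (the table and column names here are just placeholders):

CREATE TABLE dbo.NoteAccountCode
(
    NoteID      INT     NOT NULL, -- points back to the row holding the free-form text
    AccountCode CHAR(8) NOT NULL,
    CONSTRAINT PK_NoteAccountCode PRIMARY KEY CLUSTERED (NoteID, AccountCode),
    CONSTRAINT CK_NoteAccountCode_Format
        CHECK (AccountCode LIKE '6' + REPLICATE('[0-9]',7))
)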
--Jeff Moden
Change is inevitable... Change for the better is not.
November 15, 2009 at 7:52 am
Almost forgot... even though the CROSS APPLY works well, if you want to shave another 30-40% off the CPU usage, do the split directly in code instead of using a function... I've added such direct code to the end of the test code below...
DECLARE @LinesToSplit INT
SELECT @LinesToSplit = 10000
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(MAX)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.SplitMAX(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Original non-optimized function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ItemID,
split.Item
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.DelimitedSplit(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Optimized VARCHAR(7999)function =========='
SET STATISTICS TIME ON
SELECT csvsrc.SomeID,
split.ElementNumber,
split.ElementValue
FROM dbo.JBMTest csvsrc
CROSS APPLY dbo.Split8K(csvsrc.SomeCsv,',') split
WHERE csvsrc.SomeID <= @LinesToSplit
SET STATISTICS TIME OFF
PRINT REPLICATE('=',80)
PRINT '========== Optimized direct split code =========='
SET STATISTICS TIME ON
SELECT csvtab.SomeID,
SUBSTRING(','+csvtab.SomeCsv,t.N+1,CHARINDEX(',',csvtab.SomeCsv+',',t.N)-t.N) AS SplitValue
FROM dbo.Tally t
CROSS JOIN dbo.JBMTest csvtab
WHERE N <= LEN(csvtab.SomeCsv)+1
AND SUBSTRING(','+csvtab.SomeCsv,t.N,1) = ','
AND SomeID <= @LinesToSplit
ORDER BY csvtab.SomeID, t.N
SET STATISTICS TIME OFF
... and here are the results... the direct code blows everything away for this test...
================================================================================
========== Optimized VARCHAR(MAX)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 7672 ms, elapsed time = 9402 ms.
================================================================================
========== Original non-optimized function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 9343 ms, elapsed time = 11209 ms.
================================================================================
========== Optimized VARCHAR(7999)function ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 1094 ms, elapsed time = 2554 ms.
================================================================================
========== Optimized direct split code ==========
(60000 row(s) affected)
SQL Server Execution Times:
CPU time = 453 ms, elapsed time = 2041 ms.
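And, to bring this back around to the original question... the same pattern filter from Lutz's solution drops right into any of these splitters. For example, with the Split8K function (just a sketch... substitute your own table and column names):

SELECT t.SomeID,
       STUFF((SELECT ', ' + s.ElementValue
              FROM dbo.Split8K(t.SomeText, ' ') s
              WHERE s.ElementValue LIKE '6' + REPLICATE('[0-9]', 7)
              FOR XML PATH('')
             ), 1, 2, '') AS AccountCodes
FROM dbo.YourTable t
-- rows with no account codes come back as NULL... filter them in an outer SELECT if need be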
--Jeff Moden
Change is inevitable... Change for the better is not.