June 25, 2014 at 6:07 am
we all know Inline Table Value Functions perform at least a couple of order of magnitude faster than a scalar function, but there are some functions I have in my collection that are scalar, that , so far, I don't have an ITVF equivilent for.
I've got a slow running process, that i found is using a scalar function to strip out non numeric characters; my knee jerk reaction was "I'll just replace that and change to a cross apply /outer apply to do the same thing!Q, and then realized i wasn't making any headway on modifying an existing example to become an ITVF.
before i reinvent the wheel, does anyone have one?
here's an example of a scalar function, that is featuring a Tally Table to build the "right" string. run that on a million rows, and it bogs things down a wee bit. :w00t:
IF OBJECT_ID('[dbo].[StripNonNumeric]') IS NOT NULL
DROP FUNCTION [dbo].[StripNonNumeric]
GO
CREATE FUNCTION StripNonNumeric(@OriginalText VARCHAR(8000))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(8000)
;WITH tally (N) as
(SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)
FROM Master.dbo.SysColumns sc1
CROSS JOIN Master.dbo.SysColumns sc2)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
--ascii numbers are 48(for '0') thru 57 (for '9')
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57 THEN SUBSTRING(@OriginalText,Tally.N,1) ELSE '' END
FROM tally WHERE Tally.N <= LEN(@OriginalText)
RETURN @CleanedText
END
Lowell
June 25, 2014 at 8:21 am
Taking your fine code as a starting point I came up with this. I didn't test against a million rows but I think maybe I will do that now.
CREATE FUNCTION StripNonNumeric_itvf(@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select STUFF(
(
SELECT SUBSTRING(@OriginalText, t.N, 1)
FROM tally t
WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
FOR XML PATH('')
), 1 ,0 , '') as CleanedText
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
June 25, 2014 at 9:17 am
smacking myself in the head!
I didn't think of using FOR XML to handle the concatenation; that, sir, is beautiful and inspired.
You'd think i'd know better, so double kudos to you.
i'll test it on my process, but I'm pretty sure it'll be quicker than what is there now.
Lowell
June 25, 2014 at 9:21 am
Lowell (6/25/2014)
smacking myself in the head!I didn't think of using FOR XML to handle the concatenation; that, sir, is beautiful and inspired.
You'd think i'd know better, so double kudos to you.
i'll test it on my process, but I'm pretty sure it'll be quicker than what is there now.
Awesome. Glad that will work. Sometime it is easier to see a solution when aren't familiar with the current process. I have added that to my own toolbox as that could definitely come in handy at some point.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
June 25, 2014 at 11:23 am
Gents
First: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.
on my laptop (Windows 8, 2 cores/4 logical cpu/16GB run SQL Server 2014 Ent) I ran your solution (with results on) and it takes 31 seconds complete (my test harness data below). I tried to improve the time in your query but had no success.
I have an ngrams function that I use for this kind of thing. Below is the function with my notes on how to use it:
IF OBJECT_ID('tempdb.dbo.nGrams8K') IS NOT NULL DROP FUNCTION dbo.nGrams8K
GO
CREATE FUNCTION dbo.nGrams8K (@string varchar(8000), @k int)
/********************************************************************
Created by: Alan Burstein
Created on: 3/10/2014
Last Updated on: 5/22/2015
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
For more information see: http://en.wikipedia.org/wiki/N-gram
Use:
Outputs a stream of tokens based on an input string.
Similar to mdq.nGrams:
http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
Except it only returns characters as long as K.
nGrams8K also includes the position of the "Gram" in the string.
Examples of how to use included as comments after the code.
********************************************************************/
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(N) AS (SELECT 1 FROM (VALUES (null),(null),(null),(null),(null)) x(n)),
E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),
iTally(N) AS
(
SELECT TOP (LEN(@string)-(@k-1)) CHECKSUM(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))
FROM E3 a CROSS JOIN E3 b
)
SELECT
position = N,
token = SUBSTRING(@string,N,@k)
FROM iTally;
GO
/********************************************************************
(1) Basic Use
-----------------------------------------------------------
-- (A) With @k as 1,2 & 3:
SELECT position, token FROM dbo.nGrams8K('abcdefg',1);
SELECT position, token FROM dbo.nGrams8K('abcdefg',2);
SELECT position, token FROM dbo.nGrams8K('abcdefg',3);
-- (B) Using variables
DECLARE
@string varchar(20) = 'abcdefg12345', @tokenLen tinyint = 3;
SELECT position, token FROM dbo.nGrams8K(@string,@tokenLen)
GO
(2) Character Count (including 0 Counts)
-----------------------------------------------------------
-- (A) Basic character count
SELECT token = ISNULL(token,'Total:'), count(*)
FROM dbo.nGrams8K(newid(),1)
GROUP BY token
WITH ROLLUP;-- Added rollup for display
-- (B) Character Account including gaps
DECLARE
@alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ',
@string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dogs.';
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(@alphabet,1) a
LEFT JOIN dbo.nGrams8K(@string,1) b ON a.token=b.token
GROUP BY a.token
ORDER BY a.token; -- not required, for display only
GO
-- (C) Let's try French ordered by most frequent
DECLARE
@alphabet VARCHAR(36)='abcdefghijklmnopqrstuvwxyzéèçëòôöùàâ',
@string VARCHAR(100)='Le renard vert rapide saute par dessus le chien paresseux et le chien paresseux juste posé là.';
WITH charcount AS
(
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(@alphabet,1) a
LEFT JOIN dbo.nGrams8K(@string,1) b ON a.token=b.token
GROUP BY a.token
)
SELECT * FROM charcount ORDER BY ttl DESC;
GO
(3) a *SIMPLE* word count
-----------------------------------------------------------
DECLARE @string varchar(100)='THE QUICK Green FOX JUMPED OVER THE LAZY DOGS BACK';
SELECT @string AS string, count(*)+1 AS words
FROM dbo.nGrams8K(@string,1)
WHERE [token]=' '
GO
(4) search for occurances and location of a substring
-----------------------------------------------------------
DECLARE
@string VARCHAR(100)='The green fox jumps over the dog and the dog just laid there.',
@searchString VARCHAR(100)='The';
SELECT position, token AS searchString
FROM dbo.nGrams8K(@string,LEN(@searchstring)) b
WHERE token=@searchString;
GO
(5) Strip non-numeric characters from a string
-----------------------------------------------------------
DECLARE @string varchar(100) = 'abc123xyz555xxx999!';
WITH stripNonNumeric(cleanstring) AS
(
SELECT token+''
FROM dbo.nGrams8K(@string,1)
WHERE token LIKE '[0-9]'
FOR XML PATH('')
)
SELECT cleanstring
FROM stripNonNumeric;
(6) Find all occurances of a pattern in a string
------------------------------------------------------------
DECLARE
@string varchar(100) = 'zz12x345xxx555abc1234zz5xxx',
@pat varchar(100) = '[a-z][a-z][0-9]',
@len int = 3;
SELECT start_pos = position, token
FROM dbo.nGrams8K(@string,@len)
WHERE token LIKE @pat
SELECT * FROM dbo.findpat8k(@string,@pat,3);
GO
(7) Find the longest common substring between 2 strings
------------------------------------------------------------
-- (A) The Function
IF EXISTS
(
SELECT * FROM INFORMATION_SCHEMA.ROUTINES
WHERE ROUTINE_SCHEMA='dbo' AND ROUTINE_NAME = 'lcss'
)
DROP FUNCTION dbo.lcss;
GO
CREATE FUNCTION dbo.lcss(@string1 varchar(100), @string2 varchar(100))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH iTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM dbo.nGrams8K
(
CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string1
ELSE @string2 END,1
)
)
SELECT TOP (1) with ties token
FROM iTally
CROSS APPLY
dbo.nGrams8K
(
CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string1
ELSE @string2 END, n
)
WHERE N <=
CASE WHEN LEN(@string1)<=LEN(@string2) THEN LEN(@string1)
ELSE LEN(@string2) END
AND charindex
(
token,
CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string2
ELSE @string1 END
) > 0
ORDER BY len(token) DESC;
GO
-- (B) example of how to use lcss
DECLARE
@string1 varchar(100) = '999xxyyyzaa99xxyyyzz',
@string2 varchar(100) = '000xxxyyyzzz';
SELECT string1 = @string1, string2 = @string2, token
FROM dbo.lcss(@string1, @string2);
********************************************************************/
GO
I ran the following 10K Row Test:
SET NOCOUNT ON;
DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
PRINT 'Sean'+char(13);
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf(string);
PRINT 'Alan'+char(13);
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf_ajb(string);
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
... and here's the results:
Sean
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 1, logical reads 63, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31172 ms, elapsed time = 31213 ms.
Alan
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 1, logical reads 63, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 328 ms, elapsed time = 384 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
I have no idea why my NGrams solution is so much faster considering that we are essentially doing the same thing. I included the query plans for both.
Lowell: I hope this helps.
-- Itzik Ben-Gan 2001
June 25, 2014 at 12:27 pm
Alan.B (6/25/2014)
GentsFirst: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.
Nice. I didn't see your test data and haven't had a chance to roll my own yet. I am guessing that the big difference is that my code uses STUFF and yours does not.
If you still have your test harness handy try with this.
ALTER FUNCTION [dbo].[StripNonNumeric_itvf](@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
, clean_text(CleanedText) as
(
SELECT SUBSTRING(@OriginalText, t.N, 1)
FROM tally t
WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
select CleanedText
from clean_text
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
June 25, 2014 at 1:45 pm
Sean Lange (6/25/2014)
Alan.B (6/25/2014)
GentsFirst: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.
Nice. I didn't see your test data and haven't had a chance to roll my own yet. I am guessing that the big difference is that my code uses STUFF and yours does not.
If you still have your test harness handy try with this.
ALTER FUNCTION [dbo].[StripNonNumeric_itvf](@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
, clean_text(CleanedText) as
(
SELECT SUBSTRING(@OriginalText, t.N, 1)
FROM tally t
WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
select CleanedText
from clean_text
What you posted was getting the same results (31 seconds)... I added this to the where clause:
AND n <= len(@OriginalText)[/code]
Final Version
ALTER FUNCTION [dbo].[StripNonNumeric_itvf_sean2](@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
, clean_text(CleanedText) as
(
SELECT SUBSTRING(@OriginalText, t.N, 1)
FROM tally t
WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
AND n <= len(@OriginalText) -- added by ajb
FOR XML PATH('')
)
select CleanedText
from clean_text
GO
It now runs in Milliseconds, same speed as mine. ~30 seconds for a million rows on my system using the test harness I posted earlier.
Edit: Some of my code got turned into HTML tags. Underlined my change.
-- Itzik Ben-Gan 2001
June 25, 2014 at 1:53 pm
Alan.B (6/25/2014)
Sean Lange (6/25/2014)
Alan.B (6/25/2014)
GentsFirst: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.
Nice. I didn't see your test data and haven't had a chance to roll my own yet. I am guessing that the big difference is that my code uses STUFF and yours does not.
If you still have your test harness handy try with this.
ALTER FUNCTION [dbo].[StripNonNumeric_itvf](@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
, clean_text(CleanedText) as
(
SELECT SUBSTRING(@OriginalText, t.N, 1)
FROM tally t
WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
select CleanedText
from clean_text
What you posted was getting the same results (31 seconds)... I added this to the where clause:
AND n <= len(@OriginalText)
Final Version
ALTER FUNCTION [dbo].[StripNonNumeric_itvf_sean2](@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
, clean_text(CleanedText) as
(
SELECT SUBSTRING(@OriginalText, t.N, 1)
FROM tally t
WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
AND n <= len(@OriginalText) -- added by ajb
FOR XML PATH('')
)
select CleanedText
from clean_text
GO
It now runs in Milliseconds, same speed as mine. ~30 seconds for a million rows on my system using the test harness I posted earlier.
Edit: Some of my code got turned into HTML tags.
That makes sense. No point evaluating all the 10,000 rows when they won't meet the condition anyway. I suspect if we changed my original (with the STUFF) it would perform much the same with the inclusion of the additional predicate in the where clause. Awesome work Alan.
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
June 25, 2014 at 2:17 pm
That makes sense. No point evaluating all the 10,000 rows when they won't meet the condition anyway. I suspect if we changed my original (with the STUFF) it would perform much the same with the inclusion of the additional predicate in the where clause. Awesome work Alan.
Thanks Sean!
-- Itzik Ben-Gan 2001
June 25, 2014 at 2:19 pm
Hi
Interestingly enough I tried a similar test on 2012 for the originals and got elapsed times that so close as to be the same. It may have been the strings I was using though.
Bring on a PATREPLACE function 😀
ChrisM's pattern splitter from Dwain's article [/url]may also be able to be used here
select cleanedstring
FROM #strings s
CROSS APPLY (
SELECT Item + '' FROM PatternSplitCM(string,'[^0-9]') WHERE MATCHED = 0 FOR XML PATH ('')
) x (cleanedstring)
October 27, 2014 at 3:07 pm
mickyT (6/25/2014)
HiInterestingly enough I tried a similar test on 2012 for the originals and got elapsed times that so close as to be the same. It may have been the strings I was using though.
Bring on a PATREPLACE function 😀
ChrisM's pattern splitter from Dwain's article [/url]may also be able to be used here
select cleanedstring
FROM #strings s
CROSS APPLY (
SELECT Item + '' FROM PatternSplitCM(string,'[^0-9]') WHERE MATCHED = 0 FOR XML PATH ('')
) x (cleanedstring)
I recently came up with an iTVF PatExclude function that removes characters based on a pattern and then I remembered this post. Check out the examples in the code...
CREATE FUNCTION dbo.PatExclude8K
(@string varchar(8000),
@exclude_pattern varchar(50)
)
RETURNS TABLE
AS
/*
-- remove letters
SELECT * FROM dbo.PatExclude8K('abc123!', '[a-z]');
-- remove numbers
SELECT * FROM dbo.PatExclude8K('abc123!', '[0-9]');
-- only include letters and numbers
SELECT * FROM dbo.PatExclude8K('###abc123!!!', '[^0-9a-z]');
-- Remove spaces
SELECT * FROM dbo.PatExclude8K('XXX 123 ZZZ', '['+char(32)+']');
-- only include letters and "!, ? or ."
SELECT * FROM dbo.PatExclude8K('123# What?!... ', '[^a-z!?.]')
*/
RETURN
WITH
E1(n) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b), --10E+2 or 100 rows
E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b), --10E+4 or 10,000 rows max
iTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
stringToTable AS
(
SELECT TOP (len(@string)) n, s = substring(@string,n,1)
FROM iTally
WHERE n <= len(@string)
),
newstring(newstring) AS
(
SELECT s+''
FROM stringToTable
WHERE s NOT LIKE @exclude_pattern
ORDER BY n
FOR XML PATH('')
)
SELECT newstring
FROM newstring;
GO
Next I tested it against PatternSplitCM for removing non-numeric....
100K row test harness:
IF OBJECT_ID('tempdb..#val') IS NOT NULL DROP TABLE #val;
CREATE TABLE #val (txt varchar(36) primary key);
INSERT INTO #val
SELECT TOP (100000) NEWID()
FROM sys.all_columns a CROSS JOIN sys.all_columns b;
GO
dbcc freeproccache;
dbcc dropcleanbuffers;
DECLARE @string varchar(50);
SET STATISTICS TIME ON;
PRINT 'PatExcludeAB:';
SELECT @string = newstring
FROM #val
CROSS APPLY dbo.PatExclude8K(txt, '[^0-9]');
PRINT 'PatternSplitCM:';
select @string = cleanedstring
FROM #val s
CROSS APPLY
(
SELECT Item + ''
FROM dbo.PatternSplitCM(txt,'[^0-9]')
WHERE MATCHED = 0 FOR XML PATH ('')
) x (cleanedstring);
SET STATISTICS TIME OFF;
GO
Results:
PatExcludeAB:
SQL Server Execution Times:
CPU time = 2250 ms, elapsed time = 2271 ms.
PatternSplitCM:
SQL Server Execution Times:
CPU time = 7109 ms, elapsed time = 7274 ms[/code]
-- Itzik Ben-Gan 2001
October 27, 2014 at 4:17 pm
Out of all of these submittals, which post has the fastest code? I'd check for myself but I'm on the run right now but want to check the "best" against an old script that I have for this.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 27, 2014 at 10:18 pm
Jeff Moden (10/27/2014)
Out of all of these submittals, which post has the fastest code? I'd check for myself but I'm on the run right now but want to check the "best" against an old script that I have for this.
Basically we have four:
The first solution Sean posted, my solution using nGrams, Micky's using paternSplitCM and the one I posted today. I wrapped them all in functions:
-----------------------------------------------------------------------
-- (1) Sean's origninal solution
-----------------------------------------------------------------------
CREATE FUNCTION dbo.StripNonNumeric_itvf(@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
Tally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select STUFF(
(
SELECT SUBSTRING(@OriginalText, t.N, 1)
FROM tally t
WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57
FOR XML PATH('')
), 1 ,0 , '') as CleanedText;
-----------------------------------------------------------------------
-- (2) My Solution using nGrams8K
-----------------------------------------------------------------------
;-- nGrams
CREATE FUNCTION [dbo].[nGrams8K]
(
@string VARCHAR(8000),
@n TINYINT,
@pad BIT=0
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(n)),
E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),
iTally(n) AS
(
SELECT TOP (len(@string)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4
),
NewString(NewString) AS
(
SELECTREPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)+@string+
REPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)
)
SELECT n AS sequence,
SUBSTRING(NewString,n,@n) AS token
FROM iTally
CROSS APPLY NewString
WHERE n < ((@n)+LEN(@string));
GO
-- stripNonNumeric_itvf_ajb
CREATE FUNCTION dbo.StripNonNumeric_itvf_ajb(@OriginalText VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
return
WITH ngrams AS
(
SELECT n = [sequence],
c = token
FROM dbo.nGrams8K(@OriginalText,1,1)
),
clean_txt(CleanedText) AS
(
SELECT c+''
FROM ngrams
WHERE ascii(substring(@OriginalText,n,1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
SELECT CleanedText
FROM clean_txt;
GO
-----------------------------------------------------------------------
-- (3) Solution Using PatExclude8K
-----------------------------------------------------------------------
CREATE FUNCTION dbo.PatExclude8K
(@string varchar(8000),
@exclude_pattern varchar(50)
)
RETURNS TABLE
AS
RETURN
WITH
E1(n) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b), --10E+2 or 100 rows
E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b), --10E+4 or 10,000 rows max
iTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
stringToTable AS
(
SELECT TOP (len(@string)) n, s = substring(@string,n,1)
FROM iTally
WHERE n <= len(@string)
),
newstring(newstring) AS
(
SELECT s+''
FROM stringToTable
WHERE s NOT LIKE @exclude_pattern
ORDER BY n
FOR XML PATH('')
)
SELECT newstring
FROM newstring;
GO
-----------------------------------------------------------------------
-- (4) Using PatternSplitCM
-----------------------------------------------------------------------
CREATE FUNCTION dbo.StripNonNumeric_itvf_pscm(@OriginalText VARCHAR(8000))
RETURNS TABLE
--WITH SCHEMABINDING AS
return
WITH newstring(newstring)
AS
(
SELECT item+''
FROM PatternSplitCM(@OriginalText, '[^0-9]')
WHERE [matched] = 0
FOR XML PATH('')
)
SELECT newstring
FROM newstring;
GO
-----------------------------------------------------------------------
-- Make sure they work
-----------------------------------------------------------------------
SELECT *
FROM dbo.StripNonNumeric_itvf('abc123!');
SELECT *
FROM dbo.StripNonNumeric_itvf_pscm('abc123!');
SELECT *
FROM dbo.StripNonNumeric_itvf_ajb('abc123!');
SELECT *
FROM dbo.PatExclude8K('abc123!', '[^0-9]');
then created this test harness...
-----------------------------------------------------------------------
-- 100K ROW TEST HARNESS
-----------------------------------------------------------------------
IF OBJECT_ID('tempdb..#val') IS NOT NULL DROP TABLE #val;
CREATE TABLE #val (txt varchar(36) primary key);
INSERT INTO #val
SELECT TOP (100000) NEWID()
FROM sys.all_columns a CROSS JOIN sys.all_columns b;
GO
set nocount on;
dbcc freeproccache;
dbcc dropcleanbuffers;
DECLARE @string varchar(50);
SET STATISTICS TIME ON;
PRINT 'Original (skipped; runs 30+ seconds for 10K rows):';
--SELECT CleanedText
--FROM #val
--CROSS APPLY dbo.StripNonNumeric_itvf(txt);
PRINT 'Using PatternSplitCM:';
SELECT @string = newstring
FROM #val
CROSS APPLY dbo.StripNonNumeric_itvf_pscm(txt);
PRINT 'Using nGrams:';
SELECT @string = CleanedText
FROM #val
CROSS APPLY dbo.StripNonNumeric_itvf_ajb(txt);
PRINT 'Using PatExclude8K:';
SELECT @string = newstring
FROM #val
CROSS APPLY dbo.PatExclude8K(txt, '[^0-9]');
SET STATISTICS TIME OFF;
GO
the results...
Original (skipped; runs 30+ seconds for 10K rows):
Using PatternSplitCM:
SQL Server Execution Times:
CPU time = 7187 ms, elapsed time = 7239 ms.
Using nGrams:
SQL Server Execution Times:
CPU time = 2594 ms, elapsed time = 2591 ms.
PatExclude8K:
SQL Server Execution Times:
CPU time = 2266 ms, elapsed time = 2276 ms.
The results were pretty much the same when tested against hot or cold cache.
Note: the first 2 functions are the original code from a couple months back.
Edit: fixed print statement, added note
-- Itzik Ben-Gan 2001
October 28, 2014 at 12:34 am
I took the two contenders from your good post, Alan, and added on of my own. Here's the code for the function I use for such things. Yeah... you'll be shocked. It's not only a scalar function but it also has a WHILE loop in it.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[CleanString]
/*******************************************************************************
Purpose:
Given a string and a pattern of characters to remove, remove the patterned
characters from the string.
Usage:
--===== Basic Syntax Example
SELECT CleanedString = dbo.CleanString(@pSomeString,@pPattern)
;
--===== Remove all but Alpha characters
SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^A-Za-z]%');
FROM dbo.SomeTable st
;
--===== Remove all but Numeric digits
SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^0-9]%');
FROM dbo.SomeTable st
;
Programmer Notes:
1. @pPattern is case sensitive.
2. The pattern set of characters must be for just one character.
Revision History:
Rev 00 - Circa 2007 - George Mastros?
- Initial find on the web
Rev 01 - 29 Mar 2007 - Jeff Moden
- Optimize to remove one instance of PATINDEX from the loop.
- Add code to use the pattern as a parameter.
Rev 02 - 26 May 2013 - Jeff Moden
- Add case sensitivity
*******************************************************************************/
(@pString VARCHAR(8000),@pPattern VARCHAR(100))
RETURNS VARCHAR(8000) AS
BEGIN
DECLARE @Pos SMALLINT;
SELECT @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);
WHILE @Pos > 0
SELECT @pString = STUFF(@pString,@Pos,1,''),
@Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);
RETURN @pString;
END
;
Here's the test harness that I used. It runs each function through the 100K row table 3 times.
--===== Create the 100K row test table
IF OBJECT_ID('tempdb..#val') IS NOT NULL DROP TABLE #val
;
SELECT TOP 100000
txt = ISNULL(CONVERT(VARCHAR(36),NEWID()),'')
INTO #val
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
ALTER TABLE #Val
ADD PRIMARY KEY CLUSTERED (txt)
;
--===== Do the tests. Had to use duration because one
-- of the tests is on the new scalar function and
-- SET STATISTICS doesn't report on those correctly.
GO
PRINT '========== Using nGrams ==========';
DECLARE @String VARCHAR(36)
,@StartTime DATETIME
;
SELECT @StartTime = GETDATE()
;
SELECT @string = CleanedText
FROM #val
CROSS APPLY dbo.StripNonNumeric_itvf_ajb(txt)
;
PRINT DATEDIFF(ms,@StartTime,GETDATE())
;
GO 3
PRINT '========== Using PatExclude8K ==========';
DECLARE @String VARCHAR(36)
,@StartTime DATETIME
;
SELECT @StartTime = GETDATE()
;
SELECT @string = newstring
FROM #val
CROSS APPLY dbo.PatExclude8K(txt, '[^0-9]')
;
PRINT DATEDIFF(ms,@StartTime,GETDATE())
;
GO 3
PRINT '========== Jeff''s Old Scalar Loop Function ==========';
DECLARE @String VARCHAR(36)
,@StartTime DATETIME
;
SELECT @StartTime = GETDATE()
;
SELECT @String = dbo.CleanString(txt,'%[^0-9]%')
FROM #val
;
PRINT DATEDIFF(ms,@StartTime,GETDATE())
;
GO 3
Here are the run results. Again, you'll be shocked. This is the one place where I've not been able to make a Tally Table solution able to beat it. Lord knows I and other good folks have tried.
(100000 row(s) affected)
Beginning execution loop
========== Using nGrams ==========
2916
========== Using nGrams ==========
2893
========== Using nGrams ==========
2890
Batch execution completed 3 times.
Beginning execution loop
========== Using PatExclude8K ==========
2703
========== Using PatExclude8K ==========
2640
========== Using PatExclude8K ==========
2653
Batch execution completed 3 times.
Beginning execution loop
========== Jeff's Old Scalar Loop Function ==========
2413
========== Jeff's Old Scalar Loop Function ==========
2500
========== Jeff's Old Scalar Loop Function ==========
2466
Batch execution completed 3 times.
--Jeff Moden
Change is inevitable... Change for the better is not.
October 28, 2014 at 2:55 am
Here is my attempt from the train journey to work this morning, looks like it will give the while loop a real run for the money.
😎
/********************************************************************
-- Stripping out any non-numerical characters
-- EE 2014-10-28
********************************************************************/
CREATE FUNCTION dbo.STRIP_NUM_EE
(
@INSTR VARCHAR(8000)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH T(N) AS (SELECT N FROM ( VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))
,NUMS(N) AS (SELECT TOP(CONVERT(BIGINT,LEN(@INSTR),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM T T1,T T2,T T3,T T4)
SELECT
(
SELECT
CASE WHEN CONVERT(CHAR(1),(ASCII(SUBSTRING(@INSTR COLLATE Latin1_General_BIN,NM.N,1)) - 48),0)
= SUBSTRING(@INSTR COLLATE Latin1_General_BIN,NM.N,1) THEN SUBSTRING(@INSTR,NM.N,1) END
FROM NUMS NM
FOR XML PATH('')
) AS OUT_STR
;
Test results
Beginning execution loop
========== Using nGrams ==========
1640
========== Using nGrams ==========
1640
========== Using nGrams ==========
1640
Batch execution completed 3 times.
Beginning execution loop
========== Using PatExclude8K ==========
2686
========== Using PatExclude8K ==========
2730
========== Using PatExclude8K ==========
2700
Batch execution completed 3 times.
Beginning execution loop
========== Jeff's Old Scalar Loop Function ==========
1556
========== Jeff's Old Scalar Loop Function ==========
1563
========== Jeff's Old Scalar Loop Function ==========
1546
Batch execution completed 3 times.
Beginning execution loop
========== dbo.STRIP_NUM_EE Function ==========
1490
========== dbo.STRIP_NUM_EE Function ==========
1513
========== dbo.STRIP_NUM_EE Function ==========
1473
Batch execution completed 3 times.
Viewing 15 posts - 1 through 15 (of 45 total)
You must be logged in to reply to this topic. Login to reply