May 9, 2011 at 3:33 pm
I could easily write it in the form of the non-RBAR, but the performance would not be as fast. How many times do we pass through the loop here ? 3 times - the number of words, equals number of letters in the result - a very low number. It's because most of the "looping" is done within SQL's character searching functions like PATINDEX or CHARINDEX. So, there is a case where plain RBAR outperforms non-rbar, and even xml. Not just outperforms, but shows greater functionality (multiseparators), and versatility (works across all versions of sql server since 2000). So, it beats all other solutions in all fields.
Rules are excelent things to know. When you are the master of the rules - break them! (but not before that 😉 )
May 9, 2011 at 5:22 pm
Vedran Kesegic (5/9/2011)
I could easily write it in the form of the non-RBAR, but the performance would not be as fast.
You have test code that proves that, right?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2011 at 5:30 pm
peterzeke (5/6/2011)
I'm trying to figure out how to concatenate the first letter of each word within a given string. It's bit like parsing a delimited list, I'd imagine. Nonetheless, I find an elegant solution a tad elusive.Here's an example:
String = "COMMUNITY GENERAL HOSPITAL"
I'd like to return "CGH"
I've seen various posts/articles that discuss parsing a delimited list into separat rows, but I haven't quite figured out how to return a single row as a concatenated string.
Thanks in advance for assistance.
--Pete
Pete,
Are all "words" guaranteed to have a space character between them? I ask because it's a special case and we can build some code to make it run faster than any splitter code we've used so far.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2011 at 8:20 pm
Jeff Moden (5/9/2011)
peterzeke (5/6/2011)
I'm trying to figure out how to concatenate the first letter of each word within a given string. It's bit like parsing a delimited list, I'd imagine. Nonetheless, I find an elegant solution a tad elusive.Here's an example:
String = "COMMUNITY GENERAL HOSPITAL"
I'd like to return "CGH"
I've seen various posts/articles that discuss parsing a delimited list into separat rows, but I haven't quite figured out how to return a single row as a concatenated string.
Thanks in advance for assistance.
--Pete
Pete,
Are all "words" guaranteed to have a space character between them? I ask because it's a special case and we can build some code to make it run faster than any splitter code we've used so far.
Jeff -- nearly all cases will have a space between the "words"; a few cases might have a comma when the "name" has been flip-flopped:
Example: "Hospital, Community General" instead of "Community General Hospital". (A one-time update to the Dim_HealthProviders table to "fix" the reversed-named descriptions would provide conformity such that only a space exists between the words.)
So, regardless, if you believe you have a batch of SQL custom tailored to space-delimited-only strings that will be "faster than any splitter code" thus far, I'd absolutely love to see it (and I'm sure other people would, too)!
May 9, 2011 at 8:41 pm
Jeff Moden (5/9/2011)
Vedran Kesegic (5/9/2011)
I could easily write it in the form of the non-RBAR, but the performance would not be as fast.You have test code that proves that, right?
If explanation about very few loop iterations was not enough for you, here is a complete test case:
create function dbo.str_FirstLetterOfWords_RBAR
(@string VARCHAR(8000)
) returns varchar(100) as
begin
DECLARE @allowed varchar(100); SET @allowed = 'A-Z0-9' -- characters allowed in the result
DECLARE @i INT; SET @i = 0
DECLARE @result varchar(100)
WHILE @i is not null
BEGIN
SET @result = ISNULL(@result,'')+ISNULL(SUBSTRING(@string,@i+1,1),'')
SET @i = @i + NULLIF(PATINDEX('%[^('+@allowed+')]['+@allowed+']%',SUBSTRING(@string,@i+1,8000)),0)
END
return @result
end
GO
create function dbo.str_FirstLetterOfWords_NONRBAR
(@string VARCHAR(8000)
) returns varchar(100) as
begin
DECLARE @allowed varchar(100); SET @allowed = 'A-Z0-9' -- characters allowed in the result
DECLARE @result varchar(100); SET @result = ''
select @result = @result + case when n.n = 0 or substring(@string,n.n,2) like '%[^('+@allowed+')]['+@allowed+']%' then substring(@string,n.n+1,1) else '' end
from systally n
where n.n < len(@string)
return @result
end
GO
set statistics time on
declare @string VARCHAR(8000) SET @string = REPLICATE('A',2000)+' '+REPLICATE('B',4000)+' '+REPLICATE('C',2000)
PRINT 'NONRBAR: '+dbo.str_FirstLetterOfWords_NONRBAR(@string)
PRINT 'RBAR: '+dbo.str_FirstLetterOfWords_RBAR(@string)
Result:
NONRBAR 7ms
RBAR 1ms
In this example we have RBAR that is 700% faster than set operation (non-rbar).
The reason is simple: overhead of row-by-row is not significant here because of low number of loop iterations (iterates per word) compared to high number of internal sql loop iterations (iterates per character).
Most of the time set operations will perform better than rbar, but there are special cases (like the example shown here) where the truth is just the opposite. Person should not stick to the dogma, but understand truly what is one and what is the other method, and what is happening "under the hood". Then the person is free to choose the most appropriate method in a given situation. One may easily predict what will happen with the performance here, but measuring gives you a real proof.
May 9, 2011 at 9:34 pm
Vedran Kesegic (5/9/2011)
Jeff Moden (5/9/2011)
Vedran Kesegic (5/9/2011)
I could easily write it in the form of the non-RBAR, but the performance would not be as fast.You have test code that proves that, right?
If explanation about very few loop iterations was not enough for you, here is a complete test case:
create function dbo.str_FirstLetterOfWords_RBAR
(@string VARCHAR(8000)
) returns varchar(100) as
begin
DECLARE @allowed varchar(100); SET @allowed = 'A-Z0-9' -- characters allowed in the result
DECLARE @i INT; SET @i = 0
DECLARE @result varchar(100)
WHILE @i is not null
BEGIN
SET @result = ISNULL(@result,'')+ISNULL(SUBSTRING(@string,@i+1,1),'')
SET @i = @i + NULLIF(PATINDEX('%[^('+@allowed+')]['+@allowed+']%',SUBSTRING(@string,@i+1,8000)),0)
END
return @result
end
GO
create function dbo.str_FirstLetterOfWords_NONRBAR
(@string VARCHAR(8000)
) returns varchar(100) as
begin
DECLARE @allowed varchar(100); SET @allowed = 'A-Z0-9' -- characters allowed in the result
DECLARE @result varchar(100); SET @result = ''
select @result = @result + case when n.n = 0 or substring(@string,n.n,2) like '%[^('+@allowed+')]['+@allowed+']%' then substring(@string,n.n+1,1) else '' end
from systally n
where n.n < len(@string)
return @result
end
GO
set statistics time on
declare @string VARCHAR(8000) SET @string = REPLICATE('A',2000)+' '+REPLICATE('B',4000)+' '+REPLICATE('C',2000)
PRINT 'NONRBAR: '+dbo.str_FirstLetterOfWords_NONRBAR(@string)
PRINT 'RBAR: '+dbo.str_FirstLetterOfWords_RBAR(@string)
Result:
NONRBAR 7ms
RBAR 1ms
In this example we have RBAR that is 700% faster than set operation (non-rbar).
The reason is simple: overhead of row-by-row is not significant here because of low number of loop iterations (iterates per word) compared to high number of internal sql loop iterations (iterates per character).
Most of the time set operations will perform better than rbar, but there are special cases (like the example shown here) where the truth is just the opposite. Person should not stick to the dogma, but understand truly what is one and what is the other method, and what is happening "under the hood". Then the person is free to choose the most appropriate method in a given situation. One may easily predict what will happen with the performance here, but measuring gives you a real proof.
You claim 700% faster yet I see no test data. Where's the test data? And please stop lecturing about dogma.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 9, 2011 at 9:38 pm
peterzeke (5/9/2011)
Jeff Moden (5/9/2011)
peterzeke (5/6/2011)
I'm trying to figure out how to concatenate the first letter of each word within a given string. It's bit like parsing a delimited list, I'd imagine. Nonetheless, I find an elegant solution a tad elusive.Here's an example:
String = "COMMUNITY GENERAL HOSPITAL"
I'd like to return "CGH"
I've seen various posts/articles that discuss parsing a delimited list into separat rows, but I haven't quite figured out how to return a single row as a concatenated string.
Thanks in advance for assistance.
--Pete
Pete,
Are all "words" guaranteed to have a space character between them? I ask because it's a special case and we can build some code to make it run faster than any splitter code we've used so far.
Jeff -- nearly all cases will have a space between the "words"; a few cases might have a comma when the "name" has been flip-flopped:
Example: "Hospital, Community General" instead of "Community General Hospital". (A one-time update to the Dim_HealthProviders table to "fix" the reversed-named descriptions would provide conformity such that only a space exists between the words.)
So, regardless, if you believe you have a batch of SQL custom tailored to space-delimited-only strings that will be "faster than any splitter code" thus far, I'd absolutely love to see it (and I'm sure other people would, too)!
Perfect. Thanks, Pete. I'll have to take a gander at this tomorrow.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 10, 2011 at 1:03 am
You claim 700% faster yet I see no test data.
Jeff, did you run the script ?
May 10, 2011 at 7:42 am
Vedran Kesegic (5/10/2011)
You claim 700% faster yet I see no test data.
Jeff, did you run the script ?
Yep... 1 row does not constitute adequate test data. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
May 10, 2011 at 8:51 am
Vedran Kesegic (5/9/2011)
If explanation about very few loop iterations was not enough for you, here is a complete test case:
1 Row of test data simply doesn't constitute a "complete" test case.
Your RBAR code is very well done and quite fast. But the code that you claimed is "set based" is not set-based by any stretch of the imagination. Because of your misunderstanding of what set based is, you've written the code as a scalar function which is a form of Hidden RBAR.
One may easily predict what will happen with the performance here, but measuring gives you a real proof.
I agree... one REAL test is better than a thousand expert opinions. With that thought in mind, lets do some more significant testing which will also test the scalar interfaces you've built.
Here's the code to build my standard of a million rows of test data:
--===== Do this in a nice safe place that everyone has.
USE tempdb;
GO
--===== Conditionally drop the test table to make reruns easier in SSMS
IF OBJECT_ID('tempdb..#Names','U') IS NOT NULL DROP TABLE #Names;
--===== This does a cross join between the name parts to produce more than the original 19,997 rows.
-- In fact, it creates a million rows. Because MiddleName is frequently null, there will be dupes
-- but that's also part of the test.
SELECT RowNum = IDENTITY(INT,1,1),
String = RTRIM(ISNULL(fname.FirstName+' ','')+ISNULL(mname.MiddleName+' ','')+ISNULL(lname.LASTNAME,''))
INTO #Names
FROM (SELECT TOP 100 FirstName FROM AdventureWorks.Person.Contact) fname
CROSS JOIN
(SELECT TOP 100 MiddleName FROM AdventureWorks.Person.Contact) mname
CROSS JOIN
(SELECT TOP 100 LastName FROM AdventureWorks.Person.Contact) lname
;
Now, let's build your two functions and test them against the million row test table:
--===== Do this in a nice safe place that everyone has.
USE tempdb;
GO
--===== Create Vedran's RBAR function
IF OBJECT_ID('tempdb.dbo.str_FirstLetterOfWords_RBAR') IS NOT NULL
DROP FUNCTION dbo.str_FirstLetterOfWords_RBAR
;
GO
create function dbo.str_FirstLetterOfWords_RBAR
( @string VARCHAR(8000)
) returns varchar(100) as
begin
DECLARE @allowed varchar(100); SET @allowed = 'A-Z0-9' -- characters allowed in the result
DECLARE @i INT; SET @i = 0
DECLARE @result varchar(100)
WHILE @i is not null
BEGIN
SET @result = ISNULL(@result,'')+ISNULL(SUBSTRING(@string,@i+1,1),'')
SET @i = @i + NULLIF(PATINDEX('%[^('+@allowed+')]['+@allowed+']%',SUBSTRING(@string,@i+1,8000)),0)
END
return @result
end
GO
--===== Create the function Vedran claims to be "Non-RBAR" but actually is because it's a scalar function
IF OBJECT_ID('tempdb.dbo.str_FirstLetterOfWords_NONRBAR') IS NOT NULL
DROP FUNCTION dbo.str_FirstLetterOfWords_NONRBAR
;
GO
create function dbo.str_FirstLetterOfWords_NONRBAR
(@string VARCHAR(8000)
) returns varchar(100) as
begin
DECLARE @allowed varchar(100); SET @allowed = 'A-Z0-9' -- characters allowed in the result
DECLARE @result varchar(100); SET @result = ''
select @result = @result + case when n.n = 0 or substring(@string,n.n,2) like '%[^('+@allowed+')]['+@allowed+']%' then substring(@string,n.n+1,1) else '' end
from tally n --Changed this to match the name of my Tally Table
where n.n < len(@string)
return @result
end
GO
--=====================================================================================================================
-- Give Vedran's functions a chance by using them to populate a table instead of the overhead of printing to
-- the screen. This would be more like what real-life would demand from a million row conversion.
-- We also measure simple duration because setting STATISTICS ON just kills performance when RBAR is present.
--=====================================================================================================================
--===== Test the RBAR function ========================================================================================
IF OBJECT_ID('tempdb..#Results','U') IS NOT NULL DROP TABLE #Results
;
PRINT '--===== Test the RBAR function ========================================================================================'
;
DECLARE @StartTime DATETIME;
SELECT @StartTime = GETDATE()
;
SELECT String, dbo.str_FirstLetterOfWords_RBAR(String) AS Abbreviation
INTO #Results
FROM #Names
;
PRINT 'Total Duration in Seconds: ' + CAST(DATEDIFF(ss,@StartTime,GETDATE()) AS VARCHAR(10))
;
GO
--===== Test the supposed NON RBAR function ========================================================================================
IF OBJECT_ID('tempdb..#Results','U') IS NOT NULL DROP TABLE #Results
;
PRINT '--===== Test the NON RBAR function ========================================================================================'
;
DECLARE @StartTime DATETIME;
SELECT @StartTime = GETDATE()
;
SELECT String, dbo.str_FirstLetterOfWords_NONRBAR(String) AS Abbreviation
INTO #Results
FROM #Names
;
PRINT 'Total Duration in Seconds: ' + CAST(DATEDIFF(ss,@StartTime,GETDATE()) AS VARCHAR(10))
;
GO
In this example we have RBAR that is 700% faster than set operation (non-rbar).
...
Result:
NONRBAR 7ms
RBAR 1ms
Hmmm... you claim 700%. Here's what I get as a result from those two tests.
[font="Courier New"]--===== Test the RBAR function ========================================================================================
(1000000 row(s) affected)
Total Duration in Seconds: 48
--===== Test the NON RBAR function ========================================================================================
(1000000 row(s) affected)
Total Duration in Seconds: 99 [/font]
That's not a 700% difference by any means. Of course, it doesn't really matter because the "NONRBAR" function you wrote is still RBAR. Let's see what a real set-based function can do. Here's the code to build and test the set based function:
--===== Do this in a nice safe place that everyone has.
USE tempdb;
GO
--===== Build a real set-based function to find the first letter of each word using an iTVF.
IF OBJECT_ID('tempdb.dbo.Abbreviate') IS NOT NULL
DROP FUNCTION dbo.Abbreviate
;
GO
CREATE FUNCTION dbo.Abbreviate
--===== Define I/O parameters
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL -- does away with 0 base CTE, and the OR condition in one go!
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,2) COLLATE Latin1_General_BIN LIKE ' [A-Z0-9]' COLLATE Latin1_General_BIN
)
--===== Do the actual split of each character following a space and concatenate it.
SELECT Abbreviation = (SELECT SUBSTRING(@pString, s.N1,1) FROM cteStart s FOR XML PATH(''))
;
GO
--=====================================================================================================================
-- Now, test the real set based function the same way we tested Vedran's
--=====================================================================================================================
--===== Test the RBAR function ========================================================================================
IF OBJECT_ID('tempdb..#Results','U') IS NOT NULL DROP TABLE #Results
;
PRINT '--===== Test the real set-based function ========================================================================================'
;
DECLARE @StartTime DATETIME;
SELECT @StartTime = GETDATE()
;
SELECT String, abbv.Abbreviation
INTO #Results
FROM #Names n
CROSS APPLY dbo.Abbreviate(n.String) abbv
;
PRINT 'Total Duration in Seconds: ' + CAST(DATEDIFF(ss,@StartTime,GETDATE()) AS VARCHAR(10))
;
GO
Here's the result I get from that bit of computational heaven:
[font="Courier New"]--===== Test the real set-based function ========================================================================================
(1000000 row(s) affected)
Total Duration in Seconds: 37 [/font]
Let's look at those results altogether to make it easier to compare:
[font="Courier New"]--===== Test the RBAR function ========================================================================================
(1000000 row(s) affected)
Total Duration in Seconds: 48
--===== Test the NON RBAR function ========================================================================================
(1000000 row(s) affected)
Total Duration in Seconds: 99
--===== Test the real set-based function ========================================================================================
(1000000 row(s) affected)
Total Duration in Seconds: 37 [/font]
If you do the simple math, the real set-based function handily beats the RBAR function by ~23%.
Person should not stick to the dogma, but understand truly what is one and what is the other method, and what is happening "under the hood".
At this point in time, my recommendation would be that you should practice what you preach before taking to preaching. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
May 10, 2011 at 8:56 am
Pete,
Your code is in the reply above. Look for the "dbo.Abbreviate" function.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 10, 2011 at 9:15 am
FYI, here are the results on my system (with my code throw in to test also):
--===== Test the RBAR function ========================================================================================
Total Duration in Seconds: 12
--===== Test the NON RBAR function ========================================================================================
Total Duration in Seconds: 26
--===== Test the real set-based function ========================================================================================
Total Duration in Seconds: 6
--===== Test Wayne's real set-based solution ========================================================================================
Total Duration in Seconds: 9
Results speak for themselves :Whistling:
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 10, 2011 at 9:24 am
Million thanks, Jeff. I look forward to trying it out your sql solution.
As I'm sure you've heard from many others before, your thoroughness of tackling a problem is really something to behold. I wish I had the time at work to be so complete when developing a solution and testing it. (A quick look at my email in-box shows I stopped reading SSC newsletters back in March! (still marked as "unread") -- except when the newsletter subject line read "Tally Oh!" -- of course, I knew to take moment to read that!) Finding time for R&D is tough these days...
Thanks again,
--Pete
May 10, 2011 at 9:29 am
peterzeke (5/10/2011)
Million thanks, Jeff. I look forward to trying it out your sql solution.As I'm sure you've heard from many others before, your thoroughness of tackling a problem is really something to behold. I wish I had the time at work to be so complete when developing a solution and testing it. (A quick look at my email in-box shows I stopped reading SSC newsletters back in March! (still marked as "unread") -- except when the newsletter subject line read "Tally Oh!" -- of course, I knew to take moment to read that!) Finding time for R&D is tough these days...
Thanks again,
--Pete
I'm sure you know this Pete, but throwing in a plug anyways: If you MAKE time for R&D it pays big dividends. Can't count the number of times I've been working through something myself or with teammates and said "Wait a minute, I remember <insert name here> talking about something like this..." (cue search for previous article/forum post) and then found a solution that improves our process by several orders of magnitude.
Standing on the shoulders of giants, and all that.
---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]
"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."
May 10, 2011 at 10:27 am
jcrawf02 (5/10/2011)
I'm sure you know this Pete, but throwing in a plug anyways: If you MAKE time for R&D it pays big dividends. Can't count the number of times I've been working through something myself or with teammates and said "Wait a minute, I remember <insert name here> talking about something like this..." (cue search for previous article/forum post) and then found a solution that improves our process by several orders of magnitude.
Standing on the shoulders of giants, and all that.
I agree 100% -- I've pushed many times that we need to spend the resources to build a "pipeline" instead of maintaining the "bucket brigade."
Nonetheless, I'm typically the R&D guy for my team, while passing along the golden nuggets other gurus have unearthed. But a couple of months ago a few people from my department left the company. Thus, the IT group is now a skeleton crew without much ability to back each other up these days. If someone gets hit by the proverbial bus any time soon we'll be in a really bad place.... The company doesn't have the funds to refill the positions that were recently vacated, either.
So, for the time being, we do the best we can, when we can. One of my current projects, however, caused me to pause long enough to post a plea for help as evidenced by this discussion thread. Fortunately, SSC has a terrific community of expert helpers!
Viewing 15 posts - 16 through 30 (of 56 total)
You must be logged in to reply to this topic. Login to reply