May 11, 2011 at 7:36 am
Mark-101232 (5/11/2011)
For completeness, here's a simple SQLCLR to perform the same task, appears to perform okay.
Now we know part of the reason why I usually provide a mountain of generated test data... I love speed! Thanks for the code and for running the test, Mark. π
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2011 at 7:47 am
garima.khandelwal (5/11/2011)
T-SQL program fetching the first character of each word in the given string :DECLARE @STRING VARCHAR(8000);
DECLARE @Pos int =1;
DECLARE @FINALSTRING VARCHAR(8000) = '';
DECLARE @COUNTER int = 0;
DECLARE @DELIMINATOR VARCHAR(1) = ' ';
SET @STRING = 'COMMUNITY GENERAL HOSPITAL';
/*Finds the last occurrence of the delimitated character*/
SET @COUNTER = LEN(@STRING) - CHARINDEX(@DELIMINATOR,REVERSE(@STRING))
/*Fetches first character*/
SET @FINALSTRING = SUBSTRING(@STRING,@Pos,1)
WHILE ( @Pos <= @COUNTER)
BEGIN
/*Finds the other occurrences of the delimitated character*/
SET @Pos = CHARINDEX(@DELIMINATOR,@STRING,@Pos+1)
SET @FINALSTRING = @FINALSTRING + substring(@STRING,@Pos+1,1);
END
SELECT @FINALSTRING
Just a reminder... this particular thread is on an SQL Server 2005 forum. The SQL Server 2008 declaration tricks you used won't work in SQL Server 2005.
Also, the delimiter isn't actually a space character, in this case. Rather it's the second character of a space followed by either a letter or a digit. If we use your code (converted for proper operation in 2005) with a string that has extra spaces in it, you'll see what I mean...
DECLARE @STRING VARCHAR(8000);
DECLARE @Pos INT
SELECT @Pos =1;
DECLARE @FINALSTRING VARCHAR(8000)
SELECT @FINALSTRING= '';
DECLARE @COUNTER int
SELECT @Counter = 0;
DECLARE @DELIMINATOR VARCHAR(1)
SELECT @Deliminator = ' ';
SET @STRING = 'COMMUNITY GENERAL HOSPITAL';
/*Finds the last occurrence of the delimitated character*/
SET @COUNTER = LEN(@STRING) - CHARINDEX(@DELIMINATOR,REVERSE(@STRING))
/*Fetches first character*/
SET @FINALSTRING = SUBSTRING(@STRING,@Pos,1)
WHILE ( @Pos <= @COUNTER)
BEGIN
/*Finds the other occurrences of the delimitated character*/
SET @Pos = CHARINDEX(@DELIMINATOR,@STRING,@Pos+1)
SET @FINALSTRING = @FINALSTRING + substring(@STRING,@Pos+1,1);
END
SELECT @FINALSTRING
Results:
-------------------
C GH
(1 row(s) affected)
Also, the use of a WHILE loop (especially with the rather expensive REVERSE function) is one of the slowest ways to do a split. Please see the following article for a huge amount of testing on that...
http://www.sqlservercentral.com/articles/Tally+Table/72993/
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2011 at 8:03 am
peterzeke (5/11/2011)
Jeff Moden (5/10/2011)
Pete,Based on the findings in the post above, will you ever have a condition where the first character isn't a letter?
Jeff -- the first character should just be a letter.
By the way, by "first character" I'm referring to the first character of the entire string, not the first character of every word within the string.
(There is one exception where the healthcare provider's name is "1-800-contacts", but this provider will never play a role in my analyses, at least regarding a need to create abbreviations.)
Given the sketchy quality of healthcare data that we receive, however, it's possible something could slip by our scrubbing routines that we've never encountered before. If you've ever worked with healthcare data, I'm sure you know what I'm talking about.
Oh yeah. It has always amazed me at how bad the condition of the data can be for something so very important. I've also seen hyphenated words and a whole bunch of other stuff that may affect abbreviations for you. I'm getting ready to haul it into work so I'll have to get back to this tonight but we can modify the code to handle even those eventualities and also make it so that "1-800-contacts" returns "18c".
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2011 at 8:41 am
Pete, curious why you're doing this? Won't you have to explain the abbreviations to users anyway? What about multiples/duplicates (Mercy Hospital / Mercer Hospital / Mary's Hospital/ etc?) Are you going to pair the abbrev with the NPI or something to "uniquify" it?
---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]
"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."
May 11, 2011 at 9:26 am
Jeff Moden (5/11/2011)
Also, the delimiter isn't actually a space character, in this case. Rather it's the second character of a space followed by either a letter or a digit. If we use your code (converted for proper operation in 2005) with a string that has extra spaces in it, you'll see what I mean...Results:
-------------------
C GH
(1 row(s) affected)
Aw, now that is something that I didn't consider in the code that I posted above either. Good catch Jeff.
Please see the following article for a huge amount of testing on that... http://www.sqlservercentral.com/articles/Tally+Table/72993/
Yes there was quite a bit of testing on that... both for the article, and then again afterwards in the discussion. You should read both!
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
May 11, 2011 at 9:43 am
jcrawf02 (5/11/2011)
Pete, curious why you're doing this? Won't you have to explain the abbreviations to users anyway? What about multiples/duplicates (Mercy Hospital / Mercer Hospital / Mary's Hospital/ etc?) Are you going to pair the abbrev with the NPI or something to "uniquify" it?
I'm working on developing a patient dashboard. One of the requirements is to display the facility name of when lab/screening/test value results are not available (i.e., patient had an LDL on 1/1/2011 but the result is not available). In those instances, our case managers can contact the facilities to obtain results. I'm ultimately contending with a "real estate" issue regarding how to fit everything on a single screen/page, especially since the requirement for displaying lab data involves showing the past four dates of lab services. Fitting lab/screening type, lab date, and lab result side-by-side (i.e., like a pivot table with lab test names on the left-most column, and lab dates/results on columns) is a tight fit. Adding in the name of the facility makes it that much harder, so abbreviations should dodge the real estate issue, especially when forcing the facility name in the lab value field if the lab value is null/empty.
If it seems I'm running into abbreviation confusion as you indicated, then I'll test whether taking the first "word" of the facility combined with the abbreviation will resolve any confusion. Also, I'll likely test as to whether having an abbreviation key in the page footer will provide clarity.
May 11, 2011 at 10:00 am
Totally understand the real estate issue. Bet they want to print it out too, right? π
We've been working on a patient dashboard ourselves but haven't gotten the lab values in there yet, I'll PM you with what we have, maybe we can bounce ideas off each other to mutual advantage.
---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]
"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."
May 11, 2011 at 2:47 pm
Thank you all π
May 11, 2011 at 8:26 pm
peterzeke (5/11/2011)
Jeff Moden (5/10/2011)
Pete,Based on the findings in the post above, will you ever have a condition where the first character isn't a letter?
Jeff -- the first character should just be a letter.
By the way, by "first character" I'm referring to the first character of the entire string, not the first character of every word within the string.
(There is one exception where the healthcare provider's name is "1-800-contacts", but this provider will never play a role in my analyses, at least regarding a need to create abbreviations.)
Given the sketchy quality of healthcare data that we receive, however, it's possible something could slip by our scrubbing routines that we've never encountered before. If you've ever worked with healthcare data, I'm sure you know what I'm talking about.
Heh... That bloody first character problem was really starting to eat at me. I tried a couple of things that made for a pretty nasty slow down and then it dawned on me how simple this whole thing is. We'd already isolated the first character as the number 1 digit. The problem was identifying when to use it. If the answer was a snake, it would have bit me. I've REALLY got to get more sleep! :alien:
So, here's the new function with a very slight modification that only affects performance a little bit:
--===== Build a real set-based function to find the first letter of each word using an iTVF.
IF OBJECT_ID('tempdb.dbo.Abbreviate') IS NOT NULL
DROP FUNCTION dbo.Abbreviate
;
GO
CREATE FUNCTION dbo.Abbreviate
--===== Define I/O parameters
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 WHERE LEFT(@pString,1) LIKE '[A-z0-9]' UNION ALL
SELECT t.N+1
FROM cteTally t
WHERE SUBSTRING(@pString,t.N,2) COLLATE Latin1_General_BIN
LIKE '[^A-z0-9][A-z0-9]' COLLATE Latin1_General_BIN
)
--===== Do the actual split of each character following a space and concatenate it.
SELECT Abbreviation = (SELECT SUBSTRING(@pString, s.N1,1) FROM cteStart s FOR XML PATH(''))
;
GO
I also changed the LIKE filter so that it would βbreakβ on anything that wasn't a letter or digit so that things like "1-800-contacts" would be abbreviated as "18c".
I also wanted to really make sure that there weren't likely to be any other possibilities I may have missed, so I ran it against the following test...
--===== Declare and preset some string to test with
DECLARE @SomeString VARCHAR(8000);
SELECT @SomeString = '###Some ##wierd ## sTrInG Level-3';
--===== Shift the string left to step through nearly every possibility
-- and show what the resulting abbreviation for the shifted string is.
WITH
cteTestString AS
(
SELECT RowNum = ROW_NUMBER() OVER (ORDER BY t.Number),
TestString = RIGHT(@SomeString, DATALENGTH(@SomeString)- t.Number)
FROM master.dbo.spt_values t
WHERE [Type] = 'P'
AND Number BETWEEN 0 AND DATALENGTH(@SomeString)
)
SELECT cte.TestString, abbv.Abbreviation
FROM cteTestString cte
CROSS APPLY dbo.Abbreviate(cte.TestString) abbv
ORDER BY cte.RowNum
;
No anomalies were detected during the visual part of that check.
Then I ran it against the million row test code like we did before. On my machine, it was only two seconds slower. On some of those rockets you folks tested on, the differences in performance will like be measured in milliseconds.
Just to be clear, I currently have the LIKE filter setup to return the first letter (regardless of case) or digit after anything that is not a letter or a digit. The first character is treated in a similar manner... it must be a letter or digit to be considered for the return.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 12, 2011 at 7:44 am
Jeff Moden (5/11/2011)
So, here's the new function with a very slight modification that only affects performance a little bit:
--===== Build a real set-based function to find the first letter of each word using an iTVF.
IF OBJECT_ID('tempdb.dbo.Abbreviate') IS NOT NULL
DROP FUNCTION dbo.Abbreviate
;
GO
CREATE FUNCTION dbo.Abbreviate
--===== Define I/O parameters
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 WHERE LEFT(@pString,1) LIKE '[A-z0-9]' UNION ALL
SELECT t.N+1
FROM cteTally t
WHERE SUBSTRING(@pString,t.N,2) COLLATE Latin1_General_BIN
LIKE '[^A-z0-9][A-z0-9]' COLLATE Latin1_General_BIN
)
--===== Do the actual split of each character following a space and concatenate it.
SELECT Abbreviation = (SELECT SUBSTRING(@pString, s.N1,1) FROM cteStart s FOR XML PATH(''))
;
GO
Pure genius! The basic form of it is remains essentially a tally table splitter combined with xml concantenation - awesome.
Question: I don't quite understand why a tally table that's built "on the fly" is used rather than referencing a prebuilt tally table. Is the performance quite different using cteTally vs. an actual tally table?
Also, defining the delimiter as actually two characters (LIKE '[^A-z0-9][A-z0-9]') is just too damn clever!
--Pete
May 12, 2011 at 7:51 am
jcrawf02 (5/10/2011)
You know Jeff, eventually "ol' Paint" needs to take a trip to the glue factory π
:laugh: Too funny! I like it! ("glue factory" -- Ha! -- way better than saying "out to pasture")
May 12, 2011 at 8:53 pm
peterzeke (5/12/2011)
Question: I don't quite understand why a tally table that's built "on the fly" is used rather than referencing a prebuilt tally table. Is the performance quite different using cteTally vs. an actual tally table?
Thanks for the kudo's, Pete.
The "on the fly" Tally Table and the real Tally Table end up taking turns winning at various points in the "power curve". Even though a real Tally Table get's cached in memory, it uses a different mechanism that building it in memory using Itzek's basic high speed, cross joined, cascaded CTE's. The advantage of the CTE method is that you don't have to convince some nervous and appropriately cautious DBA into allowing an "Auxilliary Table of Numbers" (Tally Table) to be constructed and, just by adding one more layer of cross joins, you can make for some pretty big numbers in a hurry. One more layer would be 10,0002 or 100,000,000. Just one more lay after that would give you 10,000,000,000,000,000 (1E+16). Heh... even as fast as it is an how few "READS" it causes, you really wouldn't want to wait for that one to build. π
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 12 posts - 46 through 56 (of 56 total)
You must be logged in to reply to this topic. Login to reply