Get the First letter of each word within a string and returned a concatenated string

Question

Get the First letter of each word within a string and returned a concatenated string

Viewing 12 posts - 46 through 56 (of 56 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004553 More actions · Answer 1

Mark-101232 (5/11/2011)
For completeness, here's a simple SQLCLR to perform the same task, appears to perform okay.

Now we know part of the reason why I usually provide a mountain of generated test data... I love speed! Thanks for the code and for running the test, Mark.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004553 More actions · Answer 2

garima.khandelwal (5/11/2011)
T-SQL program fetching the first character of each word in the given string :
DECLARE @STRING VARCHAR(8000);
DECLARE @Pos int =1;
DECLARE @FINALSTRING VARCHAR(8000) = '';
DECLARE @COUNTER int = 0;
DECLARE @DELIMINATOR VARCHAR(1) = ' ';
SET @STRING = 'COMMUNITY GENERAL HOSPITAL';
/*Finds the last occurrence of the delimitated character*/
SET @COUNTER = LEN(@STRING) - CHARINDEX(@DELIMINATOR,REVERSE(@STRING))
/*Fetches first character*/
SET @FINALSTRING = SUBSTRING(@STRING,@Pos,1)
WHILE ( @Pos <= @COUNTER)
BEGIN
/*Finds the other occurrences of the delimitated character*/
SET @Pos = CHARINDEX(@DELIMINATOR,@STRING,@Pos+1)
SET @FINALSTRING = @FINALSTRING + substring(@STRING,@Pos+1,1);
END
SELECT @FINALSTRING

Just a reminder... this particular thread is on an SQL Server 2005 forum. The SQL Server 2008 declaration tricks you used won't work in SQL Server 2005.

Also, the delimiter isn't actually a space character, in this case. Rather it's the second character of a space followed by either a letter or a digit. If we use your code (converted for proper operation in 2005) with a string that has extra spaces in it, you'll see what I mean...

DECLARE @STRING VARCHAR(8000);

DECLARE @Pos INT

SELECT @Pos =1;

DECLARE @FINALSTRING VARCHAR(8000)

SELECT @FINALSTRING= '';

DECLARE @COUNTER int

SELECT @Counter = 0;

DECLARE @DELIMINATOR VARCHAR(1)

SELECT @Deliminator = ' ';

SET @STRING = 'COMMUNITY GENERAL HOSPITAL';

/*Finds the last occurrence of the delimitated character*/

SET @COUNTER = LEN(@STRING) - CHARINDEX(@DELIMINATOR,REVERSE(@STRING))

/*Fetches first character*/

SET @FINALSTRING = SUBSTRING(@STRING,@Pos,1)

WHILE ( @Pos <= @COUNTER)

BEGIN

/*Finds the other occurrences of the delimitated character*/

SET @Pos = CHARINDEX(@DELIMINATOR,@STRING,@Pos+1)

SET @FINALSTRING = @FINALSTRING + substring(@STRING,@Pos+1,1);

END

SELECT @FINALSTRING

Results:

-------------------

C GH

(1 row(s) affected)

Also, the use of a WHILE loop (especially with the rather expensive REVERSE function) is one of the slowest ways to do a split. Please see the following article for a huge amount of testing on that...

http://www.sqlservercentral.com/articles/Tally+Table/72993/

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004553 More actions · Answer 3

peterzeke (5/11/2011)
Jeff Moden (5/10/2011)
Pete,
Based on the findings in the post above, will you ever have a condition where the first character isn't a letter?
Jeff -- the first character should just be a letter.
By the way, by "first character" I'm referring to the first character of the entire string, not the first character of every word within the string.
(There is one exception where the healthcare provider's name is "1-800-contacts", but this provider will never play a role in my analyses, at least regarding a need to create abbreviations.)
Given the sketchy quality of healthcare data that we receive, however, it's possible something could slip by our scrubbing routines that we've never encountered before. If you've ever worked with healthcare data, I'm sure you know what I'm talking about.

Oh yeah. It has always amazed me at how bad the condition of the data can be for something so very important. I've also seen hyphenated words and a whole bunch of other stuff that may affect abbreviations for you. I'm getting ready to haul it into work so I'll have to get back to this tonight but we can modify the code to handle even those eventualities and also make it so that "1-800-contacts" returns "18c".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

jcrawf02 SSC-Insane Points: 24198 More actions · Answer 4

Pete, curious why you're doing this? Won't you have to explain the abbreviations to users anyway? What about multiples/duplicates (Mercy Hospital / Mercer Hospital / Mary's Hospital/ etc?) Are you going to pair the abbrev with the NPI or something to "uniquify" it?

---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]

"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

WayneS SSC Guru Points: 95461 More actions · Answer 5

Jeff Moden (5/11/2011)
Also, the delimiter isn't actually a space character, in this case. Rather it's the second character of a space followed by either a letter or a digit. If we use your code (converted for proper operation in 2005) with a string that has extra spaces in it, you'll see what I mean...
Results:
-------------------
C GH
(1 row(s) affected)

Aw, now that is something that I didn't consider in the code that I posted above either. Good catch Jeff.

Please see the following article for a huge amount of testing on that... http://www.sqlservercentral.com/articles/Tally+Table/72993/

Yes there was quite a bit of testing on that... both for the article, and then again afterwards in the discussion. You should read both!

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 6

jcrawf02 (5/11/2011)
Pete, curious why you're doing this? Won't you have to explain the abbreviations to users anyway? What about multiples/duplicates (Mercy Hospital / Mercer Hospital / Mary's Hospital/ etc?) Are you going to pair the abbrev with the NPI or something to "uniquify" it?

I'm working on developing a patient dashboard. One of the requirements is to display the facility name of when lab/screening/test value results are not available (i.e., patient had an LDL on 1/1/2011 but the result is not available). In those instances, our case managers can contact the facilities to obtain results. I'm ultimately contending with a "real estate" issue regarding how to fit everything on a single screen/page, especially since the requirement for displaying lab data involves showing the past four dates of lab services. Fitting lab/screening type, lab date, and lab result side-by-side (i.e., like a pivot table with lab test names on the left-most column, and lab dates/results on columns) is a tight fit. Adding in the name of the facility makes it that much harder, so abbreviations should dodge the real estate issue, especially when forcing the facility name in the lab value field if the lab value is null/empty.

If it seems I'm running into abbreviation confusion as you indicated, then I'll test whether taking the first "word" of the facility combined with the abbreviation will resolve any confusion. Also, I'll likely test as to whether having an abbreviation key in the page footer will provide clarity.

jcrawf02 SSC-Insane Points: 24198 More actions · Answer 7

Totally understand the real estate issue. Bet they want to print it out too, right?

We've been working on a patient dashboard ourselves but haven't gotten the lab values in there yet, I'll PM you with what we have, maybe we can bounce ideas off each other to mutual advantage.

---------------------------------------------------------
How best to post your question[/url]
How to post performance problems[/url]
Tally Table:What it is and how it replaces a loop[/url]

"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

Vedran Kesegic SSCrazy Eights Points: 9302 More actions · Answer 8

Thank you all

_____________________________________________________
Microsoft Certified Master: SQL Server 2008
XDetails Addin - for SQL Developers
blog.sqlxdetails.com - Transaction log myths

Jeff Moden SSC Guru Points: 1004553 More actions · Answer 9

peterzeke (5/11/2011)
Jeff Moden (5/10/2011)
Pete,
Based on the findings in the post above, will you ever have a condition where the first character isn't a letter?
Jeff -- the first character should just be a letter.
By the way, by "first character" I'm referring to the first character of the entire string, not the first character of every word within the string.
(There is one exception where the healthcare provider's name is "1-800-contacts", but this provider will never play a role in my analyses, at least regarding a need to create abbreviations.)
Given the sketchy quality of healthcare data that we receive, however, it's possible something could slip by our scrubbing routines that we've never encountered before. If you've ever worked with healthcare data, I'm sure you know what I'm talking about.

Heh... That bloody first character problem was really starting to eat at me. I tried a couple of things that made for a pretty nasty slow down and then it dawned on me how simple this whole thing is. We'd already isolated the first character as the number 1 digit. The problem was identifying when to use it. If the answer was a snake, it would have bit me. I've REALLY got to get more sleep! :alien:

So, here's the new function with a very slight modification that only affects performance a little bit:

--===== Build a real set-based function to find the first letter of each word using an iTVF.

IF OBJECT_ID('tempdb.dbo.Abbreviate') IS NOT NULL

DROP FUNCTION dbo.Abbreviate

;

GO

CREATE FUNCTION dbo.Abbreviate

--===== Define I/O parameters

(@pString VARCHAR(8000))

RETURNS TABLE WITH SCHEMABINDING AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

-- enough to cover VARCHAR(8000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

),

cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)

SELECT 1 WHERE LEFT(@pString,1) LIKE '[A-z0-9]' UNION ALL

SELECT t.N+1

FROM cteTally t

WHERE SUBSTRING(@pString,t.N,2) COLLATE Latin1_General_BIN

LIKE '[^A-z0-9][A-z0-9]' COLLATE Latin1_General_BIN

)

--===== Do the actual split of each character following a space and concatenate it.

SELECT Abbreviation = (SELECT SUBSTRING(@pString, s.N1,1) FROM cteStart s FOR XML PATH(''))

;

GO

I also changed the LIKE filter so that it would “break” on anything that wasn't a letter or digit so that things like "1-800-contacts" would be abbreviated as "18c".

I also wanted to really make sure that there weren't likely to be any other possibilities I may have missed, so I ran it against the following test...

--===== Declare and preset some string to test with

DECLARE @SomeString VARCHAR(8000);

SELECT @SomeString = '###Some ##wierd ## sTrInG Level-3';

--===== Shift the string left to step through nearly every possibility

-- and show what the resulting abbreviation for the shifted string is.

WITH

cteTestString AS

(

SELECT RowNum = ROW_NUMBER() OVER (ORDER BY t.Number),

TestString = RIGHT(@SomeString, DATALENGTH(@SomeString)- t.Number)

FROM master.dbo.spt_values t

WHERE [Type] = 'P'

AND Number BETWEEN 0 AND DATALENGTH(@SomeString)

)

SELECT cte.TestString, abbv.Abbreviation

FROM cteTestString cte

CROSS APPLY dbo.Abbreviate(cte.TestString) abbv

ORDER BY cte.RowNum

;

No anomalies were detected during the visual part of that check.

Then I ran it against the million row test code like we did before. On my machine, it was only two seconds slower. On some of those rockets you folks tested on, the differences in performance will like be measured in milliseconds.

Just to be clear, I currently have the LIKE filter setup to return the first letter (regardless of case) or digit after anything that is not a letter or a digit. The first character is treated in a similar manner... it must be a letter or digit to be considered for the return.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 10

Jeff Moden (5/11/2011)
So, here's the new function with a very slight modification that only affects performance a little bit:
--===== Build a real set-based function to find the first letter of each word using an iTVF.
IF OBJECT_ID('tempdb.dbo.Abbreviate') IS NOT NULL
DROP FUNCTION dbo.Abbreviate
;
GO
CREATE FUNCTION dbo.Abbreviate
--===== Define I/O parameters
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 WHERE LEFT(@pString,1) LIKE '[A-z0-9]' UNION ALL
SELECT t.N+1
FROM cteTally t
WHERE SUBSTRING(@pString,t.N,2) COLLATE Latin1_General_BIN
LIKE '[^A-z0-9][A-z0-9]' COLLATE Latin1_General_BIN
)
--===== Do the actual split of each character following a space and concatenate it.
SELECT Abbreviation = (SELECT SUBSTRING(@pString, s.N1,1) FROM cteStart s FOR XML PATH(''))
;
GO

Pure genius! The basic form of it is remains essentially a tally table splitter combined with xml concantenation - awesome.

Question: I don't quite understand why a tally table that's built "on the fly" is used rather than referencing a prebuilt tally table. Is the performance quite different using cteTally vs. an actual tally table?

Also, defining the delimiter as actually two characters (LIKE '[^A-z0-9][A-z0-9]') is just too damn clever!

--Pete

peterzeke SSCrazy Eights Points: 9003 More actions · Answer 11

jcrawf02 (5/10/2011)
You know Jeff, eventually "ol' Paint" needs to take a trip to the glue factory

:laugh: Too funny! I like it! ("glue factory" -- Ha! -- way better than saying "out to pasture")

Jeff Moden SSC Guru Points: 1004553 More actions · Answer 12

peterzeke (5/12/2011)
Question: I don't quite understand why a tally table that's built "on the fly" is used rather than referencing a prebuilt tally table. Is the performance quite different using cteTally vs. an actual tally table?

Thanks for the kudo's, Pete.

The "on the fly" Tally Table and the real Tally Table end up taking turns winning at various points in the "power curve". Even though a real Tally Table get's cached in memory, it uses a different mechanism that building it in memory using Itzek's basic high speed, cross joined, cascaded CTE's. The advantage of the CTE method is that you don't have to convince some nervous and appropriately cautious DBA into allowing an "Auxilliary Table of Numbers" (Tally Table) to be constructed and, just by adding one more layer of cross joins, you can make for some pretty big numbers in a hurry. One more layer would be 10,000² or 100,000,000. Just one more lay after that would give you 10,000,000,000,000,000 (1E+16). Heh... even as fast as it is an how few "READS" it causes, you really wouldn't want to wait for that one to build.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Get the First letter of each word within a string and returned a concatenated string

Cookies on SQLServerCentral