Splitting and aggregating a field with commas

  • Insert into dbo.NormalizedModel (Animal)

    Select ElementText

    from dbo.Split('YourStringHere', 'delimiter')

    Select Name as AnimalName, count(*) as Total

    from dbo.NormalizedModel

    group by Name

I count 2 statements of code here.
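The dbo.Split function called above isn't shown in the thread. Here's a minimal sketch of what such an inline table-valued function could look like, assuming a 1..8000 numbers table like the dbo.Tally built later in this discussion (the ElementText column name comes from the call above; everything else in this sketch is an assumption):

```sql
-- Hypothetical dbo.Split matching the call above; assumes dbo.Tally(n)
-- holds the numbers 1 through 8000.
CREATE FUNCTION dbo.Split (@String VARCHAR(8000), @Delimiter CHAR(1))
RETURNS TABLE
AS
RETURN
SELECT LTRIM(RTRIM(SUBSTRING(@String, n,
           CHARINDEX(@Delimiter, @String + @Delimiter, n) - n))) AS ElementText
  FROM dbo.Tally
 WHERE n <= LEN(@String)
   AND SUBSTRING(@Delimiter + @String, n, 1) = @Delimiter
```

With that in place, `dbo.Split('cat,dog', ',')` would return one row per element.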

I worked for a financial agency (you would know the name), and we were getting lots of data in from many sources. Every time, I would be told this was a "one time thing". IT NEVER WAS. I can assure you, if you force yourself to get into the practice of normalizing data whenever and from wherever you get it, you will be much happier with yourself!

    What I wrote was from a general parsing script I had from years ago.  If you keep these kinds of scripts around, they become very useful...

    I wasn't born stupid - I had to study.

  • Glad I don't need those here...

Funny. You handed me one of your most common ones just hours ago...

    I wasn't born stupid - I had to study.

That's not for splitting, that's for concatenating, and I actually never used it in production. I do use the split, though, for lists of IDs to process.
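The "list of IDs" pattern mentioned here usually looks something like this sketch (dbo.Orders and OrderID are made-up names for illustration; dbo.Split is the splitter discussed earlier in the thread):

```sql
-- Process only the rows whose IDs appear in a CSV parameter (sketch;
-- the table and column names here are hypothetical).
DECLARE @IDList VARCHAR(8000)
SET @IDList = '101,205,309'

SELECT o.*
  FROM dbo.Orders o
  JOIN dbo.Split(@IDList, ',') s
    ON o.OrderID = CAST(s.ElementText AS INT)
```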

  • With a small add of ',' at the end of each string you can do

declare @v table(val char(50))

insert into @v

    SELECT 'dog, mouse,cat,' UNION ALL

    SELECT 'mouse, dog, fox, hen ,'

    declare @t table(nmb int)

    declare @i int

    set @i=50

    while @i>0

    begin

    insert into @t select @i

    set @i=@i-1

    end

    select pet,count(*)

    from

    (select ltrim(reverse(ltrim(left(reverse(left(val,nmb-1)),

     case charindex(',',reverse(left(val,nmb-1)))-1

      when -1 then len(reverse(left(val,nmb-1)))

      else charindex(',',reverse(left(val,nmb-1)))-1

     end

)))) pet

    from @t,@v

    where substring(val,nmb,1)=',') a

    group by pet


    Kindest Regards,

    Vasc

You are talking about application requirements. How the application works has nothing, absolutely nothing at all, to do with how the data should be modeled. The database is a collection of facts about some real-world information model. The relational model is the proven (over and over again) best way to manage those facts, especially to make sure that data integrity is not violated.

    Whether or not the procedure will be executed once a week or twice every second is not interesting to me. The fact is that with this design it will take you much longer to create this procedure than it would to redesign the schema and then create a procedure that produces the same resulting output. And what about next time you need a similar requirement? You will again need to do a difficult implementation, again losing time.

Yes, whenever you are using a relational DBMS you should design correctly. It saves you time, gives you the best performance in most cases, and, most importantly, avoids problems with lost data integrity. And it's easier!

  • Hang on there, Paully21, I'll get to you in a minute...

I absolutely agree with everyone that having CSV columns in a table is a basic form of "death by SQL". Despite the best of intentions or reasons for the use of CSV columns, we all know it will punish you throughout the life of the database, much like creating a new archive database for every month will (a recent post a couple of us worked on).

    Still, it's an interesting problem and we can introduce Paully21 to the wonders of a Tally/Numbers table...

    Ok, Paully21... here we go...

There are lots of routines to split CSV columns and most of them aren't worth the trons they're stored in. There are a couple of really smart folks on this forum who have come up with the notion of a thing called a "numbers" table... being old school, I call it a "Tally" table because it's used to tally things during a process. Basically, it allows you to do the same things a loop would do, but without the loop. Splitting CSV columns is a really great example of Tally table use.

    Before we can do anything with a Tally table, we first have to create it... this is a one time deal.  Here's the script... hope you don't have a BSOFH for a DBA...

    --===== First, create the tally table with

    CREATE TABLE dbo.Tally (n SMALLINT IDENTITY(1,1) PRIMARY KEY, DoDah CHAR(1))

    --===== Populate it with a paltry 8000 numbers using a cross-join to avoid

         -- doing it with a loop (which wouldn't hurt here, just cool to do without).

         -- SYSCOLUMNS is guaranteed to have at least 256 entries so a single

         -- cross-join is good up to 65,536 numbers.  We only need 8k.

     INSERT INTO dbo.Tally (DoDah)

     SELECT TOP 8000 NULL

       FROM dbo.SYSCOLUMNS sc1,

            dbo.SYSCOLUMNS sc2

    --===== Drop the DoDah column... it's not needed anymore.

      ALTER TABLE dbo.Tally

            DROP COLUMN DoDah

The Tally table has 8000 numbers in it starting at 1, incrementing by 1, and ending at 8000. I gotta warn ya... for some reason, many DBAs just can't stand to part with the lousy 16 or 20 kbytes this table takes up, even though it can save hundreds of hours of processing time in a single year.
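If your DBA does let you build it, a quick sanity check confirms the table came out right:

```sql
-- Verify the Tally table built above: should be exactly 1 through 8000.
SELECT MIN(n) AS MinN, MAX(n) AS MaxN, COUNT(*) AS Total
  FROM dbo.Tally
-- Expect: MinN = 1, MaxN = 8000, Total = 8000
```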

    Next, we need a User Defined Function that will split any CSV string we throw at it.  While we're at it, let's add a wee bit o' logic to cleanup things like extra spaces and the like...

    CREATE FUNCTION dbo.fnSplit

    /**********************************************************

     Purpose:

     This function splits names and strings of words up to 8000

     characters long based on a delimiter.  Multiple adjacent

     spaces are reduced to single spaces throughout the input

     string.  Only one word is returned at a time. Leading and

     trailing spaces for each word/word combo are dropped.

     Usage:

     dbo.fnSplit(stringtosplit,wordtoreturn,delimiter)

     Notes: Don't change the datatype nor the length of

            @Delimiter in the code.  The code hasn't been set

            up to use multicharacter delimiters (yet).

     Revisions:

     Rev 00 - 08/17/2005 - Jeff Moden - Initial Creation/Test

    **********************************************************/

    --=========================================================

    --      Define the input/output parameters

    --=========================================================

    (

    @String    VARCHAR(8000),    --The string to split

    @WordNum   INT = 1,          --Which word to return

    @Delimiter CHAR(1) = ','     --What the delimiter is

    )

    RETURNS VARCHAR(8000)

    AS

    BEGIN

    --=========================================================

    --      Setup and Presets

    --=========================================================

    --===== Determine the max number of characters to process

         -- by counting characters in the string to split

    DECLARE @MaxPosit INT

        SET @MaxPosit = LEN(@String)

    --===== Create a place to store the NameParts

    DECLARE @Words TABLE

      (

      WordNum INT IDENTITY PRIMARY KEY,

      Word    VARCHAR(8000)

)

    --===== This converts multiple spaces into single spaces

         -- no matter how many occur or where in the string

     SELECT @String = REPLACE(@String,'  ',' ')

       FROM dbo.Tally

      WHERE [n] <= @MaxPosit/2+1

    --===== Add a delimiter to the beginning and end to allow

         -- words to be found at each end

        SET @String = @Delimiter+@String+@Delimiter

    --===== Recalc the maximum number of characters for speed

        SET @MaxPosit = LEN(@String)

    --=========================================================

    --      Find the desired word

    --=========================================================

    --===== Parse the NameParts and store them

     INSERT INTO @Words (Word)

     SELECT SUBSTRING(@String,[n]+1,

       CHARINDEX(@Delimiter,@String,[n]+1)-[n]-1

) AS Word

       FROM dbo.Tally

      WHERE [n] >= 1

        AND [n] < @MaxPosit - 1

        AND SUBSTRING(@String,[n],1) = @Delimiter

      ORDER BY [n]

    --===== Return the desired word without any leading or

         -- trailing spaces if they occur

     RETURN (

       SELECT LTRIM(RTRIM(Word))

         FROM @Words

        WHERE WordNum = @WordNum

)

    --=========================================================

    --      End of the Function

    --=========================================================

    END

Most of the above code is setup and cleanup of the inputs... the meat of the function is in the section called "Parse the NameParts and store them". It gets a little inefficient if you have hundreds or thousands of CSV values in the same record because it solves them all even though it only returns one at a time (I'm working on that). But it's still fast. Think of the function as "PARSENAME on steroids" because, like PARSENAME, you have to tell it which word (left to right, in this case) to return. You can even tell it what single-character delimiter you want to use.
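A few example calls may make the PARSENAME analogy concrete (assuming the fnSplit function above is installed as shown):

```sql
-- Pull individual words out of a CSV string by position.
SELECT dbo.fnSplit('cat, dog, mouse', 1, ',')  -- 'cat'
SELECT dbo.fnSplit('cat, dog, mouse', 3, ',')  -- 'mouse'
SELECT dbo.fnSplit('cat, dog, mouse', 4, ',')  -- NULL (there is no 4th word)
```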

    Now that we have that done, we can actually (finally) get to your original problem... I'll let the embedded comments do all the talking...

    --===== If temp table to hold animal lists exists, drop it and rebuild

         IF OBJECT_ID('TempDB..#Animals') IS NOT NULL

            DROP TABLE #Animals

 CREATE TABLE #Animals

     (ID INT IDENTITY(1,1),

     CSVAnimals VARCHAR(200))

    --===== Populate the Animals table with sample CSV data

     -- Notice the abhorrent fashion of the inputs

 INSERT INTO #Animals (CSVAnimals)

     SELECT 'cat, dog, mouse'        UNION ALL

     SELECT ' mouse,dog, fox , hen ' UNION ALL

     SELECT 'hen'                    UNION ALL

     SELECT ' hen '                  UNION ALL

     SELECT 'fox,hen'                UNION ALL

     SELECT 'mongoose, goldfish,cat' UNION ALL

     SELECT 'dog , cat'              UNION ALL

     SELECT 'pussy    cat, big dog'

    --===== All set... let's process... ==================================

--===== Create a temp table to hold each occurrence of each animal

         IF OBJECT_ID('TempDB..#MyTemp') IS NOT NULL

            DROP TABLE #MyTemp

     CREATE TABLE #MyTemp

     (ID INT IDENTITY(1,1),

     Animal VARCHAR(30))

    --===== Can't do 100% set based but we can do a column at a time

     -- where a "column" is the occurrence of an animal in the

         -- original table.  Example, the 'cat, dog, mouse' has

         -- 3 "columns" of data.  So, declare some variables to help

         -- us through that mess.

    DECLARE @ColCounter SMALLINT

    DECLARE @MaxCols    SMALLINT

    --===== Find the max number of columns to process by counting commas

         -- and adding 1

     SELECT @MaxCols = MAX(LEN(CSVAnimals)-LEN(REPLACE(CSVAnimals,',',''))+1)

   FROM #Animals

    --===== Now, split each animal out one column at a time but all rows

         -- in the source at the same time.  The counter keeps track of

         -- which column we're working on.  The example data at the

         -- beginning of this script will cause only 4 passes even

     -- though there are 8 rows of data because there are only 4 "columns".

     SELECT @ColCounter = 1

      WHILE @ColCounter <= @MaxCols

      BEGIN

             INSERT INTO #MyTemp (Animal)

             SELECT dbo.fnSplit(CSVAnimals,@ColCounter,',')

           FROM #Animals

              WHERE dbo.fnSplit(CSVAnimals,@ColCounter,',') IS NOT NULL

     SELECT @ColCounter = @ColCounter+1

        END

    --===== Now, calculate and display the counts for each animal

         -- from the single records we accumulated in the temp table.

     SELECT UPPER(Animal),COUNT(Animal) AS TheCount

       FROM #MyTemp

      GROUP BY Animal

      ORDER BY Animal

And, now, hopefully, you see what everyone was talking about in the previous 30 posts or so... CSV is a pain in the socks and should be avoided at all costs. As so many posts suggested, a normalized child table with one value per row would be much more effective in the long run. Yes, the steps above solve the problem, but look at all of the code you have to deal with... and that's just for one bloody scenario. Yes, the function helps in all scenarios, but it runs HUNDREDS of times slower than true, normalized, set-based processing. And it uses one of the fastest methods available for solving this type of problem (kudos to Adam Machanic and others).
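For contrast, here's a rough sketch of that normalized child table (the names are made up for illustration) and how the whole counting problem collapses once the data is stored one value per row:

```sql
-- One animal per row; ListID ties each row back to its source list.
CREATE TABLE dbo.AnimalList
 (ID     INT IDENTITY(1,1) PRIMARY KEY,
  ListID INT NOT NULL,
  Animal VARCHAR(30) NOT NULL)

-- With that design, the original requirement is a single set-based query:
SELECT UPPER(Animal), COUNT(*) AS TheCount
  FROM dbo.AnimalList
 GROUP BY Animal
 ORDER BY Animal
```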

    Lemme know how the code works out for you, eh?

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Nice work... too bad I won't ever need that.

My post got lost among some solutions that were NOT set-based, so I'll post it again (the WHILE loop was just to generate the numbers table).

    SET BASED solution

declare @v table(val varchar(50))

insert into @v

    SELECT 'dog, mouse,cat' UNION ALL

    SELECT 'mouse, dog, fox, hen'

    declare @t table(nmb int identity(1,1),[dummy] char(1))

    insert into @t ([dummy])

    select top 51 NULL from master.dbo.syscolumns

    select pet,count(*)

    from

    (select ltrim(reverse(ltrim(left(reverse(left(val+',',nmb-1)),

     case charindex(',',reverse(left(val+',',nmb-1)))-1

      when -1 then len(reverse(left(val+',',nmb-1)))

      else charindex(',',reverse(left(val+',',nmb-1)))-1

     end)))) pet

    from @t,@v

    where substring(val+',',nmb,1)=',') a

    group by pet
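The nested REVERSE/CHARINDEX calls above extract the element that ends at each comma by reading the string backwards. A forward-reading formulation of the same derived table (my sketch, not from the original post) may be easier to follow:

```sql
-- Same @t/@v tables as above: each nmb that sits on a delimiter in
-- ','+val marks the start of an element in val.
select ltrim(rtrim(substring(val, nmb,
       charindex(',', val + ',', nmb) - nmb))) as pet
from @t, @v
where nmb <= len(val)
  and substring(',' + val, nmb, 1) = ','
```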


    Kindest Regards,

    Vasc

  • Yeah, I wish I didn't need it!  Wonderful 3rd party solution at work.  Bunch o' DB rookies... these are the same wonderful folks that built an Oracle-Like sequence table (NextID) into their wonderful frigging system because they didn't know how to spell "IDENTITY".  To give you a hint of how bad this 3rd party software is, the NextID table was the source of an average of 620 deadlocks PER DAY when I started work there about 20 months ago.  Now THAT was fun to fix!

    --Jeff Moden



I can't even imagine the nightmare... I'm just hoping that it was not a 24/7 shop and that you had time to apply the updates... not to mention a kick-ass testing/QA system.

Wasn't that lucky... it's 24/7. We get to bounce the server once a week for "impacting changes" for about 30 minutes... then QA gets to verify what we've done. Of course, we test the heck out of the changes on another server before any code even comes close to the production box.

    --Jeff Moden



  • I didn't read this whole post, but when I got to the part "why is this bad design?", I stopped to post a reply.

You are violating a rule called "1st Normal Form":

"To disallow multivalued attributes, composite attributes, and their combinations. Only atomic (simple, indivisible) values."

ISBN: 0321122267
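In table form, the rule reads like this (names invented for illustration):

```sql
-- Violates 1NF: one column holds a whole list.
--   Animals(ID, CSVAnimals)   e.g. (1, 'cat, dog, mouse')

-- Conforms to 1NF: every column value is atomic.
CREATE TABLE dbo.Sighting
 (ID      INT IDENTITY(1,1) PRIMARY KEY,
  OwnerID INT NOT NULL,
  Animal  VARCHAR(30) NOT NULL)  -- exactly one animal per row
```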

     
