Splitting and aggregating a field with commas

  • Insert into dbo.NormalizedModel (Animal)

    Select ElementText

    from dbo.Split('YourStringHere', 'delimiter')

    Select Name as AnimalName, count(*) as Total

    from dbo.NormalizedModel

    group by Name

I count 2 statements of code here.
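The dbo.Split function called above isn't shown in the thread. Here's a minimal sketch of what such an inline table-valued function could look like, assuming a 1..8000 numbers table like the dbo.Tally built later in this discussion (the ElementText column name comes from the call above; everything else in this sketch is an assumption):

```sql
-- Hypothetical dbo.Split matching the call above; assumes dbo.Tally(n)
-- holds the numbers 1 through 8000.
CREATE FUNCTION dbo.Split (@String VARCHAR(8000), @Delimiter CHAR(1))
RETURNS TABLE
AS
RETURN
SELECT LTRIM(RTRIM(SUBSTRING(@String, n,
           CHARINDEX(@Delimiter, @String + @Delimiter, n) - n))) AS ElementText
  FROM dbo.Tally
 WHERE n <= LEN(@String)
   AND SUBSTRING(@Delimiter + @String, n, 1) = @Delimiter
```

With that in place, `dbo.Split('cat,dog', ',')` would return one row per element.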

I worked for a financial agency (you would know the name), and we were getting lots of data in from many sources. Every time, I would be told this was a "one time thing". IT NEVER WAS. I can assure you, if you force yourself to get into the practice of normalizing data whenever and from wherever you get it, you will be much happier with yourself!

    What I wrote was from a general parsing script I had from years ago.  If you keep these kinds of scripts around, they become very useful...

    I wasn't born stupid - I had to study.

  • Glad I don't need those here...

Funny. You handed me one of your most common ones just hours ago...

    I wasn't born stupid - I had to study.

That's not for splitting, that's for concatenating, and I actually never used it in production. I do use the split, though, for lists of IDs to process.
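The "list of IDs" pattern mentioned here usually looks something like this sketch (dbo.Orders and OrderID are made-up names for illustration; dbo.Split is the splitter discussed earlier in the thread):

```sql
-- Process only the rows whose IDs appear in a CSV parameter (sketch;
-- the table and column names here are hypothetical).
DECLARE @IDList VARCHAR(8000)
SET @IDList = '101,205,309'

SELECT o.*
  FROM dbo.Orders o
  JOIN dbo.Split(@IDList, ',') s
    ON o.OrderID = CAST(s.ElementText AS INT)
```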

  • With a small add of ',' at the end of each string you can do

declare @v table(val char(50))

insert into @v

    SELECT 'dog, mouse,cat,' UNION ALL

    SELECT 'mouse, dog, fox, hen ,'

    declare @t table(nmb int)

    declare @i int

    set @i=50

    while @i>0

    begin

    insert into @t select @i

    set @i=@i-1

    end

    select pet,count(*)

    from

    (select ltrim(reverse(ltrim(left(reverse(left(val,nmb-1)),

     case charindex(',',reverse(left(val,nmb-1)))-1

      when -1 then len(reverse(left(val,nmb-1)))

      else charindex(',',reverse(left(val,nmb-1)))-1

     end

)))) pet

    from @t,@v

    where substring(val,nmb,1)=',') a

    group by pet


    Kindest Regards,

    Vasc

You are talking about application requirements. How the application works has nothing, absolutely nothing at all, to do with how the data should be modeled. The database is a collection of facts about some real-world information model. The relational model is the proven (over and over again) best way to manage those facts, especially to make sure that data integrity is not violated.

    Whether or not the procedure will be executed once a week or twice every second is not interesting to me. The fact is that with this design it will take you much longer to create this procedure than it would to redesign the schema and then create a procedure that produces the same resulting output. And what about next time you need a similar requirement? You will again need to do a difficult implementation, again losing time.

Yes, whenever you are using a relational DBMS you should design correctly. It saves you time, gives you the best performance in most cases, and, most importantly, avoids problems with lost data integrity. And it's easier!

  • Hang on there, Paully21, I'll get to you in a minute...

I absolutely agree with everyone that having CSV columns in a table is a basic form of "death by SQL". Despite the best of intentions or reasons for the use of CSV columns, we all know it will punish you throughout the life of the database, much like creating a new archive database for every month will (a recent post a couple of us worked on).

    Still, it's an interesting problem and we can introduce Paully21 to the wonders of a Tally/Numbers table...

    Ok, Paully21... here we go...

There are lots of routines to split CSV columns and most of them aren't worth the trons they're stored in. There are a couple of really smart folks on this forum who have come up with the notion of a thing called a "numbers" table... being old school, I call it a "Tally" table because it's used to tally things during a process. Basically, it allows you to do the same things a loop would do, but without the loop. Splitting CSV columns is a really great example of Tally table use.

    Before we can do anything with a Tally table, we first have to create it... this is a one time deal.  Here's the script... hope you don't have a BSOFH for a DBA...

    --===== First, create the tally table with

    CREATE TABLE dbo.Tally (n SMALLINT IDENTITY(1,1) PRIMARY KEY, DoDah CHAR(1))

    --===== Populate it with a paltry 8000 numbers using a cross-join to avoid

         -- doing it with a loop (which wouldn't hurt here, just cool to do without).

         -- SYSCOLUMNS is guaranteed to have at least 256 entries so a single

         -- cross-join is good up to 65,536 numbers.  We only need 8k.

     INSERT INTO dbo.Tally (DoDah)

     SELECT TOP 8000 NULL

       FROM dbo.SYSCOLUMNS sc1,

            dbo.SYSCOLUMNS sc2

    --===== Drop the DoDah column... it's not needed anymore.

      ALTER TABLE dbo.Tally

            DROP COLUMN DoDah

The Tally table has 8000 numbers in it starting at 1, incrementing by 1, and ending at 8000. I gotta warn ya... for some reason, many DBAs just can't stand to part with the lousy 16 or 20 kbytes this table takes up, even though it can save hundreds of hours of processing time in a single year.
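If your DBA does let you build it, a quick sanity check confirms the table came out right:

```sql
-- Verify the Tally table built above: should be exactly 1 through 8000.
SELECT MIN(n) AS MinN, MAX(n) AS MaxN, COUNT(*) AS Total
  FROM dbo.Tally
-- Expect: MinN = 1, MaxN = 8000, Total = 8000
```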

    Next, we need a User Defined Function that will split any CSV string we throw at it.  While we're at it, let's add a wee bit o' logic to cleanup things like extra spaces and the like...

    CREATE FUNCTION dbo.fnSplit

    /**********************************************************

     Purpose:

     This function splits names and strings of words up to 8000

     characters long based on a delimiter.  Multiple adjacent

     spaces are reduced to single spaces throughout the input

     string.  Only one word is returned at a time. Leading and

     trailing spaces for each word/word combo are dropped.

     Usage:

     dbo.fnSplit(stringtosplit,wordtoreturn,delimiter)

     Notes: Don't change the datatype nor the length of

            @Delimiter in the code.  The code hasn't been set

            up to use multicharacter delimiters (yet).

     Revisions:

     Rev 00 - 08/17/2005 - Jeff Moden - Initial Creation/Test

    **********************************************************/

    --=========================================================

    --      Define the input/output parameters

    --=========================================================

    (

    @String    VARCHAR(8000),    --The string to split

    @WordNum   INT = 1,          --Which word to return

    @Delimiter CHAR(1) = ','     --What the delimiter is

    )

    RETURNS VARCHAR(8000)

    AS

    BEGIN

    --=========================================================

    --      Setup and Presets

    --=========================================================

    --===== Determine the max number of characters to process

         -- by counting characters in the string to split

    DECLARE @MaxPosit INT

        SET @MaxPosit = LEN(@String)

    --===== Create a place to store the NameParts

    DECLARE @Words TABLE

      (

      WordNum INT IDENTITY PRIMARY KEY,

      Word    VARCHAR(8000)

)

    --===== This converts multiple spaces into single spaces

         -- no matter how many occur or where in the string

     SELECT @String = REPLACE(@String,'  ',' ')

       FROM dbo.Tally

      WHERE [n] <= @MaxPosit/2+1

    --===== Add a delimiter to the beginning and end to allow

         -- words to be found at each end

        SET @String = @Delimiter+@String+@Delimiter

    --===== Recalc the maximum number of characters for speed

        SET @MaxPosit = LEN(@String)

    --=========================================================

    --      Find the desired word

    --=========================================================

    --===== Parse the NameParts and store them

     INSERT INTO @Words (Word)

     SELECT SUBSTRING(@String,[n]+1,

       CHARINDEX(@Delimiter,@String,[n]+1)-[n]-1

) AS Word

       FROM dbo.Tally

      WHERE [n] >= 1

        AND [n] < @MaxPosit - 1

        AND SUBSTRING(@String,[n],1) = @Delimiter

      ORDER BY [n]

    --===== Return the desired word without any leading or

         -- trailing spaces if they occur

     RETURN (

       SELECT LTRIM(RTRIM(Word))

         FROM @Words

        WHERE WordNum = @WordNum

)

    --=========================================================

    --      End of the Function

    --=========================================================

    END

Most of the above code is setup and cleanup of the inputs... the meat of the function is in the section called "Parse the NameParts and store them". It gets a little inefficient if you have hundreds or thousands of CSV values in the same record because it solves them all even though it only returns one at a time (I'm working on that). But it's still fast. Think of the function as "PARSENAME on steroids" because, like PARSENAME, you have to tell it which word (left to right, in this case) to return. You can even tell it what single-character delimiter you want to use.
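A few example calls may make the PARSENAME analogy concrete (assuming the fnSplit function above is installed as shown):

```sql
-- Pull individual words out of a CSV string by position.
SELECT dbo.fnSplit('cat, dog, mouse', 1, ',')  -- 'cat'
SELECT dbo.fnSplit('cat, dog, mouse', 3, ',')  -- 'mouse'
SELECT dbo.fnSplit('cat, dog, mouse', 4, ',')  -- NULL (there is no 4th word)
```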

    Now that we have that done, we can actually (finally) get to your original problem... I'll let the embedded comments do all the talking...

    --===== If temp table to hold animal lists exists, drop it and rebuild

         IF OBJECT_ID('TempDB..#Animals') IS NOT NULL

            DROP TABLE #Animals

 CREATE TABLE #Animals

     (ID INT IDENTITY(1,1),

     CSVAnimals VARCHAR(200))

    --===== Populate the Animals table with sample CSV data

     -- Notice the abhorrent fashion of the inputs

 INSERT INTO #Animals (CSVAnimals)

     SELECT 'cat, dog, mouse'        UNION ALL

     SELECT ' mouse,dog, fox , hen ' UNION ALL

     SELECT 'hen'                    UNION ALL

     SELECT ' hen '                  UNION ALL

     SELECT 'fox,hen'                UNION ALL

     SELECT 'mongoose, goldfish,cat' UNION ALL

     SELECT 'dog , cat'              UNION ALL

     SELECT 'pussy    cat, big dog'

    --===== All set... let's process... ==================================

--===== Create a temp table to hold each occurrence of each animal

         IF OBJECT_ID('TempDB..#MyTemp') IS NOT NULL

            DROP TABLE #MyTemp

     CREATE TABLE #MyTemp

     (ID INT IDENTITY(1,1),

     Animal VARCHAR(30))

    --===== Can't do 100% set based but we can do a column at a time

     -- where a "column" is the occurrence of an animal in the

         -- original table.  Example, the 'cat, dog, mouse' has

         -- 3 "columns" of data.  So, declare some variables to help

         -- us through that mess.

    DECLARE @ColCounter SMALLINT

    DECLARE @MaxCols    SMALLINT

    --===== Find the max number of columns to process by counting commas

         -- and adding 1

     SELECT @MaxCols = MAX(LEN(CSVAnimals)-LEN(REPLACE(CSVAnimals,',',''))+1)

   FROM #Animals

    --===== Now, split each animal out one column at a time but all rows

         -- in the source at the same time.  The counter keeps track of

         -- which column we're working on.  The example data at the

         -- beginning of this script will cause only 4 passes even

     -- though there are 8 rows of data because there are only 4 "columns".

     SELECT @ColCounter = 1

      WHILE @ColCounter <= @MaxCols

      BEGIN

             INSERT INTO #MyTemp (Animal)

             SELECT dbo.fnSplit(CSVAnimals,@ColCounter,',')

           FROM #Animals

              WHERE dbo.fnSplit(CSVAnimals,@ColCounter,',') IS NOT NULL

     SELECT @ColCounter = @ColCounter+1

        END

    --===== Now, calculate and display the counts for each animal

         -- from the single records we accumulated in the temp table.

     SELECT UPPER(Animal),COUNT(Animal) AS TheCount

       FROM #MyTemp

      GROUP BY Animal

      ORDER BY Animal

And, now, hopefully, you see what everyone was talking about in the previous 30 posts or so... CSV is a pain in the socks and should be avoided at all costs. As so many posts suggested, a normalized child table with one value per row would be much more effective in the long run. Yes, the steps above solve the problem, but look at all of the code you have to deal with... and that's just for one bloody scenario. Yes, the function helps in all scenarios, but it runs HUNDREDS of times slower than true, normalized, set-based processing. And it uses one of the fastest methods available for solving this type of problem (kudos to Adam Machanic and others).
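For contrast, here's a rough sketch of that normalized child table (the names are made up for illustration) and how the whole counting problem collapses once the data is stored one value per row:

```sql
-- One animal per row; ListID ties each row back to its source list.
CREATE TABLE dbo.AnimalList
 (ID     INT IDENTITY(1,1) PRIMARY KEY,
  ListID INT NOT NULL,
  Animal VARCHAR(30) NOT NULL)

-- With that design, the original requirement is a single set-based query:
SELECT UPPER(Animal), COUNT(*) AS TheCount
  FROM dbo.AnimalList
 GROUP BY Animal
 ORDER BY Animal
```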

    Lemme know how the code works out for you, eh?

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Nice work... too bad I won't ever need that.

My post got lost among some solutions that were NOT set-based, so I'll post it again (the WHILE loop was just to generate the numbers table).

    SET BASED solution

declare @v table(val varchar(50))

insert into @v

    SELECT 'dog, mouse,cat' UNION ALL

    SELECT 'mouse, dog, fox, hen'

    declare @t table(nmb int identity(1,1),[dummy] char(1))

    insert into @t ([dummy])

    select top 51 NULL from master.dbo.syscolumns

    select pet,count(*)

    from

    (select ltrim(reverse(ltrim(left(reverse(left(val+',',nmb-1)),

     case charindex(',',reverse(left(val+',',nmb-1)))-1

      when -1 then len(reverse(left(val+',',nmb-1)))

      else charindex(',',reverse(left(val+',',nmb-1)))-1

     end)))) pet

    from @t,@v

    where substring(val+',',nmb,1)=',') a

    group by pet
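The nested REVERSE/CHARINDEX calls above extract the element that ends at each comma by reading the string backwards. A forward-reading formulation of the same derived table (my sketch, not from the original post) may be easier to follow:

```sql
-- Same @t/@v tables as above: each nmb that sits on a delimiter in
-- ','+val marks the start of an element in val.
select ltrim(rtrim(substring(val, nmb,
       charindex(',', val + ',', nmb) - nmb))) as pet
from @t, @v
where nmb <= len(val)
  and substring(',' + val, nmb, 1) = ','
```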


    Kindest Regards,

    Vasc

  • Yeah, I wish I didn't need it!  Wonderful 3rd party solution at work.  Bunch o' DB rookies... these are the same wonderful folks that built an Oracle-Like sequence table (NextID) into their wonderful frigging system because they didn't know how to spell "IDENTITY".  To give you a hint of how bad this 3rd party software is, the NextID table was the source of an average of 620 deadlocks PER DAY when I started work there about 20 months ago.  Now THAT was fun to fix!

    --Jeff Moden



I can't even imagine the nightmare... I'm just hoping that it was not a 24/7 shop and that you had time to apply the updates... not to mention a kick-ass testing/QA system.

Wasn't that lucky... it's 24/7. We get to bounce the server once a week for "impacting changes" for about 30 minutes... then QA gets to verify what we've done. Of course, we test the heck out of the changes on another server before any code even comes close to the production box.

    --Jeff Moden



  • I didn't read this whole post, but when I got to the part "why is this bad design?", I stopped to post a reply.

You are violating a rule called "1st Normal Form":

"To disallow multivalued attributes, composite attributes, and their combinations. Only atomic (simple, indivisible) values."

ISBN: 0321122267
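In table form, the rule reads like this (names invented for illustration):

```sql
-- Violates 1NF: one column holds a whole list.
--   Animals(ID, CSVAnimals)   e.g. (1, 'cat, dog, mouse')

-- Conforms to 1NF: every column value is atomic.
CREATE TABLE dbo.Sighting
 (ID      INT IDENTITY(1,1) PRIMARY KEY,
  OwnerID INT NOT NULL,
  Animal  VARCHAR(30) NOT NULL)  -- exactly one animal per row
```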

     
