Splitting and aggregating a field with commas

  • I have a table with a field that contains comma-delimited data.  Here is a sample of two rows and the values of that field in each:

    cat, dog, mouse

    mouse, dog, fox, hen

    I need to get a list of each value (animal) and a count for each as well.

     

    Desired result

    cat 1

    dog 2

    mouse 2

    fox 1

    hen 1

     

    The table will contain at most 500 rows, and there are at most 10 possible values.

    Thanks!

    Paully

     

  • I guess this might not be the answer you want, but you should really change the design. That is just not proper design, and you will spend much more time writing queries like the one you are requesting than it would take to fix the design once.
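
    For illustration only (the table and column names here are assumptions, not from the original post), the normalized design would store one row per record per animal instead of one comma-delimited string:

    -- Hypothetical normalized design: one row per (record, animal) pair.
    CREATE TABLE RecordAnimal( RecordID integer NOT NULL,   -- points back to the original row
                               Animal varchar(50) NOT NULL)

    -- The two sample rows from the question would then be stored as:
    INSERT INTO RecordAnimal( RecordID, Animal) VALUES( 1, 'cat')
    INSERT INTO RecordAnimal( RecordID, Animal) VALUES( 1, 'dog')
    INSERT INTO RecordAnimal( RecordID, Animal) VALUES( 1, 'mouse')
    INSERT INTO RecordAnimal( RecordID, Animal) VALUES( 2, 'mouse')
    INSERT INTO RecordAnimal( RecordID, Animal) VALUES( 2, 'dog')
    INSERT INTO RecordAnimal( RecordID, Animal) VALUES( 2, 'fox')
    INSERT INTO RecordAnimal( RecordID, Animal) VALUES( 2, 'hen')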

  • I'd do it like this:

    1. Read your original SQL table into a DataSet, then into a Table object (ADO.NET).

    2. Use a RegEx object to match all occurrences of your targeted strings ('dog', etc.) in the Table object (VB.NET); create a Match collection (an array).

    3. Output the array as a table back to SQL Server; add an identity field to the table (ADO.NET).

    4. Group all the matches in the table to see how many of each there are (SQL Server). A rough sketch of steps 3 and 4 follows this list.
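
    Steps 1 and 2 would live in VB.NET and are not shown here; the following is only a minimal sketch of what steps 3 and 4 might look like on the SQL Server side, assuming the regex matches have already been written to a hypothetical staging table called #Matches:

    -- Hypothetical staging table: one row per regex match, with an identity field (step 3).
    CREATE TABLE #Matches( MatchID integer IDENTITY(1,1),
                           Word varchar(50) NOT NULL)

    -- Step 4: group the matches to see how many of each there are.
    SELECT Word, COUNT( *) AS WordCount
    FROM #Matches
    GROUP BY Word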

  • Why not just correct the design on the server one time and then do a simple aggregate query?
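
    Against a normalized layout like the RecordAnimal sketch above (hypothetical names), that aggregate query is a one-liner:

    -- Simple aggregate against the hypothetical RecordAnimal table.
    SELECT Animal, COUNT( *) AS AnimalCount
    FROM RecordAnimal
    GROUP BY Animal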

  • Here is one attempt:  [Note, I have inserted the values you provided - you will want to insert your own values from your table....] 

    [Remi is correct.  You may want to use this code to correct the design and put the values in a physical table rather than the #TempTables I create...]

    SET NOCOUNT ON

    DECLARE @ListID integer,
            @PipeList varchar(255),
            @Delimiter char(1),
            @LeftDelimiter smallint,
            @Word varchar(50),
            @List varchar(255)

    -- one row per original comma-delimited list
    CREATE TABLE #Parse( ListID integer IDENTITY(1,1),
                         List varchar(255))

    -- you would actually insert all of your values from your table here...
    --INSERT INTO #Parse SELECT * FROM #Table
    INSERT INTO #Parse( List) VALUES( 'cat, dog, mouse')
    INSERT INTO #Parse( List) VALUES( 'mouse, dog, fox, hen')

    -- one row per parsed word
    CREATE TABLE #List( Word varchar(255) NOT NULL)

    -- walk each row of #Parse
    SELECT @ListID = (SELECT MIN( ListID) FROM #Parse)
    WHILE @ListID <= (SELECT MAX( ListID) FROM #Parse)
         BEGIN
              SELECT @PipeList = (SELECT List FROM #Parse WHERE ListID = @ListID)
              -- wrap the list in delimiters so every word sits between two commas
              SELECT @LeftDelimiter = 1,
                     @Delimiter = ',',
                     @List = @Delimiter + @PipeList + @Delimiter
              -- walk each delimited word within the row
              WHILE CHARINDEX( @Delimiter, @List, @LeftDelimiter + 1) > 0
                   BEGIN
                        SELECT @Word = SUBSTRING( @List, @LeftDelimiter + 1, CHARINDEX( @Delimiter, @List, @LeftDelimiter + 1) - ( @LeftDelimiter + 1))
                        IF LTRIM( RTRIM( @Word)) <> ''
                             BEGIN
                                  INSERT INTO #List( Word) VALUES( @Word)
                             END
                        SELECT @LeftDelimiter = CHARINDEX( @Delimiter, @List, @LeftDelimiter + 1)
                   END
              SELECT @ListID = (SELECT MIN( ListID) FROM #Parse WHERE ListID > @ListID)
         END

    -- count each word
    SELECT Word, COUNT( Word)
    FROM #List
    GROUP BY Word

    DROP TABLE #Parse
    DROP TABLE #List

    I wasn't born stupid - I had to study.

  • Q to Farrell:

    Please explain what will happen if the delimiter pattern changes from ', ' to, say, ' , ' in some parts of the text. Will your code still work OK?

    Thanks

  • I did find a problem.  The insert line should read: 

    INSERT INTO #List( Word) VALUES( RTRIM( LTRIM( @Word)))

    You do need to trim leading and trailing spaces, as you correctly point out...
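
    As a quick illustration (not from the original post), the trim also covers the case above where the delimiter pattern drifts to ' , ':

    -- LTRIM/RTRIM strips the stray spaces a ' , ' delimiter leaves behind,
    -- so ' fox ' and 'fox' both end up stored as 'fox'.
    SELECT RTRIM( LTRIM( ' fox ')) AS Word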

    I wasn't born stupid - I had to study.

  • No, he needs to correct the design... this is just the beginning of his problems.

    And yes, Farrell, this is one problem where no set-based solution exists.

  • Thanks!  I do agree, this should be redesigned.  But my code may help do that by inserting into a physical table rather than a #TempTable.  The rest is, as you said, a simple select, count, and group...

    I wasn't born stupid - I had to study.

  • Okay...I thought of the redesign too, but the situation warrants this design. Why is it that everyone always assumes there is no place for this design?

    I got the solution: I select a list of possible values and then select the count for each of those values, all in ADO.NET.  It is a procedure that will only be executed once a day for about a week.

    The reason the design is required is complicated, but it is part of a survey generating/tabulating application.  The possible question types are radio button, textbox, dropdown list, and checkboxes.  The checkboxes cause the disdain, but as I said, it really isn't a high-production application, so this solution is fine.

    Thanks anyway,

    Paul
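
    A rough sketch of that per-value count, expressed here in T-SQL rather than ADO.NET and using hypothetical names (a Responses table with a comma-delimited Answer column); one query like this per possible value, and there are at most 10 of them:

    -- Hypothetical table/column names.  Wrapping both sides in commas (and stripping
    -- spaces) avoids partial matches, e.g. 'hen' matching 'henhouse'.
    SELECT COUNT( *) AS DogCount
    FROM Responses
    WHERE ',' + REPLACE( Answer, ' ', '') + ',' LIKE '%,dog,%'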

  • ADO.NET rocks

  • Because these problems arise more often than you would think.  Once something like this gets in place, it can creep in and become part of the final design.  Then you have to jump through all these hoops to get simple questions answered.

    I have a number of legacy systems that have these kinds of problems.  It is a real booger to change something that other systems depend upon.  And when someone new comes in to try and read and understand the code - phewey!! 

    I wasn't born stupid - I had to study.

  • Let me put it this way: if this had been designed correctly in the first place, you wouldn't even have had to come here for help, and you'd have been done a few hours ago. Even if this is not critical to your company now, it may well become so someday, and then you're screwed.

    Denormalization is a no-no unless you're using a reporting perm table that holds data for a few minutes/hours at a time. Other than that, you're better off banishing that design style from your repertoire.

  • What about a situation where the data is coming from outside (say, as some lousy Excel file) and you just need to run some quick stats on it (which seems to be the case here)?  Does that warrant down-to-bare-bones normalization and redesign of the data, if all you need is 50-something lines of code to process it?

    Thanks,

    Sergei

  • The problem is that the data is coming from all directions and it's never of the same structure. I'm talking about text processing.

