Count occurrences/frequency in a multi-value column

  • Hi everyone,

    I'm wondering how to count the number of occurrences of each value in a multi-value column. See the example below:

    Table User_Language

    User      | Language
    ----------+----------------------------------------
    Person 1  | English, Spanish, French
    Person 2  | Advanced English, Intermidiate Spanish
    Person 3  | English, Intermidiate English
    Person 4  | Japanese, Fluent English
    Person 5  | Advanced English, Spanish
    Person 6  | Advanced English

    Expected Result

    Language             | Frequency
    ---------------------+----------
    Advanced English     | 3
    English              | 2
    Spanish              | 2
    French               | 1
    Intermidiate Spanish | 1
    Intermidiate English | 1
    Fluent English       | 1
    Japanese             | 1

    I have tried CASE and IF expressions, but I could not achieve the result above, since a CASE expression counts at most once per row.

    I need a SQL query that retrieves this info from the User_Language table.

  • You have to split the values first so that you have a normalized structure you can aggregate over. I used Jeff Moden's DelimitedSplit8K function:

    SELECT lst.Lang2, COUNT(*) AS Freq
    FROM
    (
        SELECT la.SpeakerID
             , Lang2 = TRIM(ds.Item)
        FROM
        (
            SELECT SpeakerID
                 -- strip the proficiency prefixes so only the language name remains
                 , Lang1 = REPLACE(REPLACE(REPLACE(Lang, 'Advanced ', ''), 'Intermidiate', ''), 'Fluent', '')
            FROM #Langs
        ) la
        CROSS APPLY Testdb.dbo.DelimitedSplit8K(la.Lang1, ',') ds
    ) lst
    GROUP BY lst.Lang2;
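
    For reference, here is a minimal setup you could test the snippet against; the #Langs table and its column names are taken from the query above, and the data mirrors the original post (DelimitedSplit8K is assumed to already exist in Testdb):

    CREATE TABLE #Langs
    (
        SpeakerID int NOT NULL,
        Lang varchar(255) NOT NULL
    );

    INSERT INTO #Langs (SpeakerID, Lang)
    VALUES (1, 'English, Spanish, French')
         , (2, 'Advanced English, Intermidiate Spanish')
         , (3, 'English, Intermidiate English')
         , (4, 'Japanese, Fluent English')
         , (5, 'Advanced English, Spanish')
         , (6, 'Advanced English');

    Two things to note: TRIM requires SQL Server 2017 or later (on older versions use LTRIM(RTRIM(...))), and because the REPLACE calls strip the proficiency prefixes, this counts plain language names rather than the "Advanced English" style entries shown in the expected result.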
  • If you have a short list of languages and don't mind a bunch of SUM and CASE expressions, you could get by with those.

    Such as:

    DECLARE @tmpTbl TABLE
    (
        [USERname] VARCHAR(255)
        , [LANG] VARCHAR(255)
    );

    INSERT INTO @tmpTbl
    VALUES ('p1', 'english,spanish,french')
         , ('p2', 'advanced english, intermediate spanish')
         , ('p3', 'english,advanced english');

    SELECT
        [Languages].[Lang]
        , [Languages].[LangCount]
    FROM
    (
        SELECT
            -- count every row that mentions "english", then back out the qualified variants
            (SUM(CASE WHEN [LANG] LIKE '%english%' THEN 1 ELSE 0 END)
             - SUM(CASE WHEN [LANG] LIKE '%Advanced English%' THEN 1 ELSE 0 END)
             - SUM(CASE WHEN [LANG] LIKE '%Intermediate English%' THEN 1 ELSE 0 END)
            ) AS [English]
            , SUM(CASE WHEN [LANG] LIKE '%Advanced English%' THEN 1 ELSE 0 END) AS [Advanced English]
        FROM @tmpTbl
    ) AS [RawData]
    UNPIVOT
    (
        [LangCount] FOR [Lang] IN ([English], [Advanced English])
    ) AS [Languages];

    You would need to expand on that to cover more languages, but it gives you a starting point that doesn't require a string splitter. A HUGE advantage of the string-splitter method presented by pietliden is that it is dynamic, whereas my approach is very manual. What I mean is apparent when you look at the result based on the input data: pietliden's solution returns every language in the original source list, while mine returns only the ones you explicitly look for. With the sample set I used, there are 5 distinct languages, but my result set shows only 2, and you would need to add more SUM and CASE expressions to handle the rest.

    My intent was not to provide a better solution (mine requires more hand-holding as the language list grows), just to provide a different approach.

    Mind you, my preference is not to store the data that way to begin with. In my opinion, it makes more sense to have a languages table and a persons table, and then a table that matches the two up. The way it is now, if you decide to add a new language to a person, you need to check the length of the language list already in place (I'm not sure of the data type you have there), and you may get poor performance on inserts and updates because you may cause page splits. On top of that, the data is not normalized. Lastly, it is not very search-friendly. If you wanted a list of all users who have "English" as a language (for example), your WHERE clause would need something like:

    WHERE [Language] LIKE 'English%'
    OR [Language] LIKE '%, English%'
    OR [Language] LIKE '%,English%'
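
    As an aside, here is a sketch of a less pattern-heavy way to run that same search against the denormalized column, assuming SQL Server 2016+ and the @tmpTbl sample from above; the equality test on the trimmed item matches standalone "english" entries only, just like the three LIKE patterns:

    SELECT t.[USERname]
    FROM @tmpTbl t
    WHERE EXISTS
    (
        SELECT 1
        FROM STRING_SPLIT(t.[LANG], ',') ss
        WHERE LTRIM(RTRIM(ss.value)) = 'english'  -- exact item match, not a substring match
    );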

    Now, had you normalized it and had one table for users, one table for languages and one mapping table, you could do something like:

    SELECT [User].Name
    FROM [User]  -- brackets needed: USER is a T-SQL reserved word
    JOIN User_Language ON [User].ID = User_Language.UserID
    JOIN [Language] ON User_Language.LanguageID = [Language].ID
    WHERE [Language].Name = 'English'

    And for your original query, you could end up with something like:

    SELECT [Language].Name, COUNT(User_Language.LanguageID) AS Frequency
    FROM [Language]
    JOIN User_Language ON [Language].ID = User_Language.LanguageID
    GROUP BY [Language].Name

    With the "User_Language" table you also get the advantage that it is trivial to keep data from being duplicated; see the sketch below.
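
    A minimal sketch of that three-table layout (the column names here are my assumptions, not taken from your system); the compound primary key on the mapping table is what makes duplicate entries impossible:

    CREATE TABLE [User]
    (
        ID int IDENTITY PRIMARY KEY
        , Name varchar(255) NOT NULL
    );

    CREATE TABLE [Language]
    (
        ID int IDENTITY PRIMARY KEY
        , Name varchar(100) NOT NULL
    );

    CREATE TABLE User_Language
    (
        UserID int NOT NULL REFERENCES [User] (ID)
        , LanguageID int NOT NULL REFERENCES [Language] (ID)
        , PRIMARY KEY (UserID, LanguageID)  -- each user can list each language only once
    );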

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • As noted by Brian, your table design is completely wrong.  You've then had to corrupt the data to fit it into the bad table structure.

    For example, the language is "English". There is no language named "Advanced English" or "Intermediate English". Since you need to keep info about the combination of a User and a Language, data modeling rules require that you have a separate table.

    With the proper table structure, you would have something like this:

    CREATE TABLE dbo.User_Languages (
        user_id int NOT NULL ... FOREIGN KEY dbo.users ( user_id ) ...,
        language_id smallint NOT NULL ... FOREIGN KEY dbo.languages ( language_id ) ...,
        language_proficiency_id tinyint NOT NULL ... FOREIGN KEY dbo.language_proficiencies ( language_proficiency_id ) ...,
        /* where, for example, 0=minimal; 1=written, but not conversational; 50=intermediate; 100=advanced; ...; 200=fluent; 240=native speaker */
        ...
    )
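
    Under that model, the original frequency report becomes a plain three-way join. A sketch, assuming the lookup tables carry name columns called language_name and proficiency_name (those names are not specified above):

    SELECT l.language_name
         , p.proficiency_name
         , COUNT(*) AS frequency
    FROM dbo.User_Languages ul
    JOIN dbo.languages l
        ON l.language_id = ul.language_id
    JOIN dbo.language_proficiencies p
        ON p.language_proficiency_id = ul.language_proficiency_id
    GROUP BY l.language_name, p.proficiency_name
    ORDER BY frequency DESC;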


  • My answer, for SQL Server 2016-2019:

    create table dbo.User_Language
    (
        PersonName nvarchar(50) not null
        , PersonLang nvarchar(150) not null
    )
    go

    insert into dbo.User_Language ( PersonName, PersonLang )
    values (N'Person 1', N'English, Spanish, French')
         , (N'Person 2', N'Advanced English, Intermidiate Spanish')
         , (N'Person 3', N'English, Intermidiate English')
         , (N'Person 4', N'Japanese, Fluent English')
         , (N'Person 5', N'Advanced English, Spanish')
         , (N'Person 6', N'Advanced English')
    go

    ; with cte as
    (
        select ul.PersonName, ss.Lang
        from dbo.User_Language as ul
        cross apply
        (
            select ltrim(rtrim(value)) as Lang
            from string_split(ul.PersonLang, N',')
        ) as ss
    )
    select cte.Lang, count(cte.Lang) as CntLang
    from cte
    group by cte.Lang
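
    One caveat: string_split is only available when the database compatibility level is 130 or higher, which you can check with:

    select name, compatibility_level
    from sys.databases
    where name = db_name();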

  • Hi guys, thank you for all the solutions presented and the very good explanations.

    I totally agree with you.

    I'm doing a freelance job and I need to report using such a bad data structure; in other words, neither I nor my client has control over it, since it is a proprietary solution.

    As mentioned by Mr. Brian Gale, pietliden's solution is a very nice approach, but I'll need to check whether it works in IBM Cognos, which is pointed at SQL Server and uses database-native functions. If it doesn't work, Mr. Brian Gale's solution might come in handy, since I can use Cognos native SQL or T-SQL to retrieve the data.

    The good thing about this post is that I can give my client a more solid argument about its data structure and the workarounds we need to implement to resolve the issue.

    Again, thank you very much!

    Best Regards,

    Roger NC

  • You can use the built-in STRING_SPLIT function:

     --==== Sample data
    Declare @testTable Table (UserName varchar(20), UserLanguage varchar(255))
    Insert Into @testTable (UserName, UserLanguage)
    Values ('Person 1', 'English,Spanish,French')
    , ('Person 2', 'Advanced English,Intermediate Spanish')
    , ('Person 3', 'English,Intermediate English')
    , ('Person 4', 'Japanese,Fluent English')
    , ('Person 5', 'Advanced English,Spanish')
    , ('Person 6', 'Advanced English');

    --==== Solution, note: Intermediate vs Intermidiate
    Select UserLanguage = ltrim(rtrim(ss.value)) -- trim(ss.value) --SQL Server 2017 or higher
    , Frequency = count(*)
    From @testTable tt
    Cross Apply string_split(tt.UserLanguage, ',') ss
    Group By
    ss.value
    Order By
    Frequency desc;
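
    One detail to watch if your real data has a space after each comma (the original post's does; the sample above does not): grouping on the raw ss.value would then treat 'Spanish' and ' Spanish' as separate groups. A safe variant, assuming the same sample table, is to group on the trimmed expression itself:

    Select UserLanguage = ltrim(rtrim(ss.value))
    , Frequency = count(*)
    From @testTable tt
    Cross Apply string_split(tt.UserLanguage, ',') ss
    Group By ltrim(rtrim(ss.value))
    Order By Frequency desc;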

     

    Jeffrey Williams
