May 20, 2010 at 9:03 am
Here's the example that I'm working with:
CREATE TABLE #TestTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'a', 'b', 'c', 'd', 'l')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'a', 'b', 'e', 'f', 'g')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'b', 'b', 'c', 'g', 'l')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'c', 'b', 'c', 'f', 'e')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'd', 'e', 'c', 'i', 'r')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'd', 'f', 'd', 'j', 'q')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'i', 'a', 'g', 'j', 'q')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'e', 'c', 'a', 'j', 'e')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'f', 'e', 'f', 'j', 'q')
SELECT * FROM #TestTable
CREATE TABLE #ResultTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
INSERT INTO #ResultTable
SELECT DISTINCT ID, '', '', '', '', ''
FROM #TestTable
SELECT * FROM #ResultTable
DROP TABLE #TestTable
DROP TABLE #ResultTable
What I'm trying to accomplish is this: let's say I have a table with several columns, keyed by a non-unique identifier. For each identifier, I want to get the most commonly occurring value in each column.
So, from the sample data I provided above, the result in #ResultTable should be:
ID Field1 Field2 Field3 Field4 Field5
1 a b c f l
2 d e ! j q
The ! represents an "unknown". For the purposes of this example, we can just assume any value will suffice.
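As a sanity check outside SQL, the rule described above — per-ID, per-column most common value, with '!' for a tie at the top count — can be sketched roughly like this in Python (function and variable names are illustrative, not part of the actual solution):

```python
from collections import Counter

# Sketch of the rule: for each ID, take the most common value in each
# column; if two values tie for the top count, emit '!' instead.
def most_common_per_column(rows):
    groups = {}
    for rid, *fields in rows:
        groups.setdefault(rid, []).append(fields)
    result = {}
    for rid, recs in groups.items():
        modes = []
        for column in zip(*recs):                  # walk the columns
            ranked = Counter(column).most_common() # sorted by count, desc
            if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
                modes.append('!')                  # tie for the top count
            else:
                modes.append(ranked[0][0])
        result[rid] = modes
    return result

sample = [
    (1, 'a', 'b', 'c', 'd', 'l'), (1, 'a', 'b', 'e', 'f', 'g'),
    (1, 'b', 'b', 'c', 'g', 'l'), (1, 'c', 'b', 'c', 'f', 'e'),
    (2, 'd', 'e', 'c', 'i', 'r'), (2, 'd', 'f', 'd', 'j', 'q'),
    (2, 'i', 'a', 'g', 'j', 'q'), (2, 'e', 'c', 'a', 'j', 'e'),
    (2, 'f', 'e', 'f', 'j', 'q'),
]
print(most_common_per_column(sample))
# → {1: ['a', 'b', 'c', 'f', 'l'], 2: ['d', 'e', '!', 'j', 'q']}
```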
I can't really come up with an efficient way of doing this. The best I can think of is to create a CTE for each field and do something like:
WITH
cte1 AS
(
SELECT ID, Field1, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field1
),
cte2 AS
(
SELECT ID, Field2, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field2
),
cte3 AS
(
SELECT ID, Field3, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field3
),
cte4 AS
(
SELECT ID, Field4, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field4
),
cte5 AS
(
SELECT ID, Field5, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field5
)
UPDATE r
SET r.Field1 = c1.Field1,
r.Field2 = c2.Field2,
r.Field3 = c3.Field3,
r.Field4 = c4.Field4,
r.Field5 = c5.Field5
FROM #ResultTable r
JOIN cte1 c1 ON c1.ID = r.ID AND c1.rowNum = 1
JOIN cte2 c2 ON c2.ID = r.ID AND c2.rowNum = 1
JOIN cte3 c3 ON c3.ID = r.ID AND c3.rowNum = 1
JOIN cte4 c4 ON c4.ID = r.ID AND c4.rowNum = 1
JOIN cte5 c5 ON c5.ID = r.ID AND c5.rowNum = 1
SELECT * FROM #ResultTable
May 20, 2010 at 9:57 am
I would create an aggregate CLR function for this in C#.
May 20, 2010 at 12:04 pm
A CLR would definitely work, but I don't think it's necessary here.
I used a table variable instead of temp table for my own reasons but you can do whatever on your system, of course.
declare @t_temp table
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
INSERT INTO @t_temp(ID, Field1, Field2, Field3, Field4, Field5)
select 1, 'a', 'b', 'c', 'd', 'l'
union select 1, 'a', 'b', 'e', 'f', 'g'
union select 1, 'b', 'b', 'c', 'g', 'l'
union select 1, 'c', 'b', 'c', 'f', 'e'
union select 2, 'd', 'e', 'c', 'i', 'r'
union select 2, 'd', 'f', 'd', 'j', 'q'
union select 2, 'i', 'a', 'g', 'j', 'q'
union select 2, 'e', 'c', 'a', 'j', 'e'
union select 2, 'f', 'e', 'f', 'j', 'q';
This is one way to do what you're trying to accomplish:
with cteTemp (ID, Field, FieldVal, FieldCount)
as
(
select ID,
1 as Field,
Field1 as FieldVal,
COUNT(Field1) as FieldCount
from @t_temp
group by ID, Field1
union
select ID,
2,
Field2,
COUNT(Field2)
from @t_temp
group by ID, Field2
union
select ID,
3,
Field3,
COUNT(Field3)
from @t_temp
group by ID, Field3
union
select ID,
4,
Field4,
COUNT(Field4)
from @t_temp
group by ID, Field4
union
select ID,
5,
Field5,
COUNT(Field5)
from @t_temp
group by ID, Field5
)
select ID,
isnull(max([1]), '!') as Field1,
isnull(max([2]), '!') as Field2,
isnull(max([3]), '!') as Field3,
isnull(max([4]), '!') as Field4,
isnull(max([5]), '!') as Field5
from
(
select ID,
Field,
FieldVal,
FieldCount
from cteTemp c
where not exists
(
select sq.FieldVal
from cteTemp sq
where sq.ID = c.ID
and sq.Field = c.Field
and sq.FieldVal <> c.FieldVal
and sq.FieldCount >= c.FieldCount
)
) as pvt
pivot
(
MIN(FieldVal)
FOR Field IN ([1],[2],[3],[4],[5])
) as PivotTable
group by ID;
There are certainly some improvements you could make, but hopefully you get the idea.
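The NOT EXISTS filter above keeps an (ID, Field, value) row only when no other value for the same ID and field has an equal-or-higher count — the survivors are exactly the unique modes, and '!' falls out of the ISNULL when nothing survives. A rough Python rendering of just that filter (hypothetical names, not from the query):

```python
from collections import namedtuple

Row = namedtuple('Row', 'id field val count')

# Keep a row only if no *other* value for the same (id, field) has an
# equal-or-higher count -- mirrors the NOT EXISTS subquery above.
def unique_modes(counted):
    return [r for r in counted
            if not any(s.id == r.id and s.field == r.field
                       and s.val != r.val and s.count >= r.count
                       for s in counted)]

counted = [Row(1, 3, 'c', 3), Row(1, 3, 'e', 1),   # unique mode: 'c'
           Row(2, 3, 'c', 1), Row(2, 3, 'd', 1)]   # tie: no survivor
print([r.val for r in unique_modes(counted)])      # → ['c']
```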
May 20, 2010 at 12:08 pm
*nod* This one is pretty similar to the one that I listed.
Basically, what I was hoping for was a solution that would not require running N repeated queries against the same set of data. Thinking about it, I can't really come up with any logical way to do what I'm asking without repeated queries, but if someone else could, it would be awesome 😛
May 20, 2010 at 12:34 pm
Both our solutions run N queries against the same table in order to get the most commonly occurring value for each column.
May 20, 2010 at 1:51 pm
kramaswamy (5/20/2010)
Both our solutions run N queries against the same table in order to get the most commonly occurring value for each column.
Maybe I'm not understanding what you're actually trying to do. If there is a static set of 5 columns, this really doesn't matter. If you have a variable number of columns, you may want to look at using dynamic SQL. Independent of that, if it bothers you to have separate SELECT statements, you can use UNPIVOT instead. I don't think it will make a performance difference, however.
May 20, 2010 at 8:11 pm
elutin (5/20/2010)
I would create aggregate CLR function for this in C#.
Cool... let's see it.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 20, 2010 at 10:35 pm
Details are in the comments folks. Heh... ya wanted to get rid of "repeating code", right?
--===== Conditionally drop the test table. It makes reruns easier
-- and leaves a persistent table for easier troubleshooting
-- than a table variable.
IF OBJECT_ID('TempDB..#TestTable','U') IS NOT NULL
DROP TABLE #TestTable
--===== Recreate the test table.
CREATE TABLE #TestTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
;
--===== Populate the test table... I added an extra scenario in ID 3
INSERT INTO #TestTable
(ID, Field1, Field2, Field3, Field4, Field5)
SELECT 1, 'a', 'b', 'c', 'd', 'l' UNION ALL
SELECT 1, 'a', 'b', 'e', 'f', 'g' UNION ALL
SELECT 1, 'b', 'b', 'c', 'g', 'l' UNION ALL
SELECT 1, 'c', 'b', 'c', 'f', 'e' UNION ALL
SELECT 2, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 2, 'd', 'f', 'd', 'j', 'q' UNION ALL
SELECT 2, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 2, 'e', 'c', 'a', 'j', 'e' UNION ALL
SELECT 2, 'f', 'e', 'f', 'j', 'q' UNION ALL
SELECT 3, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 3, 'd', 'f', 'c', 'j', 'q' UNION ALL
SELECT 3, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 3, 'e', 'c', 'g', 'j', 'e' UNION ALL
SELECT 3, 'f', 'e', 'f', 'j', 'q'
;
--===== Solve the problem. This gets rid of 119 logical reads and 8 scans
-- which are caused by a "Triangular Join" in the other methods.
WITH
cteUnpivotAndCount AS
( --=== Unpivot the data and count the occurrences at the same time.
SELECT unpvt.ID, unpvt.Field, unpvt.Value, COUNT(*) AS ValueCount
FROM (SELECT ID, Field1, Field2, Field3, Field4, Field5 FROM #TestTable) src
UNPIVOT (Value FOR Field IN (Field1, Field2, Field3, Field4, Field5)) AS unpvt
GROUP BY unpvt.ID, unpvt.Field, unpvt.Value
)
,
cteTotalItems AS
( --=== Count the number of rows for each ID
SELECT ID, COUNT(*) AS TotalCount
FROM #TestTable
GROUP BY ID
)
,
cteRank AS
( --=== Create "opposing" rank columns
SELECT uac.ID, uac.Field, uac.Value, uac.ValueCount,
ROW_NUMBER() OVER (PARTITION BY uac.ID, uac.Field ORDER BY uac.ValueCount ASC) AS RevRank,
ROW_NUMBER() OVER (PARTITION BY uac.ID, uac.Field ORDER BY uac.ValueCount DESC) AS FwdRank,
ti.TotalCount
FROM cteUnpivotAndCount uac
INNER JOIN cteTotalItems ti
ON uac.ID = ti.ID
) --=== We have enough information for the criteria to work now. A Cross Tab like this is a little faster than PIVOT.
SELECT ID,
MAX(CASE WHEN Field = 'Field1' THEN Value ELSE '!' END) AS Field1,
MAX(CASE WHEN Field = 'Field2' THEN Value ELSE '!' END) AS Field2,
MAX(CASE WHEN Field = 'Field3' THEN Value ELSE '!' END) AS Field3,
MAX(CASE WHEN Field = 'Field4' THEN Value ELSE '!' END) AS Field4,
MAX(CASE WHEN Field = 'Field5' THEN Value ELSE '!' END) AS Field5
FROM cteRank
WHERE (FwdRank = 1 AND ValueCount + RevRank + FwdRank -2 >= TotalCount)
GROUP BY ID
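The single-pass idea — unpivot every wide row into (ID, field, value) triples and count once, instead of running one grouped query per column — can be sketched outside SQL roughly like this (names are illustrative):

```python
from collections import Counter, defaultdict

# One pass over the wide rows: unpivot each row into (id, field_index,
# value) triples and count them, rather than one grouped query per column.
def unpivot_and_count(rows):
    counts = defaultdict(Counter)      # (id, field_index) -> value counts
    for rid, *fields in rows:
        for i, val in enumerate(fields, start=1):
            counts[(rid, i)][val] += 1
    return counts

rows = [(1, 'a', 'b', 'c', 'd', 'l'), (1, 'a', 'b', 'e', 'f', 'g'),
        (1, 'b', 'b', 'c', 'g', 'l'), (1, 'c', 'b', 'c', 'f', 'e')]
print(unpivot_and_count(rows)[(1, 2)])   # → Counter({'b': 4})
```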
--Jeff Moden
Change is inevitable... Change for the better is not.
May 21, 2010 at 9:36 am
And what I wanted was:
select ID
,dbo.MostCommonValue(Field1)
,dbo.MostCommonValue(Field2)
,dbo.MostCommonValue(Field3)
,dbo.MostCommonValue(Field4)
,dbo.MostCommonValue(Field5)
from #t_temp
group by ID
Easy, eh?
Now all that's left is to write the CLR.
OK, it may not be the best implementation, but...
Here we are:
using System;
using System.Collections.Generic;
using System.Collections;
using System.Text;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.IO;
namespace StrAggr
{
[Serializable]
[SqlUserDefinedAggregate(
Format.UserDefined, //use custom serialization to serialize the intermediate result
IsInvariantToNulls = true, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = false, //optimizer property
MaxByteSize = 8000) //maximum size in bytes of persisted value
]
public class MostCommonValue : IBinarySerialize
{
private Dictionary<string, int> vals;
public void Init()
{
this.vals = new Dictionary<string, int>();
}
// If the next value is not null, and not in dictionary already - add it, if it's in dictionary - increase count
public void Accumulate(SqlString value)
{
if (value.IsNull)
{
return;
}
if (this.vals.ContainsKey(value.Value))
{
this.vals[value.Value]++;
}
else
{
this.vals.Add(value.Value, 1);
}
}
//Merges the partial aggregate with this aggregate
public void Merge(MostCommonValue part)
{
foreach (KeyValuePair<string, int> item in part.vals)
{
if (this.vals.ContainsKey(item.Key))
{
this.vals[item.Key] = this.vals[item.Key] + item.Value;
}
else
{
this.vals.Add(item.Key, item.Value);
}
}
}
//Returns the result of the aggregation when finished
public SqlString Terminate()
{
//there are other ways of finding maximum by using List and Sort,
//but SQL Server doesn't like using delegates...
int maxOcc = 0;
string res = "!";
if (this.vals != null)
{
foreach (KeyValuePair<string, int> item in this.vals)
{
if (maxOcc < item.Value)
{
res = item.Key;
maxOcc = item.Value;
}
else if (maxOcc == item.Value)
{
res = "!";
}
}
}
return new SqlString(res);
}
public void Write(BinaryWriter writer)
{
//it would be really helpful if Dictionary were serializable...
//we need to simulate serialization here.
//probably a better implementation is required (I hate having to choose separators...)
StringBuilder sb = new StringBuilder();
foreach (KeyValuePair<string, int> item in this.vals)
{
sb.Append(item.Key);
sb.Append("{:}");
sb.Append(item.Value.ToString());
sb.Append("{-}");
}
writer.Write(sb.ToString());
}
public void Read(BinaryReader reader)
{
this.vals = new Dictionary<string, int>();
string sVals = reader.ReadString();
string [] valKP = sVals.Split(new string[]{"{-}"},StringSplitOptions.None);
foreach (string kp in valKP)
{
string[] valpair = kp.Split(new string[] { "{:}" }, StringSplitOptions.None);
if (valpair.Length==2) this.vals.Add(valpair[0], Int32.Parse(valpair[1]));
}
}
}
}
Compile it into a library and register the assembly:
-- just in case it's disabled
EXEC sp_configure 'clr enabled', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO
--DROP AGGREGATE MostCommonValue
--DROP ASSEMBLY StrAggr
GO
CREATE ASSEMBLY StrAggr FROM 'C:\_temp\SQLDLL\straggr.dll'
GO
CREATE AGGREGATE MostCommonValue (@input nvarchar(4000)) RETURNS nvarchar(4000)
EXTERNAL NAME StrAggr.[StrAggr.MostCommonValue]
GO
Now you can try my first query...
Cheers,
Me
May 22, 2010 at 2:21 am
I have an idea for the Read and Write methods that will not use any separators to simulate serialization of the Dictionary. I will do it on Monday.
Does anyone know the best way to serialize a Dictionary object in C#? I don't need support for a generic Dictionary; string & int KeyValuePairs will be fine.
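One separator-free approach is length-prefixed serialization: write the entry count, then each key's byte length, the key bytes, and the count. Sketched here in Python rather than C# (illustrative names); the same layout should carry over, and BinaryWriter.Write(string) is itself length-prefixed:

```python
import struct

# Length-prefixed layout: [n entries][key len][key bytes][count] ...
# No sentinel separators, so any key content is safe to round-trip.
def serialize(d):
    out = [struct.pack('<i', len(d))]
    for key, count in d.items():
        kb = key.encode('utf-8')
        out.append(struct.pack('<i', len(kb)) + kb + struct.pack('<i', count))
    return b''.join(out)

def deserialize(buf):
    (n,), pos, d = struct.unpack_from('<i', buf), 4, {}
    for _ in range(n):
        (klen,) = struct.unpack_from('<i', buf, pos); pos += 4
        key = buf[pos:pos + klen].decode('utf-8'); pos += klen
        (cnt,) = struct.unpack_from('<i', buf, pos); pos += 4
        d[key] = cnt
    return d

vals = {'a{-}b': 2, 'c': 7}   # keys containing the old separator are fine
assert deserialize(serialize(vals)) == vals
```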
May 25, 2010 at 9:28 am
I like your solution a lot, Jeff; it definitely avoids the need to add a new CTE for every column. The only thing that needs to be added now, for extra columns, is another set of entries in the UNPIVOT list and in the SELECT query.
I've modified it slightly, though. I noticed you had invested a decent amount of code into handling the scenario where no single value has the highest count. That scenario is not actually important to me - I'm fine with choosing any one of them. So, here's my modification of your code:
--===== Conditionally drop the test table. It makes reruns easier
-- and leaves a persistent table for easier troubleshooting
-- than a table variable.
IF OBJECT_ID('TempDB..#TestTable','U') IS NOT NULL
DROP TABLE #TestTable
--===== Recreate the test table.
CREATE TABLE #TestTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
;
--===== Populate the test table... I added an extra scenario in ID 3
INSERT INTO #TestTable (ID, Field1, Field2, Field3, Field4, Field5)
SELECT 1, 'a', 'b', 'c', 'd', 'l' UNION ALL
SELECT 1, 'a', 'b', 'e', 'f', 'g' UNION ALL
SELECT 1, 'b', 'b', 'c', 'g', 'l' UNION ALL
SELECT 1, 'c', 'b', 'c', 'f', 'e' UNION ALL
SELECT 2, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 2, 'd', 'f', 'd', 'j', 'q' UNION ALL
SELECT 2, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 2, 'e', 'c', 'a', 'j', 'e' UNION ALL
SELECT 2, 'f', 'e', 'f', 'j', 'q' UNION ALL
SELECT 3, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 3, 'd', 'f', 'c', 'j', 'q' UNION ALL
SELECT 3, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 3, 'e', 'c', 'g', 'j', 'e' UNION ALL
SELECT 3, 'f', 'e', 'f', 'j', 'q'
;
--===== Solve the problem. This gets rid of 119 logical reads and 8 scans
-- which are caused by a "Triangular Join" in the other methods.
WITH cteUnpivotAndCount AS
( --=== Unpivot the data and count the occurrences at the same time.
SELECT
unpvt.ID,
unpvt.Field,
unpvt.Value,
COUNT(*) AS ValueCount,
ROW_NUMBER() OVER (PARTITION BY unpvt.ID, unpvt.Field ORDER BY COUNT(*) DESC) AS FwdRank
FROM (SELECT ID, Field1, Field2, Field3, Field4, Field5 FROM #TestTable) src
UNPIVOT (Value FOR Field IN (Field1, Field2, Field3, Field4, Field5)) AS unpvt
GROUP BY unpvt.ID, unpvt.Field, unpvt.Value
)
SELECT
ID,
MAX(CASE WHEN Field = 'Field1' THEN Value END) AS Field1,
MAX(CASE WHEN Field = 'Field2' THEN Value END) AS Field2,
MAX(CASE WHEN Field = 'Field3' THEN Value END) AS Field3,
MAX(CASE WHEN Field = 'Field4' THEN Value END) AS Field4,
MAX(CASE WHEN Field = 'Field5' THEN Value END) AS Field5
FROM cteUnpivotAndCount
WHERE FwdRank = 1
GROUP BY ID
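With the tie handling dropped, the equivalent logic becomes very small — roughly this in Python (illustrative names), where most_common(1) plays the role of the FwdRank = 1 filter and an arbitrary member of a tie wins:

```python
from collections import Counter

# Simplified rule: per ID and column, take *a* most common value; ties
# are resolved arbitrarily, matching the FwdRank = 1 filter above.
def any_mode_per_column(rows):
    groups = {}
    for rid, *fields in rows:
        groups.setdefault(rid, []).append(fields)
    return {rid: [Counter(col).most_common(1)[0][0] for col in zip(*recs)]
            for rid, recs in groups.items()}

rows = [(1, 'a', 'b', 'c', 'd', 'l'), (1, 'a', 'b', 'e', 'f', 'g'),
        (1, 'b', 'b', 'c', 'g', 'l'), (1, 'c', 'b', 'c', 'f', 'e')]
print(any_mode_per_column(rows))   # → {1: ['a', 'b', 'c', 'f', 'l']}
```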
May 25, 2010 at 5:26 pm
kramaswamy (5/25/2010)
This scenario is not actually important to me - I'm fine with choosing any one of them.
Heh... now that would have been nice to know up front. 😛 Thanks for the feedback and the code, kramaswamy. I appreciate the time and I'm glad I could help.
--Jeff Moden
Change is inevitable... Change for the better is not.