May 20, 2010 at 9:03 am
Here's the example that I'm working with:
CREATE TABLE #TestTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'a', 'b', 'c', 'd', 'l')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'a', 'b', 'e', 'f', 'g')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'b', 'b', 'c', 'g', 'l')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(1, 'c', 'b', 'c', 'f', 'e')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'd', 'e', 'c', 'i', 'r')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'd', 'f', 'd', 'j', 'q')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'i', 'a', 'g', 'j', 'q')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'e', 'c', 'a', 'j', 'e')
INSERT INTO #TestTable(ID, Field1, Field2, Field3, Field4, Field5)
VALUES(2, 'f', 'e', 'f', 'j', 'q')
SELECT * FROM #TestTable
CREATE TABLE #ResultTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
INSERT INTO #ResultTable
SELECT DISTINCT ID, '', '', '', '', ''
FROM #TestTable
SELECT * FROM #ResultTable
DROP TABLE #TestTable
DROP TABLE #ResultTable
What I'm trying to accomplish is this: let's say I have a table with several columns, keyed by a non-unique identifier. For each identifier, I want to get the most commonly occurring value in each column.
So, from the sample data I provided above, the result in #ResultTable should be:
ID Field1 Field2 Field3 Field4 Field5
1 a b c f l
2 d e ! j q
The ! represents an "unknown". For the purposes of this example, we can just assume any value will suffice.
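As a sanity check outside SQL, the rule described above — per-ID, per-column most common value, with '!' for a tie at the top count — can be sketched roughly like this in Python (function and variable names are illustrative, not part of the actual solution):

```python
from collections import Counter

# Sketch of the rule: for each ID, take the most common value in each
# column; if two values tie for the top count, emit '!' instead.
def most_common_per_column(rows):
    groups = {}
    for rid, *fields in rows:
        groups.setdefault(rid, []).append(fields)
    result = {}
    for rid, recs in groups.items():
        modes = []
        for column in zip(*recs):                  # walk the columns
            ranked = Counter(column).most_common() # sorted by count, desc
            if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
                modes.append('!')                  # tie for the top count
            else:
                modes.append(ranked[0][0])
        result[rid] = modes
    return result

sample = [
    (1, 'a', 'b', 'c', 'd', 'l'), (1, 'a', 'b', 'e', 'f', 'g'),
    (1, 'b', 'b', 'c', 'g', 'l'), (1, 'c', 'b', 'c', 'f', 'e'),
    (2, 'd', 'e', 'c', 'i', 'r'), (2, 'd', 'f', 'd', 'j', 'q'),
    (2, 'i', 'a', 'g', 'j', 'q'), (2, 'e', 'c', 'a', 'j', 'e'),
    (2, 'f', 'e', 'f', 'j', 'q'),
]
print(most_common_per_column(sample))
# → {1: ['a', 'b', 'c', 'f', 'l'], 2: ['d', 'e', '!', 'j', 'q']}
```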
I can't really come up with an efficient way of doing this. The best I can think of is to create a CTE for each field and do something like:
WITH
cte1 AS
(
SELECT ID, Field1, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field1
),
cte2 AS
(
SELECT ID, Field2, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field2
),
cte3 AS
(
SELECT ID, Field3, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field3
),
cte4 AS
(
SELECT ID, Field4, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field4
),
cte5 AS
(
SELECT ID, Field5, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY COUNT(*) DESC) AS rowNum
FROM #TestTable
GROUP BY ID, Field5
)
UPDATE r
SET r.Field1 = c1.Field1,
r.Field2 = c2.Field2,
r.Field3 = c3.Field3,
r.Field4 = c4.Field4,
r.Field5 = c5.Field5
FROM #ResultTable r
JOIN cte1 c1 ON c1.ID = r.ID AND c1.rowNum = 1
JOIN cte2 c2 ON c2.ID = r.ID AND c2.rowNum = 1
JOIN cte3 c3 ON c3.ID = r.ID AND c3.rowNum = 1
JOIN cte4 c4 ON c4.ID = r.ID AND c4.rowNum = 1
JOIN cte5 c5 ON c5.ID = r.ID AND c5.rowNum = 1
SELECT * FROM #ResultTable
May 20, 2010 at 9:57 am
I would create an aggregate CLR function for this in C#.
May 20, 2010 at 12:04 pm
A CLR would definitely work, but I don't think it's necessary here.
I used a table variable instead of temp table for my own reasons but you can do whatever on your system, of course.
declare @t_temp table
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
INSERT INTO @t_temp(ID, Field1, Field2, Field3, Field4, Field5)
select 1, 'a', 'b', 'c', 'd', 'l'
union select 1, 'a', 'b', 'e', 'f', 'g'
union select 1, 'b', 'b', 'c', 'g', 'l'
union select 1, 'c', 'b', 'c', 'f', 'e'
union select 2, 'd', 'e', 'c', 'i', 'r'
union select 2, 'd', 'f', 'd', 'j', 'q'
union select 2, 'i', 'a', 'g', 'j', 'q'
union select 2, 'e', 'c', 'a', 'j', 'e'
union select 2, 'f', 'e', 'f', 'j', 'q';
This is one way to do what you're trying to accomplish:
with cteTemp (ID, Field, FieldVal, FieldCount)
as
(
select ID,
1 as Field,
Field1 as FieldVal,
COUNT(Field1) as FieldCount
from @t_temp
group by ID, Field1
union
select ID,
2,
Field2,
COUNT(Field2)
from @t_temp
group by ID, Field2
union
select ID,
3,
Field3,
COUNT(Field3)
from @t_temp
group by ID, Field3
union
select ID,
4,
Field4,
COUNT(Field4)
from @t_temp
group by ID, Field4
union
select ID,
5,
Field5,
COUNT(Field5)
from @t_temp
group by ID, Field5
)
select ID,
isnull(max([1]), '!') as Field1,
isnull(max([2]), '!') as Field2,
isnull(max([3]), '!') as Field3,
isnull(max([4]), '!') as Field4,
isnull(max([5]), '!') as Field5
from
(
select ID,
Field,
FieldVal,
FieldCount
from cteTemp c
where not exists
(
select sq.FieldVal
from cteTemp sq
where sq.ID = c.ID
and sq.Field = c.Field
and sq.FieldVal <> c.FieldVal
and sq.FieldCount >= c.FieldCount
)
) as pvt
pivot
(
MIN(FieldVal)
FOR Field IN ([1],[2],[3],[4],[5])
) as PivotTable
group by ID;
There are certainly some improvements you could make, but hopefully you get the idea.
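The NOT EXISTS filter above keeps an (ID, Field, value) row only when no other value for the same ID and field has an equal-or-higher count — the survivors are exactly the unique modes, and '!' falls out of the ISNULL when nothing survives. A rough Python rendering of just that filter (hypothetical names, not from the query):

```python
from collections import namedtuple

Row = namedtuple('Row', 'id field val count')

# Keep a row only if no *other* value for the same (id, field) has an
# equal-or-higher count -- mirrors the NOT EXISTS subquery above.
def unique_modes(counted):
    return [r for r in counted
            if not any(s.id == r.id and s.field == r.field
                       and s.val != r.val and s.count >= r.count
                       for s in counted)]

counted = [Row(1, 3, 'c', 3), Row(1, 3, 'e', 1),   # unique mode: 'c'
           Row(2, 3, 'c', 1), Row(2, 3, 'd', 1)]   # tie: no survivor
print([r.val for r in unique_modes(counted)])      # → ['c']
```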
May 20, 2010 at 12:08 pm
*nod* This one is pretty similar to the one that I listed.
Basically, what I was hoping for was a solution that would not require running N repeated queries against the same set of data. Thinking about it, I can't really come up with any logical way to do what I'm asking without repeated queries, but if someone else could, it would be awesome 😛
May 20, 2010 at 12:34 pm
Both our solutions run N queries against the same table in order to get the most commonly occurring value for each column.
May 20, 2010 at 1:51 pm
kramaswamy (5/20/2010)
Both our solutions run N queries against the same table in order to get the most commonly occurring value for each column.
Maybe I'm not understanding what you're actually trying to do. If there is a static set of 5 columns, this really doesn't matter. If you have a variable number of columns, you may want to look at using dynamic SQL. Independent of that, if it bothers you to have separate SELECT statements, you can use UNPIVOT instead. I don't think it will make a performance difference, however.
May 20, 2010 at 8:11 pm
elutin (5/20/2010)
I would create aggregate CLR function for this in C#.
Cool... let's see it.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 20, 2010 at 10:35 pm
Details are in the comments folks. Heh... ya wanted to get rid of "repeating code", right?
--===== Conditionally drop the test table. It makes reruns easier
-- and leaves a persistent table for easier troubleshooting
-- than a table variable.
IF OBJECT_ID('TempDB..#TestTable','U') IS NOT NULL
DROP TABLE #TestTable
--===== Recreate the test table.
CREATE TABLE #TestTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
;
--===== Populate the test table... I added an extra scenario in ID 3
INSERT INTO #TestTable
(ID, Field1, Field2, Field3, Field4, Field5)
SELECT 1, 'a', 'b', 'c', 'd', 'l' UNION ALL
SELECT 1, 'a', 'b', 'e', 'f', 'g' UNION ALL
SELECT 1, 'b', 'b', 'c', 'g', 'l' UNION ALL
SELECT 1, 'c', 'b', 'c', 'f', 'e' UNION ALL
SELECT 2, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 2, 'd', 'f', 'd', 'j', 'q' UNION ALL
SELECT 2, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 2, 'e', 'c', 'a', 'j', 'e' UNION ALL
SELECT 2, 'f', 'e', 'f', 'j', 'q' UNION ALL
SELECT 3, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 3, 'd', 'f', 'c', 'j', 'q' UNION ALL
SELECT 3, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 3, 'e', 'c', 'g', 'j', 'e' UNION ALL
SELECT 3, 'f', 'e', 'f', 'j', 'q'
;
--===== Solve the problem. This gets rid of 119 logical reads and 8 scans
-- which are caused by a "Triangular Join" in the other methods.
WITH
cteUnpivotAndCount AS
( --=== Unpivot the data and count the occurrences at the same time.
SELECT unpvt.ID, unpvt.Field, unpvt.Value, COUNT(*) AS ValueCount
FROM (SELECT ID, Field1, Field2, Field3, Field4, Field5 FROM #TestTable) src
UNPIVOT (Value FOR Field IN (Field1, Field2, Field3, Field4, Field5)) AS unpvt
GROUP BY unpvt.ID, unpvt.Field, unpvt.Value
)
,
cteTotalItems AS
( --=== Count the number of rows for each ID
SELECT ID, COUNT(*) AS TotalCount
FROM #TestTable
GROUP BY ID
)
,
cteRank AS
( --=== Create "opposing" rank columns
SELECT uac.ID, uac.Field, uac.Value, uac.ValueCount,
ROW_NUMBER() OVER (PARTITION BY uac.ID, uac.Field ORDER BY uac.ValueCount ASC) AS RevRank,
ROW_NUMBER() OVER (PARTITION BY uac.ID, uac.Field ORDER BY uac.ValueCount DESC) AS FwdRank,
ti.TotalCount
FROM cteUnpivotAndCount uac
INNER JOIN cteTotalItems ti
ON uac.ID = ti.ID
) --=== We have enough information for the criteria to work now. A Cross Tab like this is a little faster than PIVOT.
SELECT ID,
MAX(CASE WHEN Field = 'Field1' THEN Value ELSE '!' END) AS Field1,
MAX(CASE WHEN Field = 'Field2' THEN Value ELSE '!' END) AS Field2,
MAX(CASE WHEN Field = 'Field3' THEN Value ELSE '!' END) AS Field3,
MAX(CASE WHEN Field = 'Field4' THEN Value ELSE '!' END) AS Field4,
MAX(CASE WHEN Field = 'Field5' THEN Value ELSE '!' END) AS Field5
FROM cteRank
WHERE (FwdRank = 1 AND ValueCount + RevRank + FwdRank -2 >= TotalCount)
GROUP BY ID
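The single-pass idea — unpivot every wide row into (ID, field, value) triples and count once, instead of running one grouped query per column — can be sketched outside SQL roughly like this (names are illustrative):

```python
from collections import Counter, defaultdict

# One pass over the wide rows: unpivot each row into (id, field_index,
# value) triples and count them, rather than one grouped query per column.
def unpivot_and_count(rows):
    counts = defaultdict(Counter)      # (id, field_index) -> value counts
    for rid, *fields in rows:
        for i, val in enumerate(fields, start=1):
            counts[(rid, i)][val] += 1
    return counts

rows = [(1, 'a', 'b', 'c', 'd', 'l'), (1, 'a', 'b', 'e', 'f', 'g'),
        (1, 'b', 'b', 'c', 'g', 'l'), (1, 'c', 'b', 'c', 'f', 'e')]
print(unpivot_and_count(rows)[(1, 2)])   # → Counter({'b': 4})
```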
--Jeff Moden
Change is inevitable... Change for the better is not.
May 21, 2010 at 9:36 am
And what I wanted was:
select ID
,dbo.MostCommonValue(Field1)
,dbo.MostCommonValue(Field2)
,dbo.MostCommonValue(Field3)
,dbo.MostCommonValue(Field4)
,dbo.MostCommonValue(Field5)
from #t_temp
group by ID
Easy, eh?
Now all that's left is to write the CLR.
OK, it may not be the best implementation, but...
Here we are:
using System;
using System.Collections.Generic;
using System.Collections;
using System.Text;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
using System.IO;
namespace StrAggr
{
[Serializable]
[SqlUserDefinedAggregate(
Format.UserDefined, //use custom serialization to serialize the intermediate result
IsInvariantToNulls = true, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = false, //optimizer property
MaxByteSize = 8000) //maximum size in bytes of persisted value
]
public class MostCommonValue : IBinarySerialize
{
private Dictionary<string, int> vals;
public void Init()
{
this.vals = new Dictionary<string, int>();
}
// If the next value is not null, and not in dictionary already - add it, if it's in dictionary - increase count
public void Accumulate(SqlString value)
{
if (value.IsNull)
{
return;
}
if (this.vals.ContainsKey(value.Value))
{
this.vals[value.Value]++;
}
else
{
this.vals.Add(value.Value, 1);
}
}
//Merges the partial aggregate with this aggregate
public void Merge(MostCommonValue part)
{
foreach (KeyValuePair<string, int> item in part.vals)
{
if (this.vals.ContainsKey(item.Key))
{
this.vals[item.Key] = this.vals[item.Key] + item.Value;
}
else
{
this.vals.Add(item.Key, item.Value);
}
}
}
//Returns the result of the aggregation when finished
public SqlString Terminate()
{
//there are other ways of finding maximum by using List and Sort,
//but SQL Server doesn't like using delegates...
int maxOcc = 0;
string res = "!";
if (this.vals != null)
{
foreach (KeyValuePair<string, int> item in this.vals)
{
if (maxOcc < item.Value)
{
res = item.Key;
maxOcc = item.Value;
}
else if (maxOcc == item.Value)
{
res = "!";
}
}
}
return new SqlString(res);
}
public void Write(BinaryWriter writer)
{
//it would be really helpful if Dictionary were serializable...
//we need to simulate serialization here.
//probably a better implementation is required (I hate having to choose separators...)
StringBuilder sb = new StringBuilder();
foreach (KeyValuePair<string, int> item in this.vals)
{
sb.Append(item.Key);
sb.Append("{:}");
sb.Append(item.Value.ToString());
sb.Append("{-}");
}
writer.Write(sb.ToString());
}
public void Read(BinaryReader reader)
{
this.vals = new Dictionary<string, int>();
string sVals = reader.ReadString();
string [] valKP = sVals.Split(new string[]{"{-}"},StringSplitOptions.None);
foreach (string kp in valKP)
{
string[] valpair = kp.Split(new string[] { "{:}" }, StringSplitOptions.None);
if (valpair.Length==2) this.vals.Add(valpair[0], Int32.Parse(valpair[1]));
}
}
}
}
Compile it into a library and register the assembly:
-- just in case it's disabled
EXEC sp_configure 'clr enabled', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO
--DROP AGGREGATE MostCommonValue
--DROP ASSEMBLY StrAggr
GO
CREATE ASSEMBLY StrAggr FROM 'C:\_temp\SQLDLL\straggr.dll'
GO
CREATE AGGREGATE MostCommonValue (@input nvarchar(4000)) RETURNS nvarchar(4000)
EXTERNAL NAME StrAggr.[StrAggr.MostCommonValue]
GO
Now you can try my first query...
Cheers,
Me
May 22, 2010 at 2:21 am
I have an idea for the Read and Write methods that will not use any separators to simulate serialization of the Dictionary. I will do it on Monday.
Does anyone know the best way to serialize a Dictionary object in C#? I don't need support for a generic Dictionary; string & int KeyValuePairs will be fine.
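One separator-free approach is length-prefixed serialization: write the entry count, then each key's byte length, the key bytes, and the count. Sketched here in Python rather than C# (illustrative names); the same layout should carry over, and BinaryWriter.Write(string) is itself length-prefixed:

```python
import struct

# Length-prefixed layout: [n entries][key len][key bytes][count] ...
# No sentinel separators, so any key content is safe to round-trip.
def serialize(d):
    out = [struct.pack('<i', len(d))]
    for key, count in d.items():
        kb = key.encode('utf-8')
        out.append(struct.pack('<i', len(kb)) + kb + struct.pack('<i', count))
    return b''.join(out)

def deserialize(buf):
    (n,), pos, d = struct.unpack_from('<i', buf), 4, {}
    for _ in range(n):
        (klen,) = struct.unpack_from('<i', buf, pos); pos += 4
        key = buf[pos:pos + klen].decode('utf-8'); pos += klen
        (cnt,) = struct.unpack_from('<i', buf, pos); pos += 4
        d[key] = cnt
    return d

vals = {'a{-}b': 2, 'c': 7}   # keys containing the old separator are fine
assert deserialize(serialize(vals)) == vals
```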
May 25, 2010 at 9:28 am
I like your solution a lot, Jeff; it definitely avoids the need to add a new CTE for every column. The only thing that needs to be added now, for extra columns, is another set of entries in the UNPIVOT list and in the SELECT query.
I've modified it slightly, though. I noticed you had invested a decent amount of code into handling the scenario where no single value has the highest count. That scenario is not actually important to me - I'm fine with choosing any one of them. So, here's my modification of your code:
--===== Conditionally drop the test table. It makes reruns easier
-- and leaves a persistent table for easier troubleshooting
-- than a table variable.
IF OBJECT_ID('TempDB..#TestTable','U') IS NOT NULL
DROP TABLE #TestTable
--===== Recreate the test table.
CREATE TABLE #TestTable
(
ID INT,
Field1 VARCHAR(MAX),
Field2 VARCHAR(MAX),
Field3 VARCHAR(MAX),
Field4 VARCHAR(MAX),
Field5 VARCHAR(MAX)
)
;
--===== Populate the test table... I added an extra scenario in ID 3
INSERT INTO #TestTable (ID, Field1, Field2, Field3, Field4, Field5)
SELECT 1, 'a', 'b', 'c', 'd', 'l' UNION ALL
SELECT 1, 'a', 'b', 'e', 'f', 'g' UNION ALL
SELECT 1, 'b', 'b', 'c', 'g', 'l' UNION ALL
SELECT 1, 'c', 'b', 'c', 'f', 'e' UNION ALL
SELECT 2, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 2, 'd', 'f', 'd', 'j', 'q' UNION ALL
SELECT 2, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 2, 'e', 'c', 'a', 'j', 'e' UNION ALL
SELECT 2, 'f', 'e', 'f', 'j', 'q' UNION ALL
SELECT 3, 'd', 'e', 'c', 'i', 'r' UNION ALL
SELECT 3, 'd', 'f', 'c', 'j', 'q' UNION ALL
SELECT 3, 'i', 'a', 'g', 'j', 'q' UNION ALL
SELECT 3, 'e', 'c', 'g', 'j', 'e' UNION ALL
SELECT 3, 'f', 'e', 'f', 'j', 'q'
;
--===== Solve the problem. This gets rid of 119 logical reads and 8 scans
-- which are caused by a "Triangular Join" in the other methods.
WITH cteUnpivotAndCount AS
( --=== Unpivot the data and count the occurrences at the same time.
SELECT
unpvt.ID,
unpvt.Field,
unpvt.Value,
COUNT(*) AS ValueCount,
ROW_NUMBER() OVER (PARTITION BY unpvt.ID, unpvt.Field ORDER BY COUNT(*) DESC) AS FwdRank
FROM (SELECT ID, Field1, Field2, Field3, Field4, Field5 FROM #TestTable) src
UNPIVOT (Value FOR Field IN (Field1, Field2, Field3, Field4, Field5)) AS unpvt
GROUP BY unpvt.ID, unpvt.Field, unpvt.Value
)
SELECT
ID,
MAX(CASE WHEN Field = 'Field1' THEN Value END) AS Field1,
MAX(CASE WHEN Field = 'Field2' THEN Value END) AS Field2,
MAX(CASE WHEN Field = 'Field3' THEN Value END) AS Field3,
MAX(CASE WHEN Field = 'Field4' THEN Value END) AS Field4,
MAX(CASE WHEN Field = 'Field5' THEN Value END) AS Field5
FROM cteUnpivotAndCount
WHERE FwdRank = 1
GROUP BY ID
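With the tie handling dropped, the equivalent logic becomes very small — roughly this in Python (illustrative names), where most_common(1) plays the role of the FwdRank = 1 filter and an arbitrary member of a tie wins:

```python
from collections import Counter

# Simplified rule: per ID and column, take *a* most common value; ties
# are resolved arbitrarily, matching the FwdRank = 1 filter above.
def any_mode_per_column(rows):
    groups = {}
    for rid, *fields in rows:
        groups.setdefault(rid, []).append(fields)
    return {rid: [Counter(col).most_common(1)[0][0] for col in zip(*recs)]
            for rid, recs in groups.items()}

rows = [(1, 'a', 'b', 'c', 'd', 'l'), (1, 'a', 'b', 'e', 'f', 'g'),
        (1, 'b', 'b', 'c', 'g', 'l'), (1, 'c', 'b', 'c', 'f', 'e')]
print(any_mode_per_column(rows))   # → {1: ['a', 'b', 'c', 'f', 'l']}
```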
May 25, 2010 at 5:26 pm
kramaswamy (5/25/2010)
This scenario is not actually important to me - I'm fine with choosing any one of them.
Heh... now that would have been nice to know up front. 😛 Thanks for the feedback and the code, kramaswamy. I appreciate the time and I'm glad I could help.
--Jeff Moden
Change is inevitable... Change for the better is not.