Calculate number of groups and group size with multiple criteria

Question

Post reply

Calculate number of groups and group size with multiple criteria

helal.mobasher 13209

SSCommitted

Points: 1526
More actions
June 15, 2015 at 5:52 pm

#309109

My data looks like this:
I need to calculate the last two columns (noofgrp and grpsize) No of Groups (count of Clientid) and Group Size (number of clients in each group) according to begtim and endtime. So I tried the following in the first Temp table
GrpSize= count(clientid) over (partition by begtime,endtime) else 0 end
and in the second Temp Table, I have
select
,GrpSize=sum(grpsize)
,NoofGrp=count(distinct grpsize)
From Temp1
The issue is for the date of 5/26, the begtime and endtime are not consistent. in Grp1 (group 1) all clients starts the session at 1030 and ends at 1200 (90 minutes session) except one who starts at 11 and end at 1200 (row 8). For this client since his/her endtime is the same as others, I want that client to be in the first group(Grp1). Reverse is true for the second group (Grp2). All clients begtime is 12:30 and endtime is 1400 but clientid=2 (row 9) who begtime =1230 but endtime = 1300. However, since this client begtime is the same as the rest, I wan that client to be in the second group (grp2) My partition over creates 4 groups rather than two. Any suggestions?
Thank You,
Helal

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

pietlinden SSC Guru Points: 63519 More actions · Answer 1

Please post consumable data -- in the form of CREATE TABLE/INSERT statements or a SELECT statement with UNIONS...

http://www.sqlservercentral.com/articles/Best+Practices/61537/

Dohsan SSCrazy Points: 2972 More actions · Answer 2

pietlinden (6/15/2015)

Please post consumable data -- in the form of CREATE TABLE/INSERT statements or a SELECT statement with UNIONS...

http://www.sqlservercentral.com/articles/Best+Practices/61537/%5B/quote%5D

I've generated some test data based on your attachment for people to use:

USE tempdb;

DECLARE @GroupTestData TABLE

(

DT DATETIME NOT NULL,

ClientID INT NOT NULL,

BegTime TIME NOT NULL,

EndTime TIME NOT NULL,

Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

);

INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

VALUES('20150506',1,'10:30','12:00'),

('20150506',2,'10:30','12:00'),

('20150506',4,'10:30','12:00'),

('20150506',5,'10:30','12:00'),

('20150506',6,'10:30','12:00'),

('20150506',8,'10:30','12:00'),

('20150506',10,'11:00','12:00'),

('20150506',2,'12:30','13:00'),

('20150506',3,'12:30','14:00'),

('20150506',7,'12:30','14:00'),

('20150506',9,'12:30','14:00'),

('20150507',10,'12:30','14:00');

One question I have, you've included client 10 from 20150507 in "Group2" even though its on a different day. Is there a business reason for this or just a typo?

Dohsan SSCrazy Points: 2972 More actions · Answer 3

Sure it can be done more efficiently than this, but this appears to work:

WITH AddRank

AS

(

SELECTGT.DT,

GT.ClientID,

GT.BegTime,

GT.EndTime,

GT.Duration,

--Remove Partition by DT if you want times grouped regardless of day)

BegRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.BegTime),

EndRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.EndTime)

FROM@GroupTestData AS GT

),

AddCount

AS

(

SELECTAR.DT,

AR.ClientID,

AR.BegTime,

AR.EndTime,

AR.Duration,

AR.BegRank,

AR.EndRank,

BegCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.BegRank),

EndCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.EndRank)

FROMAddRank AS AR

),

DeriveGroupCount

AS

(

SELECTAT.DT,

AT.ClientID,

AT.BegTime,

AT.EndTime,

AT.Duration,

AT.BegRank,

AT.EndRank,

AT.BegCount,

AT.EndCount,

GroupCount = CASE

WHEN AT.BegCount < AT.EndCount THEN AT.EndCount

WHEN AT.BegCount > AT.EndCount THEN AT.BegCount

ELSE AT.BegCount

END,

GroupRank = CASE

WHEN AT.BegCount < AT.EndCount THEN AT.EndRank

WHEN AT.BegCount > AT.EndCount THEN AT.BegRank

ELSE AT.EndRank

END

FROMAddCount AS AT

)

SELECTDG.DT,

DG.ClientID,

DG.BegTime,

DG.EndTime,

DG.Duration,

GroupNum = 'Group' + CAST(DENSE_RANK() OVER (PARTITION BY DG.DT ORDER BY DG.GroupRank) AS VARCHAR),

DG.GroupCount

FROMDeriveGroupCount AS DG;

I've using the windows functions partitioning by the day, removing this gets the results you put as expected, but wasn't sure if that was a typo or not

helal.mobasher 13209 SSCommitted Points: 1526 More actions · Answer 4

Thank you for sharing the code. For most part, the code is working since I can get no of groups from your GroupNum column. However, GrpSize (GroupCount) should be 7 for the Group1 and 5 for Group2. Yours come up like:

DTClientIDBegTimeEndTimeDurationGroupNumGroupCount

2015-05-06 00:00:00.000110:30:00.000000012:00:00.000000090Group17

2015-05-06 00:00:00.000210:30:00.000000012:00:00.000000090Group17

2015-05-06 00:00:00.000410:30:00.000000012:00:00.000000090Group17

2015-05-06 00:00:00.000510:30:00.000000012:00:00.000000090Group17

2015-05-06 00:00:00.000610:30:00.000000012:00:00.000000090Group17

2015-05-06 00:00:00.000810:30:00.000000012:00:00.000000090Group17

2015-05-06 00:00:00.0001011:00:00.000000012:00:00.000000060Group17

2015-05-06 00:00:00.000212:30:00.000000013:00:00.000000030Group24

2015-05-06 00:00:00.000312:30:00.000000014:00:00.000000090Group24

2015-05-06 00:00:00.000712:30:00.000000014:00:00.000000090Group24

2015-05-06 00:00:00.000912:30:00.000000014:00:00.000000090Group24

2015-05-07 00:00:00.0001012:30:00.000000014:00:00.000000090Group11

Thank You,

Helal

Dohsan SSCrazy Points: 2972 More actions · Answer 5

helal.mobasher 13209 (6/16/2015)
Thank you for sharing the code. For most part, the code is working since I can get no of groups from your GroupNum column. However, GrpSize (GroupCount) should be 7 for the Group1 and 5 for Group2. Yours come up like:
DTClientIDBegTimeEndTimeDurationGroupNumGroupCount
2015-05-06 00:00:00.000110:30:00.000000012:00:00.000000090Group17
2015-05-06 00:00:00.000210:30:00.000000012:00:00.000000090Group17
2015-05-06 00:00:00.000410:30:00.000000012:00:00.000000090Group17
2015-05-06 00:00:00.000510:30:00.000000012:00:00.000000090Group17
2015-05-06 00:00:00.000610:30:00.000000012:00:00.000000090Group17
2015-05-06 00:00:00.000810:30:00.000000012:00:00.000000090Group17
2015-05-06 00:00:00.0001011:00:00.000000012:00:00.000000060Group17
2015-05-06 00:00:00.000212:30:00.000000013:00:00.000000030Group24
2015-05-06 00:00:00.000312:30:00.000000014:00:00.000000090Group24
2015-05-06 00:00:00.000712:30:00.000000014:00:00.000000090Group24
2015-05-06 00:00:00.000912:30:00.000000014:00:00.000000090Group24
2015-05-07 00:00:00.0001012:30:00.000000014:00:00.000000090Group11
Thank You,
Helal

Helal - if you would see my post above, I was questioning whether that particular row should be grouped with the rest, as it is on a different day. If day is irrelevant see the comments in the code, you can remove the PARTITION BY DT from the window functions, it should then provide the expected output

USE tempdb;

DECLARE @GroupTestData TABLE

(

DT DATETIME NOT NULL,

ClientID INT NOT NULL,

BegTime TIME NOT NULL,

EndTime TIME NOT NULL,

Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

);

INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

VALUES('20150506',1,'10:30','12:00'),

('20150506',2,'10:30','12:00'),

('20150506',4,'10:30','12:00'),

('20150506',5,'10:30','12:00'),

('20150506',6,'10:30','12:00'),

('20150506',8,'10:30','12:00'),

('20150506',10,'11:00','12:00'),

('20150506',2,'12:30','13:00'),

('20150506',3,'12:30','14:00'),

('20150506',7,'12:30','14:00'),

('20150506',9,'12:30','14:00'),

('20150507',10,'12:30','14:00');

WITH AddRank

AS

(

SELECTGT.DT,

GT.ClientID,

GT.BegTime,

GT.EndTime,

GT.Duration,

--Remove Partition by DT if you want times grouped regardless of day)

--BegRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.BegTime),

--EndRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.EndTime)

BegRank = DENSE_RANK() OVER (ORDER BY GT.BegTime),

EndRank = DENSE_RANK() OVER (ORDER BY GT.EndTime)

FROM@GroupTestData AS GT

),

AddCount

AS

(

SELECTAR.DT,

AR.ClientID,

AR.BegTime,

AR.EndTime,

AR.Duration,

AR.BegRank,

AR.EndRank,

--BegCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.BegRank),

--EndCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.EndRank)

BegCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.BegRank),

EndCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.EndRank)

FROMAddRank AS AR

),

DeriveGroupCount

AS

(

SELECTAT.DT,

AT.ClientID,

AT.BegTime,

AT.EndTime,

AT.Duration,

AT.BegRank,

AT.EndRank,

AT.BegCount,

AT.EndCount,

GroupCount = CASE

WHEN AT.BegCount < AT.EndCount THEN AT.EndCount

WHEN AT.BegCount > AT.EndCount THEN AT.BegCount

ELSE AT.BegCount

END,

GroupRank = CASE

WHEN AT.BegCount < AT.EndCount THEN AT.EndRank

WHEN AT.BegCount > AT.EndCount THEN AT.BegRank

ELSE AT.EndRank

END

FROMAddCount AS AT

)

SELECTDG.DT,

DG.ClientID,

DG.BegTime,

DG.EndTime,

DG.Duration,

--GroupNum = 'Group' + CAST(DENSE_RANK() OVER (PARTITION BY DG.DT ORDER BY DG.GroupRank) AS VARCHAR),

GroupNum = 'Group' + CAST(DENSE_RANK() OVER (ORDER BY DG.GroupRank) AS VARCHAR),

DG.GroupCount

FROMDeriveGroupCount AS DG;

helal.mobasher 13209 SSCommitted Points: 1526 More actions · Answer 6

Dohsan,

My bad of overlooking your question in the first post. You were right and that was a typo. Thus date matters and 5/7 should have been 5/6. Thank you for your help.

Helal

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 7

Here's another solution... I haven't tested it against Dohsan's solution, so I can't tell you which will perform better.

(This vast majority of the cost on the following is in sorts... The majority of which should be able to be addressed with the correct covering index)

-- Dosan's Test Data --

DECLARE @GroupTestData TABLE

(

DT DATETIME NOT NULL,

ClientID INT NOT NULL,

BegTime TIME NOT NULL,

EndTime TIME NOT NULL,

Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

);

INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

VALUES('20150506',1,'10:30','12:00'),

('20150506',2,'10:30','12:00'),

('20150506',4,'10:30','12:00'),

('20150506',5,'10:30','12:00'),

('20150506',6,'10:30','12:00'),

('20150506',8,'10:30','12:00'),

('20150506',10,'11:00','12:00'),

('20150506',2,'12:30','13:00'),

('20150506',3,'12:30','14:00'),

('20150506',7,'12:30','14:00'),

('20150506',9,'12:30','14:00'),

('20150507',10,'12:30','14:00');

-- The Actual Solution --

;WITH FindGroupStarts AS (

SELECT

gtd.DT,

gtd.ClientID,

gtd.BegTime,

gtd.EndTime,

ROW_NUMBER() OVER (PARTITION BY gtd.ClientID ORDER BY gtd.DT, gtd.BegTime, gtd.ClientID) rn

FROM

@GroupTestData gtd

), PutRowInGroups AS (

SELECT

fgs.DT,

fgs.ClientID,

fgs.BegTime,

fgs.EndTime,

MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID ROWS UNBOUNDED PRECEDING) GroupNum

FROM

FindGroupStarts fgs

), MaxClientOfGroup AS (

SELECT

prg.GroupNum,

MAX(prg.ClientID) AS MaxClientID

FROM

PutRowInGroups prg

GROUP BY

prg.GroupNum

)

SELECT

prg.DT,

prg.ClientID,

prg.BegTime,

prg.EndTime,

CASE

WHEN prg.ClientID = mcg.MaxClientID

THEN 'Grp' + CAST(prg.GroupNum AS VARCHAR(10)) + '=' + CAST(COUNT(*) OVER (PARTITION BY prg.GroupNum) AS VARCHAR(10))

ELSE ''

END AS GrpSize

FROM

PutRowInGroups prg

JOIN MaxClientOfGroup mcg

ON prg.GroupNum = mcg.GroupNum

helal.mobasher 13209 SSCommitted Points: 1526 More actions · Answer 8

Doohsan,

I needed to add couple more variables into the partition statement. Thus, I modified the data and codes accordingly (see below). First I get this error message:

The Parallel Data Warehouse (PDW) features are not enabled.

Second, am I doing this correctly since GroupCount seems to be ok but not GroupRank that is all 1!

Thank You Helal

DECLARE @GroupTestData TABLE

(

DT DATETIME NOT NULL,

ClientID INT NOT NULL,

ProvID INT NOT NULL,

Svctype char (5) NOT NULL,

BegTime TIME NOT NULL,

EndTime TIME NOT NULL,

Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

);

INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

VALUES

('20150506',1,10,'K7','10:30','12:00'),

('20150506',2,10,'K7','10:30','12:00'),

('20150506',4,10,'K7','10:30','12:00'),

('20150506',5,10,'K7','10:30','12:00'),

('20150506',6,10,'K7','10:30','12:00'),

('20150506',8,10,'K7','10:30','12:00'),

('20150506',10,10,'K7','11:00','12:00'),

('20150506',2,10,'K7','12:30','13:00'),

('20150506',3,10,'K7','12:30','14:00'),

('20150506',7,10,'K7','12:30','14:00'),

('20150506',9,10,'K7','12:30','14:00'),

('20150506',10,10,'K7','12:30','14:00');

; WITH AddRank

AS

(

SELECTGT.DT,

GT.ClientID,

GT.BegTime,

GT.EndTime,

GT.Duration,

--Remove Partition by DT if you want times grouped regardless of day)

BegRank = DENSE_RANK() OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY GT.BegTime),

EndRank = DENSE_RANK() OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY GT.EndTime)

FROM@GroupTestData AS GT

),

AddCount

AS

(

SELECTAR.DT,

AR.ClientID,

AR.BegTime,

AR.EndTime,

AR.Duration,

AR.BegRank,

AR.EndRank,

BegCount = COUNT(AR.ClientID) OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY AR.BegRank),

EndCount = COUNT(AR.ClientID) OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY AR.EndRank)

FROMAddRank AS AR

),

DeriveGroupCount

AS

(

SELECTAT.DT,

AT.ClientID,

AT.BegTime,

AT.EndTime,

AT.Duration,

AT.BegRank,

AT.EndRank,

AT.BegCount,

AT.EndCount,

GroupCount = CASE

WHEN AT.BegCount < AT.EndCount THEN AT.EndCount

WHEN AT.BegCount > AT.EndCount THEN AT.BegCount

ELSE AT.BegCount

END,

GroupRank = CASE

WHEN AT.BegCount < AT.EndCount THEN AT.EndRank

WHEN AT.BegCount > AT.EndCount THEN AT.BegRank

ELSE AT.EndRank

END

FROMAddCount AS AT

)

SELECTDG.DT,

DG.ClientID,

DG.BegTime,

DG.EndTime,

DG.Duration,

GroupNum = 'Group' + CAST(DENSE_RANK() OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY DG.GroupRank) AS VARCHAR),

DG.GroupCount

FROMDeriveGroupCount AS DG;

helal.mobasher 13209 SSCommitted Points: 1526 More actions · Answer 9

Thank You. Coupl eissues:

1. I think there was a missing close parentheses in this statement right before ROWS. I added that.

\MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum

Then, I get this error:

Msg 11305, Level 15, State 10, Line 43

The Parallel Data Warehouse (PDW) features are not enabled.

last, how do I add more variables such ProvID and SvcType in to the partition code?

Can I do this:

MAX(fgs.rn) OVER (ORDER BY fgs.provid, fgs.svctype,fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum

Given the new data:

Thank You

DECLARE @GroupTestData TABLE

(

DT DATETIME NOT NULL,

ClientID INT NOT NULL,

BegTime TIME NOT NULL,

EndTime TIME NOT NULL,

Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

);

INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

VALUES('20150506',1,'10:30','12:00'),

('20150506',2,'10:30','12:00'),

('20150506',4,'10:30','12:00'),

('20150506',5,'10:30','12:00'),

('20150506',6,'10:30','12:00'),

('20150506',8,'10:30','12:00'),

('20150506',10,'11:00','12:00'),

('20150506',2,'12:30','13:00'),

('20150506',3,'12:30','14:00'),

('20150506',7,'12:30','14:00'),

('20150506',9,'12:30','14:00'),

('20150507',10,'12:30','14:00');

Jason A. Long SSC-Insane Points: 23786 More actions · Answer 10

helal.mobasher 13209 (6/18/2015)
Thank You. Coupl eissues:
1. I think there was a missing close parentheses in this statement right before ROWS. I added that.
\MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum
Then, I get this error:
Msg 11305, Level 15, State 10, Line 43
The Parallel Data Warehouse (PDW) features are not enabled.
last, how do I add more variables such ProvID and SvcType in to the partition code?
Can I do this:
MAX(fgs.rn) OVER (ORDER BY fgs.provid, fgs.svctype,fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum
Given the new data:
Thank You
DECLARE @GroupTestData TABLE
(
DT DATETIME NOT NULL,
ClientID INT NOT NULL,
BegTime TIME NOT NULL,
EndTime TIME NOT NULL,
Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)
);
INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)
VALUES('20150506',1,'10:30','12:00'),
('20150506',2,'10:30','12:00'),
('20150506',4,'10:30','12:00'),
('20150506',5,'10:30','12:00'),
('20150506',6,'10:30','12:00'),
('20150506',8,'10:30','12:00'),
('20150506',10,'11:00','12:00'),
('20150506',2,'12:30','13:00'),
('20150506',3,'12:30','14:00'),
('20150506',7,'12:30','14:00'),
('20150506',9,'12:30','14:00'),
('20150507',10,'12:30','14:00');

The code executes fine exactly as I posted it. No missing parentheses... (tested on SQL Server Dev Edition)

The following is the correct syntax...

MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID ROWS UNBOUNDED PRECEDING)

I've never had the opportunity to get my mitts on a PDW so I can't say if there are any SQL syntax differences or if any additional functionality needs to be enabled or installed...

HTH,

Jason