Calculate number of groups and group size with multiple criteria

  • My data looks like this:

    I need to calculate the last two columns (noofgrp and grpsize) No of Groups (count of Clientid) and Group Size (number of clients in each group) according to begtim and endtime. So I tried the following in the first Temp table

    GrpSize= count(clientid) over (partition by begtime,endtime) else 0 end

    and in the second Temp Table, I have

    select

    ,GrpSize=sum(grpsize)

    ,NoofGrp=count(distinct grpsize)

    From Temp1

    The issue is for the date of 5/26, the begtime and endtime are not consistent. in Grp1 (group 1) all clients starts the session at 1030 and ends at 1200 (90 minutes session) except one who starts at 11 and end at 1200 (row 8). For this client since his/her endtime is the same as others, I want that client to be in the first group(Grp1). Reverse is true for the second group (Grp2). All clients begtime is 12:30 and endtime is 1400 but clientid=2 (row 9) who begtime =1230 but endtime = 1300. However, since this client begtime is the same as the rest, I wan that client to be in the second group (grp2) My partition over creates 4 groups rather than two. Any suggestions?

    Thank You,

    Helal

  • Please post consumable data -- in the form of CREATE TABLE/INSERT statements or a SELECT statement with UNIONS...

    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • pietlinden (6/15/2015)


    Please post consumable data -- in the form of CREATE TABLE/INSERT statements or a SELECT statement with UNIONS...

    http://www.sqlservercentral.com/articles/Best+Practices/61537/%5B/quote%5D

    I've generated some test data based on your attachment for people to use:

    USE tempdb;

    DECLARE @GroupTestData TABLE

    (

    DT DATETIME NOT NULL,

    ClientID INT NOT NULL,

    BegTime TIME NOT NULL,

    EndTime TIME NOT NULL,

    Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

    );

    INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

    VALUES('20150506',1,'10:30','12:00'),

    ('20150506',2,'10:30','12:00'),

    ('20150506',4,'10:30','12:00'),

    ('20150506',5,'10:30','12:00'),

    ('20150506',6,'10:30','12:00'),

    ('20150506',8,'10:30','12:00'),

    ('20150506',10,'11:00','12:00'),

    ('20150506',2,'12:30','13:00'),

    ('20150506',3,'12:30','14:00'),

    ('20150506',7,'12:30','14:00'),

    ('20150506',9,'12:30','14:00'),

    ('20150507',10,'12:30','14:00');

    One question I have, you've included client 10 from 20150507 in "Group2" even though its on a different day. Is there a business reason for this or just a typo?

  • Sure it can be done more efficiently than this, but this appears to work:

    WITH AddRank

    AS

    (

    SELECTGT.DT,

    GT.ClientID,

    GT.BegTime,

    GT.EndTime,

    GT.Duration,

    --Remove Partition by DT if you want times grouped regardless of day)

    BegRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.BegTime),

    EndRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.EndTime)

    FROM@GroupTestData AS GT

    ),

    AddCount

    AS

    (

    SELECTAR.DT,

    AR.ClientID,

    AR.BegTime,

    AR.EndTime,

    AR.Duration,

    AR.BegRank,

    AR.EndRank,

    BegCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.BegRank),

    EndCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.EndRank)

    FROMAddRank AS AR

    ),

    DeriveGroupCount

    AS

    (

    SELECTAT.DT,

    AT.ClientID,

    AT.BegTime,

    AT.EndTime,

    AT.Duration,

    AT.BegRank,

    AT.EndRank,

    AT.BegCount,

    AT.EndCount,

    GroupCount = CASE

    WHEN AT.BegCount < AT.EndCount THEN AT.EndCount

    WHEN AT.BegCount > AT.EndCount THEN AT.BegCount

    ELSE AT.BegCount

    END,

    GroupRank = CASE

    WHEN AT.BegCount < AT.EndCount THEN AT.EndRank

    WHEN AT.BegCount > AT.EndCount THEN AT.BegRank

    ELSE AT.EndRank

    END

    FROMAddCount AS AT

    )

    SELECTDG.DT,

    DG.ClientID,

    DG.BegTime,

    DG.EndTime,

    DG.Duration,

    GroupNum = 'Group' + CAST(DENSE_RANK() OVER (PARTITION BY DG.DT ORDER BY DG.GroupRank) AS VARCHAR),

    DG.GroupCount

    FROMDeriveGroupCount AS DG;

    I've using the windows functions partitioning by the day, removing this gets the results you put as expected, but wasn't sure if that was a typo or not

  • Thank you for sharing the code. For most part, the code is working since I can get no of groups from your GroupNum column. However, GrpSize (GroupCount) should be 7 for the Group1 and 5 for Group2. Yours come up like:

    DTClientIDBegTimeEndTimeDurationGroupNumGroupCount

    2015-05-06 00:00:00.000110:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000210:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000410:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000510:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000610:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000810:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.0001011:00:00.000000012:00:00.000000060Group17

    2015-05-06 00:00:00.000212:30:00.000000013:00:00.000000030Group24

    2015-05-06 00:00:00.000312:30:00.000000014:00:00.000000090Group24

    2015-05-06 00:00:00.000712:30:00.000000014:00:00.000000090Group24

    2015-05-06 00:00:00.000912:30:00.000000014:00:00.000000090Group24

    2015-05-07 00:00:00.0001012:30:00.000000014:00:00.000000090Group11

    Thank You,

    Helal

  • helal.mobasher 13209 (6/16/2015)


    Thank you for sharing the code. For most part, the code is working since I can get no of groups from your GroupNum column. However, GrpSize (GroupCount) should be 7 for the Group1 and 5 for Group2. Yours come up like:

    DTClientIDBegTimeEndTimeDurationGroupNumGroupCount

    2015-05-06 00:00:00.000110:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000210:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000410:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000510:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000610:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.000810:30:00.000000012:00:00.000000090Group17

    2015-05-06 00:00:00.0001011:00:00.000000012:00:00.000000060Group17

    2015-05-06 00:00:00.000212:30:00.000000013:00:00.000000030Group24

    2015-05-06 00:00:00.000312:30:00.000000014:00:00.000000090Group24

    2015-05-06 00:00:00.000712:30:00.000000014:00:00.000000090Group24

    2015-05-06 00:00:00.000912:30:00.000000014:00:00.000000090Group24

    2015-05-07 00:00:00.0001012:30:00.000000014:00:00.000000090Group11

    Thank You,

    Helal

    Helal - if you would see my post above, I was questioning whether that particular row should be grouped with the rest, as it is on a different day. If day is irrelevant see the comments in the code, you can remove the PARTITION BY DT from the window functions, it should then provide the expected output

    USE tempdb;

    DECLARE @GroupTestData TABLE

    (

    DT DATETIME NOT NULL,

    ClientID INT NOT NULL,

    BegTime TIME NOT NULL,

    EndTime TIME NOT NULL,

    Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

    );

    INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

    VALUES('20150506',1,'10:30','12:00'),

    ('20150506',2,'10:30','12:00'),

    ('20150506',4,'10:30','12:00'),

    ('20150506',5,'10:30','12:00'),

    ('20150506',6,'10:30','12:00'),

    ('20150506',8,'10:30','12:00'),

    ('20150506',10,'11:00','12:00'),

    ('20150506',2,'12:30','13:00'),

    ('20150506',3,'12:30','14:00'),

    ('20150506',7,'12:30','14:00'),

    ('20150506',9,'12:30','14:00'),

    ('20150507',10,'12:30','14:00');

    WITH AddRank

    AS

    (

    SELECTGT.DT,

    GT.ClientID,

    GT.BegTime,

    GT.EndTime,

    GT.Duration,

    --Remove Partition by DT if you want times grouped regardless of day)

    --BegRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.BegTime),

    --EndRank = DENSE_RANK() OVER (PARTITION BY GT.DT ORDER BY GT.EndTime)

    BegRank = DENSE_RANK() OVER (ORDER BY GT.BegTime),

    EndRank = DENSE_RANK() OVER (ORDER BY GT.EndTime)

    FROM@GroupTestData AS GT

    ),

    AddCount

    AS

    (

    SELECTAR.DT,

    AR.ClientID,

    AR.BegTime,

    AR.EndTime,

    AR.Duration,

    AR.BegRank,

    AR.EndRank,

    --BegCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.BegRank),

    --EndCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.DT, AR.EndRank)

    BegCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.BegRank),

    EndCount = COUNT(AR.ClientID) OVER (PARTITION BY AR.EndRank)

    FROMAddRank AS AR

    ),

    DeriveGroupCount

    AS

    (

    SELECTAT.DT,

    AT.ClientID,

    AT.BegTime,

    AT.EndTime,

    AT.Duration,

    AT.BegRank,

    AT.EndRank,

    AT.BegCount,

    AT.EndCount,

    GroupCount = CASE

    WHEN AT.BegCount < AT.EndCount THEN AT.EndCount

    WHEN AT.BegCount > AT.EndCount THEN AT.BegCount

    ELSE AT.BegCount

    END,

    GroupRank = CASE

    WHEN AT.BegCount < AT.EndCount THEN AT.EndRank

    WHEN AT.BegCount > AT.EndCount THEN AT.BegRank

    ELSE AT.EndRank

    END

    FROMAddCount AS AT

    )

    SELECTDG.DT,

    DG.ClientID,

    DG.BegTime,

    DG.EndTime,

    DG.Duration,

    --GroupNum = 'Group' + CAST(DENSE_RANK() OVER (PARTITION BY DG.DT ORDER BY DG.GroupRank) AS VARCHAR),

    GroupNum = 'Group' + CAST(DENSE_RANK() OVER (ORDER BY DG.GroupRank) AS VARCHAR),

    DG.GroupCount

    FROMDeriveGroupCount AS DG;

  • Dohsan,

    My bad of overlooking your question in the first post. You were right and that was a typo. Thus date matters and 5/7 should have been 5/6. Thank you for your help.

    Helal

  • Here's another solution... I haven't tested it against Dohsan's solution, so I can't tell you which will perform better.

    (This vast majority of the cost on the following is in sorts... The majority of which should be able to be addressed with the correct covering index)

    -- Dosan's Test Data --

    DECLARE @GroupTestData TABLE

    (

    DT DATETIME NOT NULL,

    ClientID INT NOT NULL,

    BegTime TIME NOT NULL,

    EndTime TIME NOT NULL,

    Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

    );

    INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

    VALUES('20150506',1,'10:30','12:00'),

    ('20150506',2,'10:30','12:00'),

    ('20150506',4,'10:30','12:00'),

    ('20150506',5,'10:30','12:00'),

    ('20150506',6,'10:30','12:00'),

    ('20150506',8,'10:30','12:00'),

    ('20150506',10,'11:00','12:00'),

    ('20150506',2,'12:30','13:00'),

    ('20150506',3,'12:30','14:00'),

    ('20150506',7,'12:30','14:00'),

    ('20150506',9,'12:30','14:00'),

    ('20150507',10,'12:30','14:00');

    -- The Actual Solution --

    ;WITH FindGroupStarts AS (

    SELECT

    gtd.DT,

    gtd.ClientID,

    gtd.BegTime,

    gtd.EndTime,

    ROW_NUMBER() OVER (PARTITION BY gtd.ClientID ORDER BY gtd.DT, gtd.BegTime, gtd.ClientID) rn

    FROM

    @GroupTestData gtd

    ), PutRowInGroups AS (

    SELECT

    fgs.DT,

    fgs.ClientID,

    fgs.BegTime,

    fgs.EndTime,

    MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID ROWS UNBOUNDED PRECEDING) GroupNum

    FROM

    FindGroupStarts fgs

    ), MaxClientOfGroup AS (

    SELECT

    prg.GroupNum,

    MAX(prg.ClientID) AS MaxClientID

    FROM

    PutRowInGroups prg

    GROUP BY

    prg.GroupNum

    )

    SELECT

    prg.DT,

    prg.ClientID,

    prg.BegTime,

    prg.EndTime,

    CASE

    WHEN prg.ClientID = mcg.MaxClientID

    THEN 'Grp' + CAST(prg.GroupNum AS VARCHAR(10)) + '=' + CAST(COUNT(*) OVER (PARTITION BY prg.GroupNum) AS VARCHAR(10))

    ELSE ''

    END AS GrpSize

    FROM

    PutRowInGroups prg

    JOIN MaxClientOfGroup mcg

    ON prg.GroupNum = mcg.GroupNum

  • Doohsan,

    I needed to add couple more variables into the partition statement. Thus, I modified the data and codes accordingly (see below). First I get this error message:

    The Parallel Data Warehouse (PDW) features are not enabled.

    Second, am I doing this correctly since GroupCount seems to be ok but not GroupRank that is all 1!

    Thank You Helal

    DECLARE @GroupTestData TABLE

    (

    DT DATETIME NOT NULL,

    ClientID INT NOT NULL,

    ProvID INT NOT NULL,

    Svctype char (5) NOT NULL,

    BegTime TIME NOT NULL,

    EndTime TIME NOT NULL,

    Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

    );

    INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

    VALUES

    ('20150506',1,10,'K7','10:30','12:00'),

    ('20150506',2,10,'K7','10:30','12:00'),

    ('20150506',4,10,'K7','10:30','12:00'),

    ('20150506',5,10,'K7','10:30','12:00'),

    ('20150506',6,10,'K7','10:30','12:00'),

    ('20150506',8,10,'K7','10:30','12:00'),

    ('20150506',10,10,'K7','11:00','12:00'),

    ('20150506',2,10,'K7','12:30','13:00'),

    ('20150506',3,10,'K7','12:30','14:00'),

    ('20150506',7,10,'K7','12:30','14:00'),

    ('20150506',9,10,'K7','12:30','14:00'),

    ('20150506',10,10,'K7','12:30','14:00');

    ; WITH AddRank

    AS

    (

    SELECTGT.DT,

    GT.ClientID,

    GT.BegTime,

    GT.EndTime,

    GT.Duration,

    --Remove Partition by DT if you want times grouped regardless of day)

    BegRank = DENSE_RANK() OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY GT.BegTime),

    EndRank = DENSE_RANK() OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY GT.EndTime)

    FROM@GroupTestData AS GT

    ),

    AddCount

    AS

    (

    SELECTAR.DT,

    AR.ClientID,

    AR.BegTime,

    AR.EndTime,

    AR.Duration,

    AR.BegRank,

    AR.EndRank,

    BegCount = COUNT(AR.ClientID) OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY AR.BegRank),

    EndCount = COUNT(AR.ClientID) OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY AR.EndRank)

    FROMAddRank AS AR

    ),

    DeriveGroupCount

    AS

    (

    SELECTAT.DT,

    AT.ClientID,

    AT.BegTime,

    AT.EndTime,

    AT.Duration,

    AT.BegRank,

    AT.EndRank,

    AT.BegCount,

    AT.EndCount,

    GroupCount = CASE

    WHEN AT.BegCount < AT.EndCount THEN AT.EndCount

    WHEN AT.BegCount > AT.EndCount THEN AT.BegCount

    ELSE AT.BegCount

    END,

    GroupRank = CASE

    WHEN AT.BegCount < AT.EndCount THEN AT.EndRank

    WHEN AT.BegCount > AT.EndCount THEN AT.BegRank

    ELSE AT.EndRank

    END

    FROMAddCount AS AT

    )

    SELECTDG.DT,

    DG.ClientID,

    DG.BegTime,

    DG.EndTime,

    DG.Duration,

    GroupNum = 'Group' + CAST(DENSE_RANK() OVER (PARTITION BY GT.PROVID, GT.SVCTYPE, GT.DT ORDER BY DG.GroupRank) AS VARCHAR),

    DG.GroupCount

    FROMDeriveGroupCount AS DG;

  • Thank You. Coupl eissues:

    1. I think there was a missing close parentheses in this statement right before ROWS. I added that.

    \MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum

    Then, I get this error:

    Msg 11305, Level 15, State 10, Line 43

    The Parallel Data Warehouse (PDW) features are not enabled.

    last, how do I add more variables such ProvID and SvcType in to the partition code?

    Can I do this:

    MAX(fgs.rn) OVER (ORDER BY fgs.provid, fgs.svctype,fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum

    Given the new data:

    Thank You

    DECLARE @GroupTestData TABLE

    (

    DT DATETIME NOT NULL,

    ClientID INT NOT NULL,

    BegTime TIME NOT NULL,

    EndTime TIME NOT NULL,

    Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

    );

    INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

    VALUES('20150506',1,'10:30','12:00'),

    ('20150506',2,'10:30','12:00'),

    ('20150506',4,'10:30','12:00'),

    ('20150506',5,'10:30','12:00'),

    ('20150506',6,'10:30','12:00'),

    ('20150506',8,'10:30','12:00'),

    ('20150506',10,'11:00','12:00'),

    ('20150506',2,'12:30','13:00'),

    ('20150506',3,'12:30','14:00'),

    ('20150506',7,'12:30','14:00'),

    ('20150506',9,'12:30','14:00'),

    ('20150507',10,'12:30','14:00');

  • helal.mobasher 13209 (6/18/2015)


    Thank You. Coupl eissues:

    1. I think there was a missing close parentheses in this statement right before ROWS. I added that.

    \MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum

    Then, I get this error:

    Msg 11305, Level 15, State 10, Line 43

    The Parallel Data Warehouse (PDW) features are not enabled.

    last, how do I add more variables such ProvID and SvcType in to the partition code?

    Can I do this:

    MAX(fgs.rn) OVER (ORDER BY fgs.provid, fgs.svctype,fgs.DT, fgs.BegTime, fgs.ClientID) ROWS UNBOUNDED PRECEDING) GroupNum

    Given the new data:

    Thank You

    DECLARE @GroupTestData TABLE

    (

    DT DATETIME NOT NULL,

    ClientID INT NOT NULL,

    BegTime TIME NOT NULL,

    EndTime TIME NOT NULL,

    Duration AS CAST(DATEDIFF(MI,BegTime,EndTime) AS INT)

    );

    INSERT INTO @GroupTestData (DT,ClientID,BegTime,EndTime)

    VALUES('20150506',1,'10:30','12:00'),

    ('20150506',2,'10:30','12:00'),

    ('20150506',4,'10:30','12:00'),

    ('20150506',5,'10:30','12:00'),

    ('20150506',6,'10:30','12:00'),

    ('20150506',8,'10:30','12:00'),

    ('20150506',10,'11:00','12:00'),

    ('20150506',2,'12:30','13:00'),

    ('20150506',3,'12:30','14:00'),

    ('20150506',7,'12:30','14:00'),

    ('20150506',9,'12:30','14:00'),

    ('20150507',10,'12:30','14:00');

    The code executes fine exactly as I posted it. No missing parentheses... (tested on SQL Server Dev Edition)

    The following is the correct syntax...

    MAX(fgs.rn) OVER (ORDER BY fgs.DT, fgs.BegTime, fgs.ClientID ROWS UNBOUNDED PRECEDING)

    I've never had the opportunity to get my mitts on a PDW so I can't say if there are any SQL syntax differences or if any additional functionality needs to be enabled or installed...

    HTH,

    Jason

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply