Need help with query to generate a Matrix

  • I have a requirement for a query that, based on a set of data, it should return a matrix. The premise is this: There are 5 groups of people. Each person may belong to a single group at a time, but can switch groups from one month to the next. My matrix needs to have the count of people that changed groups duing the month. The matrix needs to look like the following:

    G1 G2 G3 G4 G5

    G1 0 1 0 4 2

    G2 1 0 1 1 1

    G3 3 0 0 2 7

    ...

    The intersection of a group with itself will always be 0 since one cannot change from group G1 to group G1 for example. The matrix shows that in the last month, from group G1, 1 person switched over to G2, 0 switched to G3, 4 switched to G4, and 2 switched to G5. From group G2, 1 person switched to G1, nobody switched to G2 or G3, and 1 person each switched to G3, G4 and G5; from G3, 3 people switched from G3 to G1, 2 to G4, and 7 to G5.

    Please let me know if this makes sense or not. Any help in terms of what the table structure and query should look like will be greatly appreciated.

    Michael

  • At the beginning of your post you said "based on a set of data"; but at the end "Any help in terms of what the table structure ... should look like...".

    That's rather confusing. If you could provide your current "set of data", we might be able to help you writing the query. You might also need to describe the business rules for people joining a group "from outside" or leaving the groups in question.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Thank you for the follow up. The "set of data" I refer to come from several sources which I need to consolidate into the structure that I will query to build the matrix. I can pretty much transform it into anything I need (the resulting table structure.) Ideally, I would like it to be a singe table since it will contain a lot of data and will be used for reporting (i.e. generating the matrix.)

    The tricky part, I think, is querying the resulting data structure (whatever that ends up being) to produce the matrix as I outlined previously. The "rules", if you will, are:

    1. A person belonging to a group can move to another group once per month, or stay in the current group

    2. Any number of persons can join the various groups even if they don't currently belong to one

    If we use a more life-like example, think of this.

    1, A person belonging to a bank can switch to another bank once per month. I need to keep track of how many people switch banks, and to which banks they switch to/from. This is why the column headers and the row names are the same.

    2. A person without an account, opens an account at a bank. This would add 1 to the count of people doing business with that bank, and would show in the matrix as well.

    3. A person cannot switch to the same bank they currently do business with (i.e. The intersection of the row and column with the same name will always be 0.)

    You can take any type of data and model this; I'm looking for an efficient and maintainable implementation.

    Hope this helps!

    Michael

  • 2. Any number of persons can join the various groups even if they don't currently belong to one

    How would your expected output look like if all the changes for one month would be:

    One person moved from "the outside" to G1,

    One person moved from G2 to G3 and

    One person left G4 (to "the outside")?

    Are you sure the business model will never change to allow a moving to another group more than once per month? I'd rather be prepared for it... So you'd need to define how to deal with people moving G1 to G2 and back to G1 or moving from G1 to G2 to G3.

    Would you be able to provide a table structure like this:

    Month (datetime) from_Grp to_Grp no_of_people

    2011-01-01 3 5 6

    2011-01-01 2 1 1

    2011-02-01 3 0 1

    2011-02-01 2 1 0

    Another question would be, if the number of groups will be static or dynamic.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • LutzM (1/13/2011)


    2. Any number of persons can join the various groups even if they don't currently belong to one

    How would your expected output look like if all the changes for one month would be:

    One person moved from "the outside" to G1,

    One person moved from G2 to G3 and

    One person left G4 (to "the outside")?

    G1 = G1 + 1

    G2 = G2 -1 and G3 = G3 + 1

    G4 = G4 - 1

    Are you sure the business model will never change to allow a moving to another group more than once per month? I'd rather be prepared for it... So you'd need to define how to deal with people moving G1 to G2 and back to G1 or moving from G1 to G2 to G3.

    Yes, I'm sure. The data provider only updates once per month. It's too much data to do it otherwise. If a person goes from G1 to G2 and back to G1, we would not see that in the delta anyway.

    Another question would be, if the number of groups will be static or dynamic.

    Static...sort of. A change may come every so often (haven't had one in more than a year). I'll need to weigh the cost and complexity of making the solution dynamic vs. making it static and modifying it if/when the time comes in order to make the decision.

  • Almost understood.

    In your initial post you stated

    My matrix needs to have the count of people that changed groups duing the month.

    Referencing my previous post, how do you get G1 = G1 + 1 instead of G1 = 1?

    Also, please modify the values in the following matrix based on the scenario I described (being the only changes that occured in a given month) to reflect your expected output.

    Changes:

    One person moved from "the outside" to G1,

    One person moved from G2 to G3 and

    One person left G4 (to "the outside").

    Expected output:

    __G1 G2 G3 G4 G5

    G1 0 0 0 0 0

    G2 0 0 0 0 0

    G3 0 0 0 0 0

    G4 0 0 0 0 0

    G5 0 0 0 0 0



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • LutzM (1/13/2011)


    Almost understood.

    In your initial post you stated

    My matrix needs to have the count of people that changed groups duing the month.

    Referencing my previous post, how do you get G1 = G1 + 1 instead of G1 = 1?

    I apologize. Thinking about it, all I need is the count of people moving between groups so I don't care about those joining a group for the first time, those leaving a group (and not joining another one), or the total number of members in the group. I'll take the liberty to modify your example a bit.

    Also, please modify the values in the following matrix based on the scenario I described (being the only changes that occured in a given month) to reflect your expected output.

    Changes:

    One person moved from "the outside" to G1,

    One three persons moved from G2 to G3 and

    One person left G4 (to "the outside").

    Expected output based on your scenario:

    __G1 G2 G3 G4 G5

    G1 0 0 0 0 0

    G2 0 0 3 0 0

    G3 0 0 0 0 0

    G4 0 0 0 0 0

    G5 0 0 0 0 0

  • CELKO (1/13/2011)


    SQL uses tables and not matrice. You also seem to want to format data for display in the dataabse, which violated the whoel idea of a tiered archieture.

    My thought is a table of evetns:

    CREATE TABLE GroupEvents

    (member_id CHAR(10) NOT NULL,

    group_id CHAR(2) NOT NULL

    CHECK (group_id IN ('G1', 'G2', 'G3', 'G4', 'G5')),

    membership_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,

    resignation_date DATE, --- null means currrent member

    CHECK (membership_date <= resignation_date),

    PRIMARY KEY (emp_id, membership_date)

    );

    Here is a quick way to find guys who stayed all month

    SELECT G1.member_id, group_id, 'No change'

    FROM GroupEvents AS G1

    WHERE G1.membership_date <= '2011-01-01'

    AND resignation_date IS NULL

    GROUP BY G1.member_id, group_id;

    We can modify this for changes, but we don't have specs that handle someone who changed groups more than twice in on month. Can I assume this is disallowed?

    Great post.

    Except it neither does answer the original question (Why on earth would I need "a quick way to find guys who stayed all month"?), nor does your query return correct results for December 2010 if the person moved in January 2011 nor will your query return the same results for any given month under any given DATEFORMAT or LANGUAGE setting (Strong hint: use SET DATEFORMAT to YMD and YDM and query each version against '2011-10-01').



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hey Joe,

    Thanks for your reply. There are about 100 groups or so that I need to track. But I only need to track the number of people moving from one group to another. Since my data provider only gives us updates once per month, I wouldn't know if a person moved back and forth between groups during the month. I would only know the last group he/she moved to, so you can say it is not allowed.

    I also don't care about those that leave all groups, or join a group for the first time since those are not moving between groups.

    Thanks!

  • From my point of view we have a few options for the base table design:

    a) using Joe's table design

    In this case you could easily end up with inconsistent data like below:

    member_id, group_id, membership_date, resignation_date

    ABCD,G1,20110101,20110218

    ABCD,G2,20110102,NULL

    Those two rows would be allowed to be inserted into the table he recommended. I have no idea how he'd be able to query this table...

    b) using a table design where you insert a new row as soon as a person joins a new group. However, you'd be able to insert a new person "from nowhere", but you would need a group_id to identify a person "leaving the orbit", e.g. group_id='G0'

    c) depending on how you want to use the data (e.g. access frequency and the like) and how much you would gain out of it you might need to preaggregate the data. in that case I'd use a design like

    CREATE TABLE AggregatedGroupEvents

    (YearMonth DATE NOT NULL,

    from_group_id CHAR(2) NOT NULL

    CHECK (group_id IN ('G1', 'G2', 'G3', 'G4', 'G5')),

    to_group_id CHAR(2) NOT NULL

    CHECK (group_id IN ('G1', 'G2', 'G3', 'G4', 'G5')),

    CHECK (from_group_id < > to_group_id),

    people_moved INT DEFAULT 0 NOT NULL,

    PRIMARY KEY (YearMonth, from_group_id,to_group_id)

    );

    And populate it on a scheduled base with the aggregated data of the previous month.

    This approach does not follow any kind of normalization. But sometimes preaggregation will help improve performance.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Thank you Lutz and Joe for your contributions. It is good to know that what I had been thinking about doing is pretty much the same as your suggestions. I also posed the same question to a couple of other people and they both came up with similar answers. To get the resultset I need, I added a PIVOT statement to create the matrix. So, here's a sample of what the end result will look like:

    DECLARE @temp TABLE (FromGroup nvarchar(50), ToGroup nvarchar(50), ChurnCount int, ChurnMonth tinyint, ChurnYear smallint)

    INSERT INTO @temp

    ( FromGroup ,

    ToGroup ,

    ChurnCount,

    ChurnMonth,

    ChurnYear

    )

    SELECT 'G1', 'G1', 0, 1, 2011

    UNION

    SELECT 'G1', 'G2', 2, 1, 2011

    UNION

    SELECT 'G2', 'G2', 0, 1, 2011

    UNION

    SELECT 'G2', 'G1', 1, 12, 2010

    UNION

    SELECT 'G3', 'G3', 0, 1, 2011

    UNION

    SELECT 'G3', 'G4', 1, 12, 2010

    UNION

    SELECT 'G4', 'G4', 0, 1, 2011

    UNION

    SELECT 'G4', 'G2', 2, 1, 2011

    UNION

    SELECT 'G1', 'G2', 50, 12, 2010

    SELECT FromGroup as GroupName, COALESCE([G1], 0) AS G1, COALESCE([G2], 0) as G2, COALESCE([G3], 0) AS G3, COALESCE([G4], 0) as G4

    FROM @Temp

    PIVOT

    (

    SUM(ChurnCount)

    FOR [ToGroup] IN ([G1], [G2], [G3], [G4])

    )

    AS p

    WHERE ChurnMonth = 12

    AND ChurnYear = 2010

    --WHERE ChurnMonth = 1

    --AND ChurnYear = 2011

  • Joe,

    why do you insist in using the resignation_date column?

    As mentioned in my previous post, there is a high chance of data inconsistency.

    Example (repeated from before):

    member_id, group_id, membership_date, resignation_date

    ABCD,G1,20110101,20110218

    ABCD,G2,20110102,NULL

    This is a valid entry as per your table def. But what does it mean?

    Another issue is your CHECK (membership_date < resignation_date) constraint. How will you insert a row representing the current member of a group? :ermm:

    Even your primary key won't help:

    member_id, group_id, membership_date, resignation_date

    ABCD,G1,20110101,20110218

    ABCD,G2,20110102,20110218

    ABCD,G3,20110103,NULL

    This {Edit: The [resignation_date]} column also hold duplicate (logical) information, since current resignation_date = next membership_date. And that's the reason I strongly vote against using this column. There are numerous other methods to link to the previous row that would provide consistent data. Unlike your approach.

    Please help me understand the benefit of the resignation_date column. TIA.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply