Normalization Question

  • GSquared (2/28/2011)


    From what I can see of your table, you have essentially a full cartesian join of three tables: Program, Year, Major. That means you have a number of rows equal to the number of rows in each table multiplied together. If you have 3 programs, 5 years, 2 majors, you have 30 rows (3 * 5 * 2). That's what it looks like to me.

    Per what I'm reading in your statement, you'd have 5 rows (the number of rows in the largest table).

    I'm obviously missing something, but I don't know what.

    In order to help, I need a more clear definition of the tables involved and the data in them. Right now, it looks like the usual denormalized data I've seen in dozens of data warehousing solutions. Per your statements, it's not like that at all. I need to see more in order to understand how/why.

    If in your example you mean that each program has 5 years and 2 majors to it, then my table would look something like this:

    Program,Year,Major

    Education,2006,Math

    Education,2007,Biology

    Education,2008,NULL

    Education,2009,NULL

    Education,2010,NULL

    Mental Health,2006,Psychology

    Mental Health,2007,Social Work

    Mental Health,2008,NULL

    Mental Health,2009,NULL

    Mental Health,2010,NULL

    Human Services,2006,Nursing

    Human Services,2007,Physical Therapy

    Human Services,2008,NULL

    Human Services,2009,NULL

    Human Services,2010,NULL

    Again, in your example, you said "If you have 3 programs, 5 years, 2 majors" but based on the table structure that I have, you have to think of it on a program by program basis:

    Mental Health could just as easily have 2 years, and 5 majors (still 5 rows, just NULLs in a different spot).

    Human services could have 6 years and 1 major (six rows, NULLs for 2-6 of Major field).

    I don't think it's a Cartesian product, because I see what kind of result that would produce and it's not the same.

    Per what I'm reading in your statement, you'd have 5 rows (the number of rows in the largest table).

    If you were to create tables out of Program/<other field> combinations (e.g. Program/Year, Program/Major), then the number of rows I'd have for each program would be the number of rows in the largest table after having queried each table for that program.

    Each program could have a different number of rows.

    I know that's confusing, but that's why at the beginning I pasted the DDL to show what the table looks like and how it's storing multiple values (in a weird denormalized way).

    Right now, it looks like the usual denormalized data I've seen in dozens of data warehousing solutions. Per your statements, it's not like that at all. I need to see more in order to understand how/why.

    You're right, it's not like regular denormalized data, and I'm not sure why it's storing it that way. It doesn't make sense necessarily, but it is storing all of the data I would need (be it in an odd way) without duplicating. It just seems that if I wanted to isolate the field values for each program, I'd have to have criteria in my WHERE statement excluding NULLs for every query:

    --programs and years

    SELECT Program, Year

    FROM Programs

    WHERE Year is NOT NULL

    --programs and majors

    SELECT Program, Major

    FROM Programs

    WHERE Major IS NOT NULL

    --unique programs (will never be null)

    SELECT DISTINCT Program

    FROM Programs

    Notice how I need to query the same table for each of these pieces of information. I can't imagine that's very efficient. I'm not sure I have any other choice but to use queries similar to the ones above to get the data in a useful form. If extracting the data and putting them in to normalized tables is my only choice as a first step, then when I'm ready to build a full-fledged data warehouse, I'll worry about creating a star schema, or joining them back together to have a true denormalized situation.

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • I need definitions for the source tables and a clear explanation of what end result you're trying to get.

    Here's part of what you posted as a sample result:

    Program,Year,Major

    Education,2006,Math

    Education,2007,Biology

    Education,2008,NULL

    Education,2009,NULL

    Education,2010,NULL

    Why would you only have Math in 2006, and only have Biology in 2007, and have nothing at all in the other years?

    Here's what I mean by joining the tables together in a cartesian product:

    CREATE TABLE #Programs (

    Program VARCHAR(50) PRIMARY KEY) ;

    CREATE TABLE #Years (Yr CHAR(4) PRIMARY KEY) ;

    CREATE TABLE #Majors (

    Major VARCHAR(50) PRIMARY KEY) ;

    INSERT INTO #Programs

    (Program)

    VALUES ('Education') ;

    INSERT INTO #Programs

    (Program)

    VALUES ('Mental Health') ;

    INSERT INTO #Programs

    (Program)

    VALUES ('Human Services') ;

    INSERT INTO #Years

    (Yr)

    VALUES ('2006') ;

    INSERT INTO #Years

    (Yr)

    VALUES ('2007') ;

    INSERT INTO #Years

    (Yr)

    VALUES ('2008') ;

    INSERT INTO #Years

    (Yr)

    VALUES ('2009') ;

    INSERT INTO #Years

    (Yr)

    VALUES ('2010') ;

    INSERT INTO #Majors

    (Major)

    VALUES ('Major1') ;

    INSERT INTO #Majors

    (Major)

    VALUES ('Major2') ;

    SELECT *

    FROM #Programs

    CROSS JOIN #Years

    CROSS JOIN #Majors

    ORDER BY Program,

    Yr,

    Major ;

    Now, what I need from you is to tell me what rules need to be applied to those tables. Is the data in #Majors tied by a key value of some sort to the data in #Programs, so that some majors apply to some programs and other majors apply to other programs? That's what it looks like from your example, but you haven't communicated what those rules are nor how you're currently looking at implementing them.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared,

    I'm not sure you're looking at this the same way that I am. You said:

    Here's part of what you posted as a sample result:

    Program,Year,Major

    Education,2006,Math

    Education,2007,Biology

    Education,2008,NULL

    Education,2009,NULL

    Education,2010,NULL

    Why would you only have Math in 2006, and only have Biology in 2007, and have nothing at all in the other years?

    That's why I said in a previous post "It's almost as if each record doesn't even matter, and the table must be looked at from a column perspective."

    It's NOT that Math is a major for the Education program in 2006 and Biology is a major for the Education program in 2007.

    The correct way to look at this is like this:

    The Education Program has two majors to it, Math and Biology. The Education Program has been in existence for years 2006, 2007, 2008, 2009, and 2010. Each individual field is related to Program, not to the other fields in its row.

    I'm not sure what else I can say about it and I could eventually get it into a form like you've described (Cartesian). Does it change anything now that you're viewing it from a columnar perspective?

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • You need one table with each row having a Program, a Major, and a Year. It will have one row for each year a major was included in a program. Those three columns are, combined together, the primary key. (Might be the clustered index, too. That'll depend on CRUD patterns in the data.)

    So, if Math was a major in the Education program in 2006 and 2007, that's two rows. If Science was a major in the Education program in 2007 through 2010, that's four more rows (2007, 2008, 2009, 2010).

    If you want to enforce the specific valid values for those columns, create lookup tables and foreign keys to them, but that's a data integrity issue, not a query issue.

    When you want data from it, you just query the one table.

    That's not a data warehouse of any sort, it's just a table.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (3/1/2011)


    You need one table with each row having a Program, a Major, and a Year. It will have one row for each year a major was included in a program. Those three columns are, combined together, the primary key. (Might be the clustered index, too. That'll depend on CRUD patterns in the data.)

    So, if Math was a major in the Education program in 2006 and 2007, that's two rows. If Science was a major in the Education program in 2007 through 2010, that's four more rows (2007, 2008, 2009, 2010).

    If you want to enforce the specific valid values for those columns, create lookup tables and foreign keys to them, but that's a data integrity issue, not a query issue.

    When you want data from it, you just query the one table.

    That's not a data warehouse of any sort, it's just a table.

    I know that's just a table. However, the example you just described, where one major is valid for one set of years within a program and another major is valid for another set of years within the same program would never happen. The system that creates this table would store it like this:

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    SET ANSI_PADDING ON

    GO

    CREATE TABLE [dbo].[Programs](

    [Program] [varchar](50) NULL,

    [Year] [int] NULL,

    [Major] [varchar](25) NULL,

    ) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF

    GO

    INSERT [dbo].[Programs] ([Program], [Year], [Major]) VALUES (N'Education', 2006, N'Math')

    INSERT [dbo].[Programs] ([Program], [Year], [Major]) VALUES (N'Education', 2007, NULL)

    INSERT [dbo].[Programs] ([Program], [Year], [Major]) VALUES (N'NewEduProgram', 2007, N'Science')

    INSERT [dbo].[Programs] ([Program], [Year], [Major]) VALUES (N'NewEduProgram', 2008, NULL)

    INSERT [dbo].[Programs] ([Program], [Year], [Major]) VALUES (N'NewEduProgram', 2009, NULL)

    INSERT [dbo].[Programs] ([Program], [Year], [Major]) VALUES (N'NewEduProgram', 2010, NULL)

    SELECT *

    FROM Programs

    A new year gets added each year, and new majors can be added, but there wouldn't be a case where "this major is valid for this set of years within this program" It's that all the Years apply to the program as long as they're in the table and all of the Majors apply to the program as long as they're in the table. If something needs to be year(s) specific, then they create a new Program. I know that's unconventional, especially in a RDBMS, but it is what it is for now.

    So my ultimate question is still to see if there is a (better, more efficient) way to query the original table to produce results similar to what you're describing.

    If I wanted to find out what were the valid majors and years for the "NewEduProgram" program, this is the messy way I can think to do it:

    SELECT a.Program,

    Major,

    Year

    FROM (SELECT Program,

    Year

    FROM Programs

    WHERE Year IS NOT NULL) a

    INNER JOIN (SELECT Program,

    Major

    FROM Programs

    WHERE Major IS NOT NULL) b

    ON a.Program = b.Program

    WHERE a.Program = 'NewEduProgram'

    ORDER BY Major,

    Year

    In other words, if you could come up with a cleaner, more efficient technique that produces the same results as this one, that would be really good.

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Then you're looking at two tables. One that joins majors to programs, one that joins years to programs. This could be a Majors table that has an FK back to Programs, and the Programs table has a "StartYear" and "EndYear" in it, with the possibility that it's only one year being recorded as the same value in both columns, et al. That will normalize it correctly, and queries become easy to build.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (3/1/2011)


    Then you're looking at two tables. One that joins majors to programs, one that joins years to programs. This could be a Majors table that has an FK back to Programs, and the Programs table has a "StartYear" and "EndYear" in it, with the possibility that it's only one year being recorded as the same value in both columns, et al. That will normalize it correctly, and queries become easy to build.

    But the two tables approach is the only way you can think of to set it up for easy querying? No way to do it the way it is?

    I was just getting a feel for what everyone thought. I'm 99% sure I'm going to normalize all of the tables that are like this.

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • You can do it with one table, by denormalizing, but you'll have one row for each Program, multiplied by the number of Majors, multipled by the number of Years. Otherwise, you have to jump through all kinds of weird hoops to query it and you can expect performance to be horrible if you have more than a very small number of rows.

    Edit to clarify: If you do the one-table version that has one row for each Program, Year, and Major (full cartesian product, as per my first sample with the temp tables and the cross joins), that will be easy to query and run reports against. It just won't update/insert/delete efficiently. That's always the trade-off in normalized vs denormalized. It only gets complicated if you do one table, and some of the rows have years and some don't, and some of the rows have majors and some don't.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 8 posts - 16 through 22 (of 22 total)

You must be logged in to reply to this topic. Login to reply