February 24, 2011 at 10:10 am
I have a system here that is refreshing data and storing multiple values in an odd way (and I have no control over how it stores the data). How would you change this to make it easier to run queries and report off of?
Here is an example that is similar to the real situation:
I have one product in the table called "Shirt." It has six colors and four sizes for each year (and it does have every combination, and assume it always will).
Here's how it's being stored:
Product    Year    Color     Size
---------- ------- --------- -------
Shirt      2006    GREEN     S
Shirt      2007    BLUE      M
Shirt      2008    RED       L
Shirt      2009    YELLOW    XL
Shirt      2010    BLACK     NULL
Shirt      2011    WHITE     NULL
Shirt      2012    NULL      NULL
As far as denormalized tables go, I would think there should be 24 rows for each Shirt/Year combination, right?
Product    Year    Color     Size
---------- ------- --------- -------
Shirt      2006    GREEN     S
Shirt      2006    GREEN     M
Shirt      2006    GREEN     L
Shirt      2006    GREEN     XL
Shirt      2006    BLUE      S
Shirt      2006    BLUE      M
Shirt      2006    BLUE      L
Shirt      2006    BLUE      XL
...
So is there some obscure querying technique that I'm not aware of that would make the original table a valid way to keep the data, or should I normalize it by essentially storing almost every column in its own table, creating bridge tables, and then joining it all back together to end up with the 24 rows per product/year? Does what I'm trying to do, and how I'm thinking about accomplishing it, make sense?
Thanks,
Mike
If you want to create the original table:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Products](
[Product] [varchar](50) NULL,
[Year] [int] NULL,
[Color] [varchar](25) NULL,
[Size] [varchar](10) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
INSERT [dbo].[Products] ([Product], [Year], [Color], [Size]) VALUES (N'Shirt', 2006, N'GREEN', N'S')
INSERT [dbo].[Products] ([Product], [Year], [Color], [Size]) VALUES (N'Shirt', 2007, N'BLUE', N'M')
INSERT [dbo].[Products] ([Product], [Year], [Color], [Size]) VALUES (N'Shirt', 2008, N'RED', N'L')
INSERT [dbo].[Products] ([Product], [Year], [Color], [Size]) VALUES (N'Shirt', 2009, N'YELLOW', N'XL')
INSERT [dbo].[Products] ([Product], [Year], [Color], [Size]) VALUES (N'Shirt', 2010, N'BLACK', NULL)
INSERT [dbo].[Products] ([Product], [Year], [Color], [Size]) VALUES (N'Shirt', 2011, N'WHITE', NULL)
INSERT [dbo].[Products] ([Product], [Year], [Color], [Size]) VALUES (N'Shirt', 2012, NULL, NULL)
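And just to be concrete about what I mean by "exploding" it out, a query along these lines (purely a sketch, and assuming every distinct non-NULL Color and Size really does apply to every Year) would produce the 24-rows-per-year result:
-- Sketch only: rebuild every Product/Year/Color/Size combination from the
-- column-wise table above by cross joining the distinct non-NULL values.
SELECT p.Product, y.[Year], c.[Color], s.[Size]
FROM (SELECT DISTINCT Product FROM dbo.Products) AS p
CROSS JOIN (SELECT DISTINCT [Year] FROM dbo.Products WHERE [Year] IS NOT NULL) AS y
CROSS JOIN (SELECT DISTINCT [Color] FROM dbo.Products WHERE [Color] IS NOT NULL) AS c
CROSS JOIN (SELECT DISTINCT [Size] FROM dbo.Products WHERE [Size] IS NOT NULL) AS s
ORDER BY y.[Year], c.[Color], s.[Size];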
Mike Scalise, PMP
https://www.michaelscalise.com
February 24, 2011 at 10:26 am
Usually there is more to the tables than that, so Products, Sizes, Colors, and Years tables would make sense.
Normalize it for future issues and flexibility.
The size of the xref table will be the same either way, in terms of rows.
February 24, 2011 at 10:30 am
It seems odd though that I'd have to normalize it just to turn around and denormalize again, right?
And yes, I agree the xref table will have the same number of rows, but it's almost a necessary evil if I want to get it into a proper denormalized form -- again, unless there's a way of querying (or maybe a function) that would explode everything out in the original table.
Mike Scalise, PMP
https://www.michaelscalise.com
February 24, 2011 at 11:42 am
I would designate the values you have for Products (Shirt), Colors (Green, Blue, etc.), and Sizes (S, M, L, etc.) as "natural keys". Surrogate keys (IDENTITY) are my preferred way to identify these entities since they typically take up less space than the corresponding natural key and therefore (generally) improve performance by requiring fewer resources (see below for a "that said").
If you add a table to store colors, then Green may be known as the number 1, Blue as 2, Red as 3, etc. If you choose tinyint as your "color ID", then you only need 1 byte to say something is Blue, as opposed to the 4 bytes for the characters in 'BLUE' plus the length overhead that comes with VARCHAR. This will save you disk space for data and indexes, as well as memory when working with that data in the database.
That said, you have to balance those savings against the cost of always joining your bridge tables to your lookup tables to reconstruct your denormalized data set. There is a reason why databases used mostly for reporting have more denormalized tables than, say, a database serving an OLTP application. In general, a highly normalized data model will perform better for OLTP, so which direction you should choose depends on what you're doing.
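A minimal sketch of what that might look like (the names and data types here are just for illustration, not from your actual system):
-- Sketch only: a color lookup keyed by a 1-byte surrogate,
-- plus a bridge table that stores the ID instead of the text.
CREATE TABLE dbo.Colors (
    ColorID   tinyint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ColorName varchar(25) NOT NULL UNIQUE
);
CREATE TABLE dbo.ProductColors (
    ProductID int     NOT NULL,  -- FK to a Products lookup, not shown here
    ColorID   tinyint NOT NULL REFERENCES dbo.Colors (ColorID)
);
-- Getting the readable value back means joining to the lookup:
-- SELECT pc.ProductID, c.ColorName
-- FROM dbo.ProductColors AS pc
-- JOIN dbo.Colors AS c ON c.ColorID = pc.ColorID;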
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
February 24, 2011 at 12:09 pm
opc.three (2/24/2011)
I would designate the values you have for Products (Shirt), Colors (Green, Blue, etc.), and Sizes (S, M, L, etc.) as "natural keys". Surrogate keys (IDENTITY) are my preferred way to identify these entities since they typically take up less space than the corresponding natural key and therefore (generally) improve performance by requiring fewer resources (see below for a "that said").
If you add a table to store colors, then Green may be known as the number 1, Blue as 2, Red as 3, etc. If you choose tinyint as your "color ID", then you only need 1 byte to say something is Blue, as opposed to the 4 bytes for the characters in 'BLUE' plus the length overhead that comes with VARCHAR. This will save you disk space for data and indexes, as well as memory when working with that data in the database.
That said, you have to balance those savings against the cost of always joining your bridge tables to your lookup tables to reconstruct your denormalized data set. There is a reason why databases used mostly for reporting have more denormalized tables than, say, a database serving an OLTP application. In general, a highly normalized data model will perform better for OLTP, so which direction you should choose depends on what you're doing.
opc,
This table was just for demonstration purposes. My real table does have an ID that serves as the identity and primary key, so I know how I would normalize it, and I understand that doing so would make the most sense if this were being used for OLTP. However, the ultimate purpose of this is reporting, so I still wonder if the best approach here is to 1) normalize the given table (which I have no control over how it's created), 2) join the normalized tables, and 3) save the resulting (denormalized) data set in the data warehouse.
That seems to be the way that makes the most sense. Do you agree?
Mike Scalise, PMP
https://www.michaelscalise.com
February 24, 2011 at 12:25 pm
As with most things, it will depend on a lot of factors...more than we can discuss in this forum. However, as an initial direction for supporting reporting functions, denormalizing the table would be a reasonable choice.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
February 24, 2011 at 1:41 pm
What you're hitting here is the age-old conflict over how best to architect data warehouses. One approach is the "star/snowflake schema"; the other is "normalized, but not fully". (There are real names for them and a lot of material, if you look up data warehousing online. Wikipedia had good articles on it the last time I looked, but that is of course subject to change without notice.)
For reporting, you'd either want to denormalize like crazy, or turn it into a snowflake, in this case.
You obviously won't be reporting "what combinations were available in a given year", if the answer is always, "all of them". What you'd more likely be reporting would be sales volume, overstock/understock volume, sales by month/season, sales by region/country/store, and so on.
Right now, this looks like homework, because of the simplicity. A real database on the subject would almost never be limited to "shirt" as a valid product description. If it is homework, take a good look at some articles or books on data warehousing, and design something based on those.
If it's real, it would be helpful to have more details on what kind of reports you actually anticipate running. The type of reporting makes a big difference in which warehousing methodology to pick.
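Just to give a rough picture, the bones of a star schema for the shirt example might look something like this. Every table and column name here is hypothetical, a sketch rather than a recommendation:
-- Hypothetical sketch of a star schema: one fact table of sales
-- surrounded by the dimension tables it references.
CREATE TABLE dbo.DimProduct (
    ProductKey int IDENTITY(1,1) PRIMARY KEY,
    Product varchar(50), Color varchar(25), Size varchar(10)
);
CREATE TABLE dbo.DimDate  (DateKey int PRIMARY KEY, [Year] int, [Month] int);
CREATE TABLE dbo.DimStore (StoreKey int IDENTITY(1,1) PRIMARY KEY, Region varchar(50), Country varchar(50));
CREATE TABLE dbo.FactSales (
    ProductKey  int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    DateKey     int NOT NULL REFERENCES dbo.DimDate (DateKey),
    StoreKey    int NOT NULL REFERENCES dbo.DimStore (StoreKey),
    Quantity    int NOT NULL,
    SalesAmount money NOT NULL
);
-- A typical report (sales volume by year and color) would then be:
-- SELECT d.[Year], p.Color, SUM(f.Quantity) AS Units
-- FROM dbo.FactSales AS f
-- JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
-- JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
-- GROUP BY d.[Year], p.Color;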
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
February 24, 2011 at 2:38 pm
It's not homework; it has to do with academic programs.
Shirt = Program (e.g. BS in Math)
Year = Academic Year
Color = Major
Size = Specialization
The number of rows per program in the original table will be the highest of count(academic years), count(majors), or count(specializations). So if there are 10 years, 2 majors, and 2 specializations, there will be 10 rows for that program, with values for majors and specializations in rows 1-2 and NULL in rows 3-10. Each of the fields independently applies to the Program field. It's almost as if each record doesn't even matter, and the table has to be looked at from a column perspective.
To remove the NULLS in the fields, I'd have to pair each field with the Program field:
Program/Academic Year, Program/Major, Program/Specialization, etc.
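For example, something like this, as a sketch only (the table and column names are just stand-ins based on the mapping above):
-- Sketch: each non-NULL column paired back to Program independently,
-- treating the table column-wise rather than row-wise.
SELECT DISTINCT Program, AcademicYear   FROM dbo.Programs WHERE AcademicYear   IS NOT NULL;
SELECT DISTINCT Program, Major          FROM dbo.Programs WHERE Major          IS NOT NULL;
SELECT DISTINCT Program, Specialization FROM dbo.Programs WHERE Specialization IS NOT NULL;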
GSquared, I know there are many ways to design a data warehouse, and I'm sure I'll be doing some research into that, but I was really just trying to see if anyone had come across a storage method similar to the one I originally described, and if, regardless of the data warehouse model I use, it makes sense to get it out of its current form.
Mike
Mike Scalise, PMP
https://www.michaelscalise.com
February 24, 2011 at 2:42 pm
Yes, the format you describe in the original post is used quite frequently in reporting databases. Normalization doesn't matter as much as query speed in those. Very common model.
Whether it makes sense to keep it that way will depend a lot on how you're going to use it.
If, for example, data mining will be done where those dimensions become independent of each other, then that table structure should be avoided and a snowflake schema used instead. That will better support things like pivot tables with dynamic row/column/aggregate formats.
If, on the other hand, it will mainly be used in flat reports, without dynamic content, but with variable aggregates, then it will make sense to keep it as flat as possible.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
February 25, 2011 at 10:02 am
GSquared, how would one report off of a structure like that? I suppose you could use derived tables, only selecting 2 columns at a time. For example:
SELECT program, major
FROM Programs
WHERE major IS NOT NULL
right?
Is there a cleaner way to do it? It seems like that'd be a lot of queries against the database to get all of the information you'd need to build a report, defeating the purpose of a data warehouse. I know you said that this method of storing multiple values is common. Does it have a name, so I can research how people report off of it?
Mike Scalise, PMP
https://www.michaelscalise.com
February 25, 2011 at 11:13 am
GSquared already mentioned it in a previous post:
GSquared (2/24/2011)
What you're hitting here is the age-old conflict over how best to architect data warehouses. One approach is the "star/snowflake schema"; the other is "normalized, but not fully". (There are real names for them and a lot of material, if you look up data warehousing online. Wikipedia had good articles on it the last time I looked, but that is of course subject to change without notice.)
For reporting, you'd either want to denormalize like crazy, or turn it into a snowflake, in this case.
Have a look online for "star schema for data warehouse".
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
February 28, 2011 at 2:05 pm
I don't think a Star Schema accurately reflects what I'm seeing in my table. Are you sure that's the type that's used in my situation? Again, I have a lot of NULLs in my table and haven't been able to find a setup that's similar. I know you guys said it was fairly common.
Mike Scalise, PMP
https://www.michaelscalise.com
February 28, 2011 at 2:12 pm
You were asking about different approaches...a denormalized table is what you described in the hypothetical table in your original post. Implementing a star schema is an alternative to that. Check it out online...it's more complex than the denormalized table approach, but, as is usually the reward for added complexity, you'll have more flexibility and power over how you can query your data.
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato
February 28, 2011 at 2:28 pm
In one sense, yes, I was asking about data warehousing approaches, but what I was really looking for is an explanation of the type of table I described and the best way to query it. You say that my table is indicative of a denormalized approach to data warehousing, but in my experience, a denormalized data warehouse has a lot of data redundancy. My table has no duplicated values with the exception of the Program. It's not fully denormalized, and it's not part of a star/snowflake schema. As I said before, the number of rows for each Program in the table is the highest of count() on each of the related fields; the rest of the fields are NULL wherever the row number exceeds their count.
Mike Scalise, PMP
https://www.michaelscalise.com
February 28, 2011 at 2:42 pm
From what I can see of your table, you have essentially a full cartesian join of three tables: Program, Year, Major. That means you have a number of rows equal to the number of rows in each table multiplied together. If you have 3 programs, 5 years, 2 majors, you have 30 rows (3 * 5 * 2). That's what it looks like to me.
Per what I'm reading in your statement, you'd have 5 rows (the number of rows in the largest table).
I'm obviously missing something, but I don't know what.
In order to help, I need a clearer definition of the tables involved and the data in them. Right now, it looks like the usual denormalized data I've seen in dozens of data warehousing solutions. Per your statements, it's not like that at all. I need to see more in order to understand how/why.
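As a sketch of what I mean by the cartesian join (table and column names assumed from your description):
-- Sketch: a full cartesian (cross) join of the three value sets,
-- which is where a 3 * 5 * 2 = 30 row count would come from.
SELECT p.Program, y.AcademicYear, m.Major
FROM (SELECT DISTINCT Program FROM dbo.Programs WHERE Program IS NOT NULL) AS p
CROSS JOIN (SELECT DISTINCT AcademicYear FROM dbo.Programs WHERE AcademicYear IS NOT NULL) AS y
CROSS JOIN (SELECT DISTINCT Major FROM dbo.Programs WHERE Major IS NOT NULL) AS m;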
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon