April 6, 2011 at 4:52 pm
Hello All,
I am attempting to design a model for a marketing/publishing application. The data once collected will be pushed out to presentations, printed material, and a website. Much of the data exists elsewhere and will need to be extracted from the primary systems into the system I am designing. Other data will be entered manually. I will need to keep historical data (not captured in my model) but really the data being used will be the current data. I don't expect performance issues with the amount of data I foresee.
See two proposed data models below. The second model seems more traditional to me but my experience is mostly with transactional systems. The first model seems more flexible to me.
The quantity of metrics is much greater than what I am presenting here. New metrics will be added fairly continuously.
My thinking with model 2 is that if someone says to me 'I want to add the number of parks', I can just add 'Number of parks' to the Metrics table and not have to modify the City table, which I would have to do in model 1.
It would be great to hear comments on this. Model 2 seems workable, but not having worked on a model like this before, I am quite unsure.
One drawback I can clearly see is that a single column could have both numeric and text data.
Another is: what do I do if there is a third tier of data? A SubSubDatapoint.
How extreme can I take this? Should I add a Metric called 'People' with an ID of 5 and then have:
5,'Mary','First'
5,'Jones','Last'
5,'5555558888','HomePhone'
...
It seems like I can jam every database I have seen into this model.
Does the type of model in model2 have a name? Is that name ridiculous?
--Model1
CREATE TABLE Metrics
(
MetricID int,
MetricName varchar(30),
MetricCategory varchar(20)
)
INSERT INTO Metrics
SELECT 1, 'Name','City' UNION
SELECT 2, 'Population','City' UNION
SELECT 3, 'Age Distribution','City' UNION
SELECT 4, 'Average Monthly Temperature','City'
CREATE TABLE MetricData
(
MetricID int,
Datapoint varchar(20),
SubDatapoint varchar(20)
)
INSERT INTO MetricData
SELECT 1, 'Nowheresville','' UNION
SELECT 2, '100','' UNION
SELECT 3, '1','Under 25' UNION
SELECT 3, '5','25-50' UNION
SELECT 3, '4','51-70' UNION
SELECT 3, '90','71+' UNION
SELECT 4, '13','Jan' UNION
SELECT 4, '14','Feb' UNION
SELECT 4, '15','Mar'
SELECT * FROM Metrics
SELECT * FROM MetricData
DROP TABLE Metrics
DROP TABLE MetricData
--Model2
CREATE TABLE City
(
Name varchar(20),
Population int
)
INSERT INTO City
SELECT 'Nowheresville',100
CREATE TABLE CityPopulation
(
Population int,
Range varchar(10)
)
INSERT INTO CityPopulation
SELECT '1','Under 25' UNION
SELECT '5','25-50' UNION
SELECT '4','51-70' UNION
SELECT '90','71+'
CREATE TABLE CityMonthlyAverageTemperature
(
Temperature int,
Month varchar(10)
)
INSERT INTO CityMonthlyAverageTemperature
SELECT '13','Jan' UNION
SELECT '14','Feb' UNION
SELECT '15','Mar'
SELECT * FROM City
SELECT * FROM CityPopulation
SELECT * FROM CityMonthlyAverageTemperature
DROP TABLE City
DROP TABLE CityPopulation
DROP TABLE CityMonthlyAverageTemperature
April 6, 2011 at 11:27 pm
Chrissy321 (4/6/2011)
See two proposed data models below. The second model seems more traditional to me but my experience is mostly with transactional systems. The first model seems more flexible to me.
Flexible, yes. Two things to think about. 1) How much data are we really talking about here, and 2) if this works, how quickly is this going to grow out of initial scope when other people adopt it?
My thinking with model 2 is that if someone says to me 'I want to add the number of parks', I can just add 'Number of parks' to the Metrics table and not have to modify the City table, which I would have to do in model 1.
I think from here on in you swapped models 2 and 1 back and forth in your descriptions.
Let's call these models: Normal Form and One True Lookup Table.
If you've been banging around the forums a bit you'll already see a problem here. 🙂
The OTLT (not my definition, but I've happily stolen it) is a known performance drag. It comes from an XML style of thinking, and yes, it's quite flexible. Now, try to do any math. Numeric and varchar data do not mix, because you cannot control the order in which a query will apply operators. It may sum everything and then apply your WHERE clause (in which case you error), or it may apply the WHERE clause and then sum (in which case you're fine). You can never be sure without going through some hoops.
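For instance, against the Model 1 MetricData table above, a query like this may error or may run clean depending on the plan the optimizer picks (the MetricID 3 rows are numeric, the MetricID 1 row is text):
--may fail: the CAST can be evaluated against the 'Nowheresville' row
--before the WHERE clause has removed it
SELECT SUM(CAST(Datapoint AS int)) AS AgeDistributionTotal
FROM MetricData
WHERE MetricID = 3
--one of the usual hoops: the CASE only converts rows that pass the test
SELECT SUM(CASE WHEN MetricID = 3 THEN CAST(Datapoint AS int) END) AS AgeDistributionTotal
FROM MetricData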
Add to this the fact that when you're trying to get at one or two specific types of data, you have to churn through a monster table. You could easily scan a 50-row table, but now you have to seek (or possibly scan) the monster to find your 50 rows.
From a convenience perspective, the OTLT design is very convenient. Right up to about a thousand rows. Then you're going to want your desk padded for when you beat your head on it.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
Twitter: @AnyWayDBA
April 7, 2011 at 8:56 am
Yes I did mix up my descriptions as you noticed. Your terms make a lot more sense.
Your concerns are definitely valid. How much data? A thousand rows of current data is probably about right. The most recent data is primarily the concern. It's not a data mining project (in its current scope).
I'd probably have a Sessions table with a SessionID and a timestamp. Every time I add data I'd create a Session and then populate the MetricData table with that SessionID.
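Something minimal like this is what I have in mind for the Sessions table (just a sketch; the column names and defaults are placeholders):
CREATE TABLE Sessions
(
SessionID int IDENTITY(1,1),
LoadTime datetime DEFAULT GETDATE()
)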
So my queries would look like this:
--get the most recent population
SELECT
Datapoint
FROM MetricData
WHERE MetricID = 2
AND SessionID = (SELECT MAX(SessionID) FROM MetricData WHERE MetricID = 2)
So performance is not a huge concern. The data inserted into MetricData would be summarized data, so I don't anticipate a lot of calculating.
Can you tell I am trying to talk myself into this...
I have heard of OTLT and have never gone down that road; I thought OTLT was strictly a lookup strategy. I seem to be going further by contemplating storing my actual data in this structure.
April 7, 2011 at 11:22 am
Chrissy321 (4/7/2011)
--get the most recent population
SELECT
Datapoint
FROM MetricData
WHERE MetricID = 2
AND SessionID = (SELECT MAX(SessionID) FROM MetricData WHERE MetricID = 2)
If you can avoid this I would. There are times things like this are necessary, but you would typically prefer to avoid them.
So performance is not a huge concern. The data inserted into MetricData would be summarized data, so I don't anticipate a lot of calculating.
That may actually be worse.
Can you tell I am trying to talk myself into this...
I have heard of OTLT and have never gone down that road; I thought OTLT was strictly a lookup strategy. I seem to be going further by contemplating storing my actual data in this structure.
Nope, the OTLT is a storage strategy; you then pound your head on the desk because of the lookup methodology necessary to make it work well.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
Twitter: @AnyWayDBA
April 7, 2011 at 11:29 am
>>If you can avoid this I would. There are times things like this are necessary, but you would typically prefer to avoid them.
How would I avoid them? Could I have a table that always holds the most recent data, and when new data is generated the old data is pushed into an archive table?
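Something like this is what I'm picturing (just a sketch, and the table names are placeholders):
--holds only the most recent data, so reads never need MAX(SessionID)
CREATE TABLE MetricDataCurrent
(
MetricID int,
Datapoint varchar(20),
SubDatapoint varchar(20)
)
--everything that has been replaced, stamped with the session that loaded it
CREATE TABLE MetricDataArchive
(
SessionID int,
MetricID int,
Datapoint varchar(20),
SubDatapoint varchar(20)
)
Each load would copy the current rows into the archive with their SessionID and then overwrite MetricDataCurrent.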
Thanks for your comments, I am definitely in pause and reflect mode rather than full steam ahead.
April 7, 2011 at 12:10 pm
http://www.sqlservercentral.com/articles/Database+Design/62386/
Here's the article I have been looking for.
April 7, 2011 at 12:26 pm
Chrissy321 (4/7/2011)
>>If you can avoid this I would. There are times things like this are necessary, but you would typically prefer to avoid them.
How would I avoid them? Could I have a table that always holds the most recent data, and when new data is generated the old data is pushed into an archive table?
Thanks for your comments, I am definitely in pause and reflect mode rather than full steam ahead.
Yeah, that article is pretty on target. Sorry about the 'worse' comment earlier; I misread what you meant about summarizing the data beforehand. It'll be about the same for that.
As to the SessionID, I'd have to have a better idea of what you're trying to do. Are you trying to log the same information just with multiple days? Kind of like a running log file of activity at certain points? As an example of what I'm poorly describing, think of a barometer measuring pressure and logging it every 15 minutes.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
Twitter: @AnyWayDBA
April 7, 2011 at 1:37 pm
Actually the SessionID would probably just be useful for tracking the insertion time of the data so I might have driven off the track a bit there.
My model is largely based on monthly or quarterly publishing/reporting of information. So my values might look like this.
CREATE TABLE AttributeValues
(
SessionID int,
AttributeID int,
Value varchar(20),
GoodThruDate datetime
)
So if I wanted to publish data for a specific attribute for a particular month I might use:
SELECT
Value
FROM AttributeValues
WHERE
AttributeID = 2 AND
SessionID =
(
SELECT MAX(SessionID)
FROM AttributeValues
WHERE
AttributeID = 2 AND
GoodThruDate = '3/31/2011'
)
So get the most recently published data for Attribute 2 for March 2011.
The max SessionID could be used for revisions/corrections, that is, get the most recent value for the specified GoodThruDate, assuming the last data inserted is the correct data.
Sorry I'm thinking out loud here but hey I'm in the database design forum.
Also, my target data is financial analytics, not my hypothetical city. In this world folks can make up new data pretty easily. And then they want to modify the analytic (and keep the old one(s)), or three different people have different interpretations of what X means, leading to Bob's X and Joe's X, etc. This is what led me to this model.
Now that I know what to call this (EAV) I'm light years ahead in understanding what I may be stepping into!
April 8, 2011 at 1:40 pm
Chrissy321 (4/7/2011)
The max SessionID could be used for revisions/corrections, that is, get the most recent value for the specified GoodThruDate, assuming the last data inserted is the correct data.
Work's quite hectic at the moment, so sorry about the delay in this conversation. So, you are definitely producing the equivalent of a logging/historical audit table. Yes, you're kind of stuck with that mechanic then. I personally might split your attribute 2 out to another table to speed the lookups that find the correct SessionID, THEN come back to the EAV table to pull your results, just out of self-defense. The EAV table is going to be rather large to go polling in that method.
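Something like this is what I mean, with made-up names (LatestSession and its columns aren't from anything you've posted; the query runs against your AttributeValues table above):
--narrow table that answers 'which session do I want?' cheaply
CREATE TABLE LatestSession
(
AttributeID int,
GoodThruDate datetime,
SessionID int
)
--find the session in the small table first, THEN hit the big EAV table
SELECT av.Value
FROM AttributeValues av
INNER JOIN LatestSession ls
ON ls.SessionID = av.SessionID
AND ls.AttributeID = av.AttributeID
WHERE ls.AttributeID = 2
AND ls.GoodThruDate = '3/31/2011'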
Sorry I'm thinking out loud here but hey I'm in the database design forum.
Also, my target data is financial analytics, not my hypothetical city. In this world folks can make up new data pretty easily. And then they want to modify the analytic (and keep the old one(s)), or three different people have different interpretations of what X means, leading to Bob's X and Joe's X, etc. This is what led me to this model.
Now that I know what to call this (EAV) I'm light years ahead in understanding what I may be stepping into!
If you're dealing with financials, definitely create a table that's purely numeric (decimal(38,8) is probably a good baseline) and another for text if you're going to pursue this. You will feel less pain, though you will have to help train users to know where to put data. Either that, or dual-column the EAV table so that you aren't trying to fight with ISNUMERIC() query issues.
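Just to show what that fight looks like: ISNUMERIC() will say yes to plenty of strings that still won't convert to the type you actually want.
SELECT ISNUMERIC('$')   --returns 1, but CAST('$' AS decimal(38,8)) fails
SELECT ISNUMERIC('.')   --returns 1, but CAST('.' AS decimal(38,8)) fails
SELECT ISNUMERIC('1e4') --returns 1, but CAST('1e4' AS decimal(38,8)) fails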
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
Twitter: @AnyWayDBA
April 11, 2011 at 2:21 pm
This is powerful stuff. I still haven't fully committed to going down this rabbit hole.
I like the idea about splitting the numeric and character values.
The attribute 2, or SubDatapoint as I previously called it, is disconcerting. See my revised model below, which eliminates this. In this model the attribute can have a category. If there is a greater hierarchy I'll push the data into a traditional table.
My design is already going to be a hybrid. My entities themselves will be traditional and there will be additional normalized tables.
Since I wouldn't want to tackle hierarchies and EAV at the same time a single category will hopefully be sufficient.
--New Model
CREATE TABLE Attributes
(
AttributeID int,
AttributeCategory varchar(30),
AttributeName varchar(30),
AttributeDataType varchar(10)
)
INSERT INTO Attributes
SELECT 1, 'Population Distribution', 'Under 25','Numeric' UNION
SELECT 2, 'Population Distribution', '25-50' ,'Numeric' UNION
SELECT 3, 'Population Distribution', '51-70','Numeric' UNION
SELECT 4, 'Population Distribution', '51-70','Numeric' UNION
SELECT 5, 'Population Distribution', '71+','Numeric'
CREATE TABLE AttributeValues
(
EntityID int,
AttributeID int,
Value_Numeric decimal (38,8),
Value_Character varchar (25)
)
INSERT INTO AttributeValues
SELECT 1,1,20,NULL UNION
SELECT 1,2,30,NULL UNION
SELECT 1,3,15,NULL UNION
SELECT 1,4,5,NULL UNION
SELECT 1,5,30,NULL
SELECT * FROM Attributes
SELECT * FROM AttributeValues
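--a sketch of the payoff: math against Value_Numeric needs no conversion tricks
SELECT a.AttributeName, v.Value_Numeric
FROM AttributeValues v
INNER JOIN Attributes a ON a.AttributeID = v.AttributeID
WHERE v.EntityID = 1 AND a.AttributeCategory = 'Population Distribution'
ORDER BY a.AttributeID
SELECT SUM(Value_Numeric) AS TotalPercent --100 for EntityID 1
FROM AttributeValues
WHERE EntityID = 1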
DROP TABLE Attributes
DROP TABLE AttributeValues