Need help designing a data model

  • Hello All,

    I am attempting to design a model for a marketing/publishing application. The data once collected will be pushed out to presentations, printed material, and a website. Much of the data exists elsewhere and will need to be extracted from the primary systems into the system I am designing. Other data will be entered manually. I will need to keep historical data (not captured in my model) but really the data being used will be the current data. I don't expect performance issues with the amount of data I foresee.

    See two proposed data models below. The second model seems more traditional to me but my experience is mostly with transactional systems. The first model seems more flexible to me.

    The quantity of metrics is much greater than what I am presenting here. New metrics will be added fairly continuously.

    My thinking with model 2 is that if someone says to me 'I want to add the number of parks', I can just add 'Number of parks' to the Metrics table and not have to modify the City table, which I would have to do in model 1.

    It would be great to hear comments on this. Model 2 seems workable but not having worked on a model like this before I am quite unsure.

    One drawback I can clearly see is that a single column could have both numeric and text data.

    Another is, what do I do if there is a third tier of data? SubSubDataPoint.

    How extreme can I take this? Should I add a Metric called 'People' with an ID of 5 and then have:

    10,'Mary','First'

    10,'Jones','Last'

    10,'5555558888','HomePhone'

    ...

    It seems like I can jam every database I have seen into this model.

    Does the type of model in model2 have a name? Is that name ridiculous?

    --Model1

    CREATE TABLE Metrics
    (
    MetricID int,
    MetricName varchar(30),
    MetricCategory varchar(20)
    )

    INSERT INTO Metrics
    SELECT 1, 'Name', 'City' UNION
    SELECT 2, 'Population', 'City' UNION
    SELECT 3, 'Age Distribution', 'City' UNION
    SELECT 4, 'Average Monthly Temperature', 'City'

    CREATE TABLE MetricData
    (
    MetricID int,
    Datapoint varchar(20),
    SubDatapoint varchar(20)
    )

    INSERT INTO MetricData
    SELECT 1, 'Nowheresville', '' UNION
    SELECT 2, '100', '' UNION
    SELECT 3, '1', 'Under 25' UNION
    SELECT 3, '5', '25-50' UNION
    SELECT 3, '4', '51-70' UNION
    SELECT 3, '90', '71+' UNION
    SELECT 4, '13', 'Jan' UNION
    SELECT 4, '14', 'Feb' UNION
    SELECT 4, '15', 'March'

    SELECT * FROM Metrics
    SELECT * FROM MetricData

    DROP TABLE Metrics
    DROP TABLE MetricData

    --Model2

    CREATE TABLE City
    (
    Name varchar(20),
    Population int
    )

    INSERT INTO City
    SELECT 'Nowheresville', 100

    CREATE TABLE CityPopulation
    (
    Population int,
    Range varchar(10)
    )

    INSERT INTO CityPopulation
    SELECT 1, 'Under 25' UNION
    SELECT 5, '25-50' UNION
    SELECT 4, '51-70' UNION
    SELECT 90, '71+'

    CREATE TABLE CityMonthlyAverageTemperature
    (
    Temperature int,
    Month varchar(10)
    )

    INSERT INTO CityMonthlyAverageTemperature
    SELECT 13, 'Jan' UNION
    SELECT 14, 'Feb' UNION
    SELECT 15, 'Mar'

    SELECT * FROM City
    SELECT * FROM CityPopulation
    SELECT * FROM CityMonthlyAverageTemperature

    DROP TABLE City
    DROP TABLE CityPopulation
    DROP TABLE CityMonthlyAverageTemperature

  • Chrissy321 (4/6/2011)


    See two proposed data models below. The second model seems more traditional to me but my experience is mostly with transactional systems. The first model seems more flexible to me.

    Flexible, yes. Two things to think about. 1) How much data are we really talking about here, and 2) if this works, how quickly is this going to grow out of initial scope when other people adopt it?

    My thinking with model 2 is that if someone says to me 'I want to add the number of parks', I can just add 'Number of parks' to the Metrics table and not have to modify the City table, which I would have to do in model 1.

    I think from here on you swapped models 2 and 1 back and forth in your descriptions.

    Let's call these models: Normal Form and One True Lookup Table.

    If you've been banging around the forums a bit you'll already see a problem here. 🙂

    The OTLT (not my definition, but I've happily stolen it) is a known performance drag. It comes from an XML style of thinking and yes, it's quite flexible. Now, try to do any math. Numeric and varchar data do not mix because you cannot control the order in which a query will apply operators. It may sum everything and then apply your WHERE clause (in which case you error), or it may apply the WHERE clause and then sum (in which case you're fine). You can never be sure without going through some hoops.
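    To make that concrete, here's a minimal sketch (the table and data are hypothetical, not part of the proposed models) of how a mixed varchar column can blow up under aggregation:

```sql
-- Hypothetical OTLT-style table holding numbers and text in one varchar column
CREATE TABLE MixedData
(
MetricID int,
Datapoint varchar(20)
)

INSERT INTO MixedData
SELECT 1, 'Nowheresville' UNION  -- genuinely textual data
SELECT 2, '100'                  -- numeric data stored as text

-- Looks safe, but the optimizer is free to evaluate the CAST before the
-- WHERE filter, so this can still fail with a conversion error on 'Nowheresville':
SELECT SUM(CAST(Datapoint AS int))
FROM MixedData
WHERE MetricID = 2

DROP TABLE MixedData
```

    (On SQL Server 2012 and later, TRY_CAST returns NULL instead of erroring, but the underlying type ambiguity remains.)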

    Add to this that when you're trying to get at one or two specific types of data, you have to churn through a monster. You could easily scan a 50-row table, but now you have to seek (or possibly scan) the monster to find your 50 rows.

    From a convenience perspective, the OTLT design is very convenient. Right up to about a thousand rows. Then you're going to want your desk padded for when you beat your head on it.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Yes I did mix up my descriptions as you noticed. Your terms make a lot more sense.

    Your concerns are definitely valid. How much data? A thousand rows of current data is probably about right. The most recent data is primarily the concern. It's not a data mining project (in its current scope).

    I'd probably have a Sessions table with a SessionID and a timestamp. Every time I add data I'd create a session and then populate the MetricData table with that SessionID.
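    That Sessions table might look like this (a sketch only; the IDENTITY key and the default timestamp are my assumptions, not settled design):

```sql
CREATE TABLE Sessions
(
SessionID int IDENTITY(1,1),          -- assumed surrogate key, one per load
CreatedOn datetime DEFAULT GETDATE()  -- timestamp of when the data was inserted
)
```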

    So my queries would look like:

    --get the most recent population
    SELECT
    Datapoint
    FROM MetricData
    WHERE SessionID = (SELECT MAX(SessionID) FROM MetricData WHERE MetricID = 2)

    So performance is not a huge concern. The data inserted into MetricData would be summarized data, so I don't anticipate a lot of calculating.

    Can you tell I am trying to talk myself into this...

    I have heard of OTLT but have never gone down that road; I thought OTLT was strictly a lookup strategy. I seem to be going further by contemplating storing my actual data in this structure.

  • Chrissy321 (4/7/2011)


    --get the most recent population
    SELECT
    Datapoint
    FROM MetricData
    WHERE SessionID = (SELECT MAX(SessionID) FROM MetricData WHERE MetricID = 2)

    If you can avoid this I would. There are times things like this are necessary, but you would typically prefer to avoid them.
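    One common way to sidestep the MAX() subquery, assuming the same MetricData columns, is TOP 1 with ORDER BY:

```sql
-- Get the most recent population without a MAX() subquery
SELECT TOP 1 Datapoint
FROM MetricData
WHERE MetricID = 2
ORDER BY SessionID DESC
```

    With an index on (MetricID, SessionID) this can resolve as a single seek rather than two passes over the table.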

    So performance is not a huge concern. The data inserted into MetricData would be summarized data, so I don't anticipate a lot of calculating.

    That may actually be worse.

    Can you tell I am trying to talk myself into this...

    I have heard of OTLT and have never gone down that road, I thought OTLT was strictly a lookup strategy. I seem to be going further by contemplating storing my actual data in this structure.

    Nope, the OTLT is a storage strategy that then has you pounding your head on the desk because of the lookup methodology necessary to make it work well.


    - Craig Farrell


    >>If you can avoid this I would. There are times things like this are necessary, but you would typically prefer to avoid them.

    How would I avoid them? Could I have a table that always holds the most recent data, and when new data is generated the old data is pushed into an archive table?
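    A minimal sketch of that current/archive idea (the table names are hypothetical):

```sql
-- Current table holds only the live row per metric; archive keeps history
CREATE TABLE MetricDataCurrent (MetricID int, Datapoint varchar(20))
CREATE TABLE MetricDataArchive (MetricID int, Datapoint varchar(20), ArchivedOn datetime)

-- On each refresh: copy the old row to the archive, then replace it
INSERT INTO MetricDataArchive
SELECT MetricID, Datapoint, GETDATE()
FROM MetricDataCurrent
WHERE MetricID = 2

DELETE FROM MetricDataCurrent WHERE MetricID = 2

INSERT INTO MetricDataCurrent
SELECT 2, '110'
```

    Queries for "the most recent value" then become plain lookups against the small current table, with no MAX() subquery needed.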

    Thanks for your comments. I am definitely in pause-and-reflect mode rather than full-steam-ahead.

  • http://www.sqlservercentral.com/articles/Database+Design/62386/

    Here's the article I have been looking for.

  • Chrissy321 (4/7/2011)


    >>If you can avoid this I would. There are times things like this are necessary, but you would typically prefer to avoid them.

    How would I avoid them? Could I have a table that always holds the most recent data, and when new data is generated the old data is pushed into an archive table?

    Thanks for your comments. I am definitely in pause-and-reflect mode rather than full-steam-ahead.

    Yeah, that article is pretty much on target. Sorry about the 'worse' comment earlier; I misread what you meant about summarizing the data beforehand. It'll be about the same for that.

    As to the SessionID, I'd have to have a better idea of what you're trying to do. Are you trying to log the same information just with multiple days? Kind of like a running log file of activity at certain points? For an example of what I'm poorly saying, think of a barometer measuring pressure and logging it every 15 minutes.


    - Craig Farrell


    Actually the SessionID would probably just be useful for tracking the insertion time of the data, so I might have driven off the track a bit there.

    My model is largely based on monthly or quarterly publishing/reporting of information. So my values might look like this:

    CREATE TABLE AttributeValues
    (
    SessionID int,
    AttributeID int,
    Value varchar(20),
    GoodThruDate datetime
    )

    So if I wanted to publish data for a specific attribute for a particular month I might use:

    SELECT
    Value
    FROM AttributeValues
    WHERE SessionID =
    (
    SELECT MAX(SessionID)
    FROM AttributeValues
    WHERE
    AttributeID = 2 AND
    GoodThruDate = '3/31/2011'
    )

    So get the most recently published data for Attribute 2 for March 2011.

    The max SessionID could be used for revisions/corrections; that is, get the most recent value for the specified GoodThruDate, assuming the last data inserted is the correct data.

    Sorry I'm thinking out loud here but hey I'm in the database design forum.

    Also, my target data is financial analytics, not my hypothetical city. In this world folks can make up new data pretty easily. And then they want to modify the analytic (and keep the old one(s)), or three different people have different interpretations of what x means, leading to Bob's x and Joe's x, etc. This is what led me to this model.

    Now that I know what to call this (EAV) I'm light years ahead in understanding what I may be stepping into!

  • Chrissy321 (4/7/2011)


    The max SessionID could be used for revisions/corrections; that is, get the most recent value for the specified GoodThruDate, assuming the last data inserted is the correct data.

    Work's quite hectic at the moment, so sorry about the delay in this conversation. So, you are definitely producing the equivalent of a logging/historical audit table. Yes, you're kind of stuck with that mechanic then. I personally might split your attribute 2 out to another table to speed the lookups that find the correct SessionID, THEN come back to the EAV table to pull your results, just out of self-defense. The EAV table is going to be rather large to go polling in that method.
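    A sketch of that split (the narrow table name is hypothetical): resolve the SessionID against a small table first, then hit the wide EAV table exactly once:

```sql
-- Narrow lookup table: one row per (AttributeID, GoodThruDate, SessionID)
CREATE TABLE AttributeSessions
(
AttributeID int,
GoodThruDate datetime,
SessionID int
)

-- Step 1: find the session in the small table
SELECT MAX(SessionID)
FROM AttributeSessions
WHERE AttributeID = 2 AND GoodThruDate = '3/31/2011'

-- Step 2: fetch the value from the large EAV table using that SessionID,
-- so the big table is only ever probed with an exact key in hand.
```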

    Sorry I'm thinking out loud here but hey I'm in the database design forum.

    Also, my target data is financial analytics, not my hypothetical city. In this world folks can make up new data pretty easily. And then they want to modify the analytic (and keep the old one(s)), or three different people have different interpretations of what x means, leading to Bob's x and Joe's x, etc. This is what led me to this model.

    Now that I know what to call this (EAV) I'm light years ahead in understanding what I may be stepping into!

    If you're dealing with financials, definitely create a table that's purely numeric (decimal(38,8) is probably a good baseline) and another for text if you're going to pursue this. You will feel less pain, though you will have to help train users to know where to put data. Either that or dual-column the EAV table so that you aren't trying to fight with ISNUMERIC() query issues.


    - Craig Farrell


  • This is powerful stuff. I still haven't fully committed to going down this rabbit hole.

    I like the idea about splitting the numeric and character values.

    The attribute 2, or SubDatapoint as I previously called it, is disconcerting. See my revised model below, which eliminates this. In this model the attribute can have a category. If there is a deeper hierarchy I'll push the data into a traditional table.

    My design is already going to be a hybrid. My entities themselves will be traditional and there will be additional normalized tables.

    Since I wouldn't want to tackle hierarchies and EAV at the same time a single category will hopefully be sufficient.

    --New Model

    CREATE TABLE Attributes
    (
    AttributeID int,
    AttributeCategory varchar(30),
    AttributeName varchar(30),
    AttributeDataType varchar(10)
    )

    INSERT INTO Attributes
    SELECT 1, 'Population Distribution', 'Under 25', 'Numeric' UNION
    SELECT 2, 'Population Distribution', '25-50', 'Numeric' UNION
    SELECT 3, 'Population Distribution', '51-70', 'Numeric' UNION
    SELECT 4, 'Population Distribution', '51-70', 'Numeric' UNION
    SELECT 5, 'Population Distribution', '71+', 'Numeric'

    CREATE TABLE AttributeValues
    (
    EntityID int,
    AttributeID int,
    Value_Numeric decimal(38,8),
    Value_Character varchar(25)
    )

    INSERT INTO AttributeValues
    SELECT 1, 1, 20, NULL UNION
    SELECT 1, 2, 30, NULL UNION
    SELECT 1, 3, 15, NULL UNION
    SELECT 1, 4, 5, NULL UNION
    SELECT 1, 5, 30, NULL

    SELECT * FROM Attributes
    SELECT * FROM AttributeValues

    DROP TABLE Attributes
    DROP TABLE AttributeValues
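    One way to read values back out of this model (a sketch against the tables above; it would run before the DROPs) is to let AttributeDataType pick which column to return:

```sql
-- Return each attribute for entity 1, using the declared type to choose a column
SELECT
a.AttributeCategory,
a.AttributeName,
CASE a.AttributeDataType
WHEN 'Numeric' THEN CAST(v.Value_Numeric AS varchar(41))
ELSE v.Value_Character
END AS Value
FROM Attributes a
JOIN AttributeValues v ON v.AttributeID = a.AttributeID
WHERE v.EntityID = 1
```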
