August 23, 2007 at 4:53 am
Hi,
Sincere apologies in advance for the long post.
I've spent six months creating an Operational Data Store (ODS) for my company on Windows 2003 with SQL Server 2000 SP4. The initial brief was consolidation of existing reporting services into a central repository within a tight timescale and budget. The service has been running successfully for some time now, and I have the opportunity to spend additional time and funds getting it exactly how I want it.
There are a few main issues I'm not sure how to deal with.
The data comes from a central third-party ERP solution that is written in some obscure language and sure as hell isn't relational in its approach. This means many of the tables I am working with are not normalised. That is not to say there is no data integrity; it's life, Jim, but not as we know it. The main relationships that exist between tables are sound, but many of the tables themselves look more like what you'd expect to see in a reporting service, or contain loads of calculated data.
My previous approach was to generate copies of the tables and simply provide reports where possible. In some cases I had to do some transformations on loading the data to create new tables that would allow me to write more efficient queries, but it was an 80% 1:1 copy, if you can have such a thing. Many reports are based on a single table and a lot of WHERE clauses.
I started all of this with no schema for the ERP, as none exists. With the knowledge I've gained, I now have an ERD for the database. I'm now considering loading the data into a normalised structure and then denormalising it with views for reporting. I'm hoping that this approach will mean:
a) I can do some proper data integrity checking
b) I'm storing a lot less data
c) Anyone who comes after me will be able to see a much clearer picture of our business model.
The downside is I will end up with many more tables in order to create my views, in which case I'm considering dropping the data into properly indexed tables once the ODS has been loaded, many of which will look the same as the original ERP tables. However, I think that is a small price to pay if I can achieve a, b and c.
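To make the idea concrete, this is roughly the kind of reporting view I have in mind over the normalised tables. All table and column names here are made up for illustration, not taken from our actual system:

```sql
-- Illustrative sketch: a view that denormalises normalised order,
-- customer and line tables back into the flat shape reports expect.
CREATE VIEW dbo.vw_OrderReport
AS
SELECT
    o.OrderNo,
    o.OrderDate,
    c.CustomerName,
    c.Region,
    ol.ProductCode,
    ol.Quantity,
    ol.Quantity * ol.UnitPrice AS LineValue   -- calculated, not stored
FROM dbo.Orders AS o
JOIN dbo.Customers AS c
    ON c.CustomerID = o.CustomerID
JOIN dbo.OrderLines AS ol
    ON ol.OrderID = o.OrderID
```

The reports then query the view exactly as they queried the old flat copy, so only the load process needs to know about the normalised structure.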
So, the questions:
a) Is this sensible?
b) The ERP tables do have primary keys; at least, when viewing the data via an ODBC driver for the system, I can see columns that are interpreted as PKs and appear to provide unique references for each record. However, the properly normalised tables will have, in some cases, 5+ fields defining the PK, and I am loath to carry around that much data when I need to use them as an FK in another table and I'm doing a lot of involved queries. However, if I generate a surrogate key (plus a unique index on the natural key), I then need to populate the child tables with this key. I can see this being a somewhat messy business; any suggestions or opinions on how to approach it would be gratefully received.
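For what it's worth, the pattern I'm picturing looks something like this (names are invented for the example; assume a three-part natural key):

```sql
-- Sketch: IDENTITY surrogate key plus a unique index to protect
-- the natural key.
CREATE TABLE dbo.DimItem (
    ItemKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Company   char(3)     NOT NULL,
    Site      char(4)     NOT NULL,
    ItemCode  varchar(20) NOT NULL
)

CREATE UNIQUE INDEX UX_DimItem_Natural
    ON dbo.DimItem (Company, Site, ItemCode)

-- During the load, translate the wide natural key in each child
-- table into the single int surrogate, once, so every later query
-- joins on one column instead of three.
UPDATE f
SET    f.ItemKey = d.ItemKey
FROM   dbo.FactStock AS f
JOIN   dbo.DimItem   AS d
  ON   d.Company  = f.Company
 AND   d.Site     = f.Site
 AND   d.ItemCode = f.ItemCode
```

The messy part is that every child table needs one such translation step in the load, but it only ever happens there.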
c) At them moment i load the data at night when the ERP is not being used by dropping all data and indexes and then reloading the whole system. It takes about 90 mins. I don't need a true data warehouse with every single event recorded as our business model changes to often for it to make any sense; however i would be interested to know if there is any advantage to loading new data and then updating changes to existing records. From one perspective that would appear to require some cumbersome and time costly queries checking for changes to each piece of data in each record but it might mean i can load data at more frequent intervals. That in itself brings up many questions about locking and performance tuning.
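One cheap approach I've seen suggested for the change-detection problem is comparing checksums between a staging copy and the live table rather than every column individually. A rough sketch with invented names (note BINARY_CHECKSUM exists in SQL Server 2000, but it can collide, so it's a first-pass filter rather than a guarantee):

```sql
-- 1) Update rows whose content appears to have changed.
UPDATE t
SET    t.Qty   = s.Qty,
       t.Price = s.Price
FROM   dbo.StockLevels         AS t
JOIN   dbo.Staging_StockLevels AS s
  ON   s.ItemID = t.ItemID
WHERE  BINARY_CHECKSUM(t.Qty, t.Price)
    <> BINARY_CHECKSUM(s.Qty, s.Price)

-- 2) Insert rows that exist in staging but not yet in the ODS.
INSERT dbo.StockLevels (ItemID, Qty, Price)
SELECT s.ItemID, s.Qty, s.Price
FROM   dbo.Staging_StockLevels AS s
LEFT JOIN dbo.StockLevels AS t
  ON   t.ItemID = s.ItemID
WHERE  t.ItemID IS NULL
```

Run as two set-based statements per table, this keeps locks short compared with row-by-row checking, which matters if the load is going to run hourly against live reporting.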
To give some general idea, the current ODS is 30 GB, growing at around 1.5 GB a month; the largest tables have ~1.5 million records and ~100 fields. I expect to reduce that by two-thirds with the normalised structure.
I've read a fair amount on creating data warehouses, but on one level this does seem a somewhat unusual scenario.
My current preferences/answers are:
a) yes
b) Surrogate keys on all tables, the exceptions being many-to-many join tables unless they are used in another relationship, until I have some hard data that says I'd be better off with the natural key.
c) Bite the bullet: update existing data and only load new data. Change the load frequency to every hour to start with, and then bring the interval down if possible.
Any comments/help very much appreciated.
K.
August 23, 2007 at 5:47 am
This website provides good info on data warehouse design; just do a search using ODS as a keyword.
As an example: http://www.sqlservercentral.com/columnists/vRainardi/2657.asp
I would highly recommend surrogate keys in the ODS, and check constraints if possible (that is, if they don't hurt the performance of the data load processes). Natural keys are recommended (and actually work fine) for the staging area.
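By check constraints I mean simple sanity rules that catch bad source data at load time. For example (table and column names invented here):

```sql
-- Illustrative constraints; drop or disable them if they slow
-- the bulk load too much.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT CK_Orders_Qty
    CHECK (Quantity > 0)

ALTER TABLE dbo.Orders
    ADD CONSTRAINT CK_Orders_Dates
    CHECK (ShipDate >= OrderDate)
```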
Hope this helps.
August 23, 2007 at 8:20 am
I use surrogate keys for all my dimension tables.
As for updating existing data and inserting new data, it is better to first understand how the system feeding the ODS works. Does it update the data in place, or does it delete the data and insert the correct record?