February 20, 2006 at 1:57 am
Dear members,
I would like to thank you for the support that you provide to newcomers to SQL Server. I have an easy question to ask, but I need your advice.
I am refining (repairing) an existing MIS system for a company.
The AS-IS situation is like this:
In short, the daily operational system (namely Xelis) feeds a data-staging area, and the data is then presented in the relevant datamarts. The procedure is that every table in the staging area gets truncated every night and the data (even though a large amount of the operational system's data remains the same) is reloaded into the staging tables. Is this right (or rather, efficient)? I have my doubts. Shouldn't only the updated information be loaded? Why truncate every day and reload? (However, I believe this is affordable, since the Xelis database isn't considered big; the largest table is around 200,000 records.) Please give me your opinion on the above.
So all the DTS package does is truncate the staging tables and then reload the data (both changed and unchanged). However, the SQL code that does this is written inside each task in the DTS package, instead of the tasks simply executing a stored procedure. Is this efficient? (I believe not.) Shouldn't the SQL code be written in procedures, with the DTS tasks executing each procedure? Please comment.
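Just to illustrate, each Execute SQL task today holds something along these lines (the table and column names here are made up for the example, not the real Xelis names):

-- Illustrative only: table/column names are placeholders, not the real schema
TRUNCATE TABLE Staging.dbo.Customer

INSERT Staging.dbo.Customer (CustomerID, CustomerName, Region)
SELECT CustomerID, CustomerName, Region
FROM   Xelis.dbo.Customer        -- full reload, changed and unchanged rows alike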
Thank you for your time,
February 20, 2006 at 6:19 am
There are many methods for populating datamarts, each with its own advantages and disadvantages, and some are forced upon the developer.
In terms of populating target tables, you need to be able to identify deletions, additions and changes if you only want to update the target table. You will need an update column on the source to do this.
If you can do this, then the next consideration is what to do with deletions. Do you want to remove them from the target or flag them as deleted? Changes are worse: do you want to replace the row with the new entry, or do you want to flag the current record as ended and create a new one? This dilemma is known as a slowly changing dimension. Search the web for the many ways of dealing with this phenomenon.
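For illustration, a rough Type 2 pattern looks like this; the table and column names (SrcCustomer, DimCustomer, LastUpdated and so on) are made up for the example, and it assumes the source carries a last-updated column:

-- Illustrative sketch only; all names are placeholders
DECLARE @LoadDate datetime
SET @LoadDate = GETDATE()

-- 1) Find source rows that are new or changed since their current version was loaded
SELECT s.CustomerID
INTO   #Changed
FROM   dbo.SrcCustomer s
LEFT JOIN dbo.DimCustomer d
  ON   d.CustomerID = s.CustomerID AND d.IsCurrent = 1
WHERE  d.CustomerID IS NULL              -- addition
   OR  s.LastUpdated > d.StartDate       -- change

-- 2) Flag the current version of each changed row as ended
UPDATE d
SET    d.EndDate = @LoadDate, d.IsCurrent = 0
FROM   dbo.DimCustomer d
JOIN   #Changed c ON c.CustomerID = d.CustomerID
WHERE  d.IsCurrent = 1

-- 3) Insert a fresh current version for each new or changed row
INSERT dbo.DimCustomer
       (CustomerID, CustomerName, Region, StartDate, EndDate, IsCurrent)
SELECT s.CustomerID, s.CustomerName, s.Region, @LoadDate, NULL, 1
FROM   dbo.SrcCustomer s
JOIN   #Changed c ON c.CustomerID = s.CustomerID

-- Deletions can be handled the same way: flag dimension rows whose key no longer
-- appears in the source, or remove them, depending on the rule you choose

DROP TABLE #Changed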
As for updating the target, DTS, in my opinion, is not the best way to populate targets. You can use a Data Driven Query task, but they are very slow.
The best way is to use stored procedures, where you can comment the code for the business rules used in populating the databases.
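For example, something along these lines works well; the procedure and table names below are only placeholders:

-- Placeholder names throughout; the point is that the DTS Execute SQL task
-- contains nothing but the EXEC call, while the logic lives in the procedure
CREATE PROCEDURE dbo.usp_LoadCustomerStaging
AS
BEGIN
    SET NOCOUNT ON

    -- business-rule comments and logic live here, under source control
    TRUNCATE TABLE Staging.dbo.Customer

    INSERT Staging.dbo.Customer (CustomerID, CustomerName, Region)
    SELECT CustomerID, CustomerName, Region
    FROM   Xelis.dbo.Customer
END
GO

-- The DTS Execute SQL task then holds a single line:
-- EXEC dbo.usp_LoadCustomerStaging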
If you want to look into this in more detail, check out this site for some good research.
February 20, 2006 at 8:29 am
>>Shouldn't the SQL code be written in procedures, with the DTS tasks executing each procedure? Please comment.
I would also say yes, for the reasons already mentioned, but also for the purposes of source code control & versioning.
Ever tried to find the difference between a DTS package on one server and another? I'd rather have my SQL in stored procs, inside Visual SourceSafe, where I can track versions and determine what has changed between versions.
February 20, 2006 at 8:35 am
I agree with both of you, and I much appreciate your help. However, when I asked the MIS supervisor why the code is written within the DTS package, he said that this way the "migration" performs faster than if the DTS package were executing a procedure!
I have reason to doubt this, but can it be true?
February 20, 2006 at 8:41 am
>>MIS supervisor
I have 1 of those too, who believed the same thing, because the highly paid "suits" who were consulted to build our 1st datamart said they'd had "problems" with stored procs at another job site.
As soon as we got the $200/hour goons out of here, I re-wrote most of it in stored procs and it runs in 6 hours instead of 36.
February 20, 2006 at 9:29 am
you have any openings at $100 an hour? ha!
February 21, 2006 at 8:46 am
I prefer the original model over the one you are proposing for a few reasons:
1) Simplicity of the load routine. By not concerning yourself with whether data has been added, changed, or deleted, your load logic stays very simple.
2) Syncing and extending the warehouse. Yes, the majority of the data is unchanged; however, by addressing all of the data every time, you ensure that the warehouse never becomes out of sync with the OLTP system(s). Also, future extensions to the warehouse schema can be easily accommodated, whereas the other approach requires additional load processes whenever fields are added/changed/deleted in the warehouse table(s).
3) Deployment and change management. By keeping the load logic SQL in your DTS package, you've isolated all the logic needed for the build process. I believe this to be a good thing. It's easy to manage, easy to deploy, and easily configurable with DTS global variables. Change history is contained within the package and can be rolled back to any previous version. The package itself can be saved locally as a .dts file and uploaded to your source control system. Stored procedure objects have no implicit versioning, so deployment and change management would have to be handled externally with scripts.
4) Latency. By tweaking the model slightly you can ensure the warehouse is always available. Instead of starting the process by truncating the tables, I start by building a copy of the warehouse tables; then, if the build succeeds and is verified, I drop the current tables and rename the copies. That way the system is always (almost) available for reporting (see the sketch after this list). Under your proposed model there's a latency period while your load logic is running, and it could be quite long if you are dealing with a lot of changed/new data or your logic is complex.
5) Indexing. By addressing the load as I laid out in #4, you can apply your indexes to the tables after the data has been loaded. Under your model (and actually the original model without my tweak in #4), you are loading data into tables with indexes, which can be very inefficient at load time. Obviously you could drop the indexes before the load and rebuild them afterwards, but that's another step in your logic.
6) Portability. If all the logic is contained within the DTS package, then you aren't bound to a specific instance of SQL Server, so you could easily populate multiple databases/servers with your warehouse. Under your model you would have to publish all of your stored procedures to the separate databases/servers to allow for multiple builds. This could become clumsy and easily get out of sync without some additional management structures.
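Roughly, the build-and-swap from points 4 and 5 looks like this; the table and index names are only placeholders for the example:

-- Placeholder names; this only sketches the build-then-swap idea

-- 1) Build an empty copy alongside the live table (no indexes, so the load is fast)
SELECT *
INTO   dbo.DimCustomer_Build
FROM   dbo.DimCustomer
WHERE  1 = 0

INSERT dbo.DimCustomer_Build (CustomerID, CustomerName, Region)
SELECT CustomerID, CustomerName, Region
FROM   Xelis.dbo.Customer

-- 2) Index the copy only after the data is in
CREATE CLUSTERED INDEX IX_DimCustomer_Build ON dbo.DimCustomer_Build (CustomerID)

-- 3) Verify the build, then swap; the warehouse is unavailable only for the renames
BEGIN TRAN
    EXEC sp_rename 'dbo.DimCustomer',       'DimCustomer_Old'
    EXEC sp_rename 'dbo.DimCustomer_Build', 'DimCustomer'
COMMIT TRAN

DROP TABLE dbo.DimCustomer_Old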
Although some of this is just implementation details and really doesn't play a role in the "what's better" argument, I thought I would give an alternative view. I've implemented many warehouses with this model and I find it to be very flexible and stable.