April 23, 2015 at 5:36 am
OK, the question may sound a bit confusing, but let me give you the background of the business first.
In one of my client's businesses there is a daily file-in/file-out activity, meaning that every day they receive multiple types of files (Excel, text, MDB, etc.) from different sources (SAP, Oracle, DB2). It's nothing but a simple data dump from the different source systems. To give a sense of the data volume, it is approximately 10 million rows after consolidation from all sources.
This data is then normalized and loaded into the respective "Normalized Tables". The Operations team (who have some SQL knowledge) works on this data and does some data analysis. Once we get the go-ahead from the Operations team, I push the data into the OLAP environment (cubes) so that upper management can slice and dice the data as per their requirements. The file-in/file-out activity runs daily, and so do the consolidation and the steps that follow.
Currently, the overall process takes a day or more. Now I need to revisit the existing design and come up with a new one. The expectation is also that if the Operations team modifies existing or new data from the back end, it should be reflected in the OLAP reports with minimal latency. Given the data volume being updated in the current scenario, it is difficult to complete the refresh activity (incremental or full load) within a day.
In most scenarios the Operations team does a BULK INSERT / UPDATE, which forces a "FULL" refresh cycle, and that is a painful process right now.
I'm looking for expert suggestions on a design that would minimize latency in the data consolidation and refresh cycle.
Abhijit - http://abhijitmore.wordpress.com
April 23, 2015 at 8:10 am
I'm no expert, but are there any issues with spreading out the refresh cycles per data source, or do you have to do everything at once in one huge refresh cycle? For example, multiple data sources are sometimes on different refresh cycles, which dictates when each source's data is periodically updated in the DW.
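As a rough illustration of what I mean (the source names and cadences below are made up, not from your environment), a per-source schedule might look like this in Python:

from datetime import date

# Hypothetical illustration: each source keeps its own refresh cadence, so only
# the sources that are due today get reprocessed, instead of everything going
# through one huge all-at-once cycle.
REFRESH_CADENCE_DAYS = {
    "SAP": 1,     # daily feed
    "Oracle": 1,  # daily feed
    "DB2": 7,     # weekly feed
}

def sources_due(last_refreshed, today):
    """Return the sources whose cadence says they are due for a refresh."""
    due = []
    for source, cadence in REFRESH_CADENCE_DAYS.items():
        last = last_refreshed.get(source)
        if last is None or (today - last).days >= cadence:
            due.append(source)
    return due

if __name__ == "__main__":
    last = {"SAP": date(2015, 4, 22), "Oracle": date(2015, 4, 23), "DB2": date(2015, 4, 15)}
    print(sources_due(last, date(2015, 4, 23)))  # -> ['SAP', 'DB2']

That way a slow weekly feed no longer drags the daily feeds into one giant refresh.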
The other suggestion would be to do as much transformation as possible at the disk level before the BULK INSERT. This is something I do with large flat files before bulk-inserting them into my system, which goes through similar processes. Python helps me a great deal with that on-disk transformation.
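A minimal sketch of that idea, assuming a comma-delimited source file with five columns (the file names and column layout are placeholders, not your actual feed):

import csv

# Clean and reshape a large flat file with Python *before* it ever hits
# BULK INSERT, so the database only sees rows that are already load-ready.
def prepare_for_bulk_insert(src_path, dst_path):
    """Stream src_path row by row, drop malformed rows, normalize the
    delimiter to a pipe, and write a load-ready file to dst_path."""
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)                     # source is comma-delimited
        writer = csv.writer(dst, delimiter="|")      # target uses '|' for BULK INSERT
        for row in reader:
            if len(row) != 5 or not row[0].strip():  # assume 5 columns, key in column 0
                continue                             # skip rows the load would reject
            writer.writerow(cell.strip() for cell in row)
            kept += 1
    return kept

# The cleaned file then loads with a plain BULK INSERT (FIELDTERMINATOR = '|'),
# with no row-level error handling needed on the SQL side.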
April 23, 2015 at 8:15 am
You are correct! We have periodic data modifications, which force a full ("HUGE") refresh, i.e., a full fact load. There are certain scenarios where we apply an incremental load (when the data volume is low).
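For what it's worth, the incremental scenario works roughly like this sketch (the row shape and the 20% cutoff below are illustrative assumptions, not our real numbers):

from datetime import datetime

# Keep a high watermark (the max modified timestamp already loaded) and only
# push rows newer than it; fall back to a full fact load when a bulk update
# has touched too much of the table for an incremental pass to be worthwhile.
FULL_LOAD_THRESHOLD = 0.20  # fraction of the table that triggers a full refresh

def plan_refresh(rows, watermark, table_size):
    """Return ('incremental', changed_rows) or ('full', all rows)."""
    changed = [r for r in rows if r["modified_at"] > watermark]
    if table_size and len(changed) / table_size > FULL_LOAD_THRESHOLD:
        return "full", rows        # bulk update touched too much: full fact load
    return "incremental", changed  # small delta: incremental load is enough

if __name__ == "__main__":
    wm = datetime(2015, 4, 22, 0, 0)
    rows = [
        {"id": 1, "modified_at": datetime(2015, 4, 21, 9, 0)},
        {"id": 2, "modified_at": datetime(2015, 4, 23, 7, 0)},
    ]
    mode, batch = plan_refresh(rows, wm, table_size=10_000_000)
    print(mode, [r["id"] for r in batch])  # -> incremental [2]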
Abhijit - http://abhijitmore.wordpress.com