October 25, 2015 at 7:19 pm
Kirsten: I'm not sure the flexibility afforded by bringing in columns that might one day be useful offsets the storage and performance overhead of all the rows and columns that won't. Either way, it might be worth looking into how many of those columns contain sensitive data and whether there are any meaningful reports based on them.
Also, I just wanted to check whether you're using CDC for your ETL process, since you mentioned you need to do net change.
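For reference, the net-change pattern with CDC looks roughly like this. It's only a sketch - the table name (dbo.Orders), the capture instance name (dbo_Orders) and the LSN handling are made up for illustration:

-- Enable CDC on the source database and on one table (net-change support
-- needs a primary key or unique index on the table).
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',        -- hypothetical source table
    @role_name     = NULL,
    @supports_net_changes = 1;         -- enables the net-change function

-- In each ETL run, pull only the net change since the last extraction.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_Orders');  -- or the LSN saved from the previous run
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');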
October 25, 2015 at 8:01 pm
Jayanth_Kurup (10/25/2015)
Kirsten: I'm not sure the flexibility afforded by bringing in columns that might one day be useful offsets the storage and performance overhead of all the rows and columns that won't. Either way, it might be worth looking into how many of those columns contain sensitive data and whether there are any meaningful reports based on them. Also, I just wanted to check whether you're using CDC for your ETL process, since you mentioned you need to do net change.
Yeah, but that's normally not your call. I have a similar situation where the business won't allow us not to import all the data, because the users may eventually want it one day, even when that day is not today. And likewise, it's hard to re-import and update 1 billion+ rows at the last minute (as it always seems to be for the end user) just for those few columns we withheld.
As long as the end user is willing to accept the costs and/or is your boss, there is nothing more you can do and everything will be just fine. 🙂
October 26, 2015 at 3:52 am
Jayanth_Kurup (10/25/2015)
not sure if the flexibility afforded by bringing in columns that one day might be useful offsets the storage and performance overhead of all the rows and columns that won't
I'm sure it varies for other folk 🙂 For us it's normally a third party providing the initial data, and our expectation is that our client will change their mind / have given us poor information when the Spec was created ... The cost & hassle of getting more columns from the supplier is typically huge. So we find it easier to ask for "all columns" (with some caveats 🙂 ), which solves that. Otherwise it is always us that the client thinks is at fault - "Why are we having to pay the supplier more money?" - to which the answer "'Coz you never told us you wanted all that extra data" never seems to be the right one!
columns contain sensitive data and if there are any meaningful reports based on them
Yes, we are selective but we do include anything that "Might be of interest"
Also I just wanted to check if you're using CDC for your etl process since you mentioned you need to do net change.
No, we are not using CDC. Many sources are e.g. MySQL, and the ones that are MSSQL are stone-age (the client is slow to upgrade and/or the APP is slow to change), the vendors are unlikely to understand what CDC is, and our client is likely to have the cheapest license that doesn't include the ability to directly query the data, let alone anything fancy like being allowed to modify the database at all. Also, none of our clients have Enterprise licenses ... and so the list goes on!
We make a living getting in the middle, aggregating data more easily than competitors, and being able to do it efficiently (on low-grade hardware) whatever the source of the client's data is. I expect we could do it all with SSIS ... but we don't; we use our own, mechanically generated, SQL code to do all the donkey work. From the client's perspective that includes good data-quality checking built into our APP, so they don't have to plough through alien-looking reports, and if they can't be bothered to do anything [usually the case 🙁 ] then warnings show up on the affected records. Client staff of office clerical grade are able to use our APP, diagnose data issues, and resolve them by themselves in most cases - again avoiding significant ongoing costs. We'd prefer to be writing new software rather than supporting day-to-day data issues for clients.
Jayanth_Kurup (10/25/2015)
Yeah, if the end user is willing to agree to the cost, it shouldn't matter.
I think it's cheaper. IME, getting suppliers to modify their data exports is hugely expensive - cynically, I think they sell cheap and expect modification charges.
October 26, 2015 at 5:56 am
The approach of fetching data that might not be needed currently is very common, almost to the point of being an industry standard. I personally followed this approach for almost 10 years, mainly because when it came to data I was greedy just like everybody else. The more the better, and it's always better to have it than not. I realized the impact only when I started consulting and being called in to optimize DWs that had become sluggish under the load of all this data that never gets used.
Would you believe that I was able to cut an ETL process that took 20 hours down to under 30 minutes after we redesigned the system from the bottom up and got rid of everything that they felt they needed but that never saw the light of day? Today the company advertises its reports as real-time 😉
Personally, I have seen that out of a hundred columns being imported, fewer than half ever actually make it to a report. If you have a data dictionary, try this out: map source columns to output columns. The number might be interesting, and it's a great way to advertise the efficiency of your BI process.
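If you don't have a formal data dictionary, a rough version of that mapping can be scripted. A minimal sketch, assuming a hypothetical mapping table etl.ColumnUsage that records which staged columns feed any output - the schema and table names here are illustrative, not from anyone's real system:

-- List imported columns that no report/output mapping references.
-- etl.ColumnUsage is a hypothetical data-dictionary table with
-- SchemaName, TableName and ColumnName columns.
SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS AS c
LEFT JOIN etl.ColumnUsage AS u
       ON u.SchemaName = c.TABLE_SCHEMA
      AND u.TableName  = c.TABLE_NAME
      AND u.ColumnName = c.COLUMN_NAME
WHERE c.TABLE_SCHEMA = 'staging'   -- only look at the imported/staging schema
  AND u.ColumnName IS NULL         -- imported but never mapped to any output
ORDER BY c.TABLE_NAME, c.ORDINAL_POSITION;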
October 26, 2015 at 7:06 am
Interesting point, and I can see how reducing data volume would improve performance.
For us, we concentrate on only transferring rows that have changed (a sketch of one way to do that is at the end of this post). Network transmission time is our slowest component - usually from the web to the client's office over an ADSL line or similar, and data going the other way, up to the web, is subject to the upload speed, which is often much slower than the download speed. I doubt that even 0.1% of rows change in a day.
But even with that we do have some long-running processes. Transform times will probably be unchanged, as by that stage we are only handling columns-of-interest and rows-that-have-changed, although of course if the source tables are bloated with unnecessary columns the I/O may be greater than if the source tables were "skinny". Not sure how significant an improvement that would be, but as you say it would be worth the experiment.
We could add a refinement step: once the process is signed off in DEV by the client as "meets requirement", purge all the unused columns to minimise the size of data in transit and in storage. But it's extra cost, and our attitude has been "if it runs in an acceptable time window, leave it alone". Of course, if it ran in a fraction of that time there would be plenty of slack for anything that went haywire to re-run and still complete before the users who need the data arrive for work.
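To illustrate the rows-that-have-changed approach mentioned above: a minimal sketch of the kind of hash-comparison net-change detection that can be generated per table when CDC isn't available. The table and column names (stage.Customer, dw.Customer, Name, Email) are made up for illustration:

-- Transfer only rows that are new or whose tracked columns have changed,
-- using a hash comparison rather than CDC.
SELECT s.CustomerID, s.Name, s.Email
FROM stage.Customer AS s
LEFT JOIN dw.Customer AS d
       ON d.CustomerID = s.CustomerID
WHERE d.CustomerID IS NULL        -- row not yet in the warehouse
   OR HASHBYTES('MD5', ISNULL(s.Name, '') + '|' + ISNULL(s.Email, ''))
   <> HASHBYTES('MD5', ISNULL(d.Name, '') + '|' + ISNULL(d.Email, ''));

In practice the column list would be generated mechanically per table, along the lines of the generated SQL described earlier in the thread.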
October 26, 2015 at 7:37 am
I'd keep staging in a separate database, because that gives you the flexibility to set whatever recovery model, backup schedule, log truncation, and other database options are optimal for staging without interfering with the normal operation of the production data warehouse. With staging and the DW in the same database, I believe you would find yourself having to make too many compromises and trade-offs.
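As a rough illustration of the kind of independent settings that allows (the database name and the specific options are just examples, not a recommendation for any particular environment):

-- Staging can run with minimal logging and no point-in-time recovery
-- requirement, independent of the warehouse's own settings.
CREATE DATABASE StagingDW;   -- hypothetical name

ALTER DATABASE StagingDW SET RECOVERY SIMPLE;       -- keep the log small; staging can simply be reloaded
ALTER DATABASE StagingDW SET PAGE_VERIFY CHECKSUM;
-- Meanwhile the production DW database keeps FULL recovery and its own backup schedule.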
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho