March 15, 2011 at 6:43 am
All
I have a conceptual question. Currently my live environment is copied back to a staging area overnight for my datawarehouse, and my OLAP cube is built early morning from constructs of that data.
I want to update my cube throughout the day now, by taking a union of the "contacts fact" table and union that with a "contacts fact" view on the live environment.
The staging table has 1 million + rows, and the live table something like 2000 rows at any given time. Running a query on the live environment takes less than a second and I am happy with that overhead, as it has minimal impact to the users of the system.
However, if I were to union those 2 tables, and then re process the cube every hour, would the query to pull that data from the then combined fact contact table still only have a < 1 second overhead on the live environment? or does the union mean that both environments will be polled for the same length of time.
I guess if that is the case I can simply copy the live data to a temporary table and union that table with the one in my staging area...?
Any advice would be appreciated. Thanks in advance!
March 15, 2011 at 6:53 am
A simple "Union" is likely to degrade performance because of the sort and dedupe action in it. If you mean "union all", then that probably won't degrade performance.
I would tend towards keeping the warehouse data isolated and not use a view of production data into it. Otherwise, you're kind of defeating the whole purpose of a separate warehouse (query isolation so as to not degrade OLTP insert/update/delete performance or OLAP query performance). I'd be more inclined towards inserting data into a staging table periodically throughout the day, and using that to extend the warehouse table.
However, even that will suffer at least a bit with a simple "union", as opposed to "union all".
Edit: I guess the question is, why not just periodically add incrementally to the warehouse table(s)?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 15, 2011 at 6:56 am
What about doing insert into where not exists?
That query should fly, even if there's 1 M rows in the big table.
Also I think his question was referring to the locking of the prod db. I'm not sure a union select would take a very long lock in this situation...
March 15, 2011 at 7:00 am
Ninja's_RGR'us (3/15/2011)
What about doing insert into where not exists?That query should fly, even if there's 1 M rows in the big table.
Also I think his question was referring to the locking of the prod db. I'm not sure a union select would take a very long lock in this situation...
UNION between two tables can result in a table lock on each, since it has to sort and dedupe between the two. Given large tables, it can lock things for a long time in some cases. Given snapshot isolation or some comparable solution, that can be mitigated/eliminated, but it can take significant tuning to accomplish correctly.
UNION ALL is generally a shorter lock, since it just appends data, and doesn't have the sort and dedupe overhead.
Both of those statements are general. Actual mileage may vary.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
March 15, 2011 at 7:13 am
Thanks guys - both suggestions will work and from your advice, I think i'll stay away from the UNION approach and just plump with either a staging table or insert into where not exists.
Appreciated borrowing your expertise! 🙂
March 15, 2011 at 7:15 am
ProKelly (3/15/2011)
Thanks guys - both suggestions will work and from your advice, I think i'll stay away from the UNION approach and just plump with either a staging table or insert into where not exists.Appreciated borrowing your expertise! 🙂
Test all solutions. It might be even faster (prod side) to simply insert all 2000 rows into temp table and then do the not exists from dev. That way you have almost 0 effect on prod.
Viewing 6 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply