May 17, 2012 at 3:46 pm
Hi all,
I have a relatively simple (logic-wise) MERGE statement that we run every day to update the data warehouse's dimension / fact tables from the raw ODS table, but it's taking so long that I would like some new ideas on how to tune it better.
The general form of the MERGE statement is simply:
MERGE ZooskDataWarehouse.dbo.DimUser AS target
USING OdsUsers AS source
ON target.UserID = source.UserID
WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
UPDATE SET xxxxxx
WHEN NOT MATCHED BY TARGET THEN
INSERT XXXXX ;
Basically, each day production (the web site) sends the DW (data warehouse) all the users that have updated their account. I then merge all of that data into our ODS table (which stores all the user data in its raw form),
and then merge ODS into both DimUser (the user dimension) and FactUser (the user fact table).
Quite simple in logic. But the problem is the volume.
We have about 120 million rows of user data (so 120MM rows each for the ODS, DIM, and FACT user tables).
Each day, about 7 million rows of user data change; most of them are modifications to existing users, which go through the update path.
The SQL itself takes about an hour to run, even on fast CPUs and RAIDed SSDs, and that is actually only about 1/3 of the total time of the overall process.
How can I make it more efficient?
What should my clustered index be? (Right now I only have a clustered index on UserID, plus a nonclustered index on UserID and the LastUpdated date.)
Should I set up partitioned tables? (The problem is that every user has a chance of being updated, so you can't really partition by time.)
I already have the database set up with multiple files/filegroups.
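To be concrete, the current indexing looks roughly like this (index names are just placeholders, and whether the clustered key is actually the primary key is an assumption):
-- Assumed current indexing on DimUser (names are illustrative)
CREATE CLUSTERED INDEX CIX_DimUser_UserID
ON dbo.DimUser (UserID);
CREATE NONCLUSTERED INDEX IX_DimUser_UserID_LastUpdated
ON dbo.DimUser (UserID, LastUpdated);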
During the ETL, I can see that CPU / memory utilization is very low. There is disk activity, but I can't say it's maxed out either. So I suspect there's a lot of room for improvement in terms of efficiency.
Another thought is to put a WHERE clause on the SOURCE part of the MERGE, for example (select * from OdsUsers where LastUpdated > getdate() - 1).
Logically it makes sense, because then I am only merging the newly updated users into the dimension. However, it actually turns out to be slow, mostly because that WHERE clause itself is so inefficient to evaluate.
I also thought about putting a clustered index on LastUpdated so that this SELECT would be quick, but then the index maintenance becomes very inefficient because of all the data shuffling whenever rows are updated.
Any suggestions on how I can do better? Thanks.
May 17, 2012 at 10:41 pm
Without the query plan it's hard to give firm advice.
I would definitely try reducing the number of rows going into the MERGE statement:
;WITH UpdatedOdsUsers as
(select * from OdsUsers
where LastUpdated > dateadd(dd, -1, getdate())
)
MERGE ZooskDataWarehouse.dbo.DimUser AS target
USING UpdatedOdsUsers AS source
ON target.UserID = source.UserID
WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
UPDATE SET xxxxxx
WHEN NOT MATCHED BY TARGET THEN
INSERT XXXXX ;
You may want to split the MERGE back into a two-step process, as it's easier to tackle the performance of each step individually.
MERGE is handy, but hard to tune.
INSERT DimUser (UserId, c1, c2, c3) select UserId, '', '', 0 --defaults!
from OdsUser where userId not in (select UserId from DimUser)
--Optimize the insert by having non-clustered indexes
--on UserId on both tables. By inserting default values, IO reads
--are greatly reduced.
UPDATE du SET XXXXXX
FROM DimUser du
INNER JOIN OdsUser ou ON ou.UserId = du.UserId
WHERE du.LastUpdated < ou.LastUpdated
AND ou.LastUpdated > dateadd(dd, -1, getdate())
How many other indexes are on these tables? It may be quicker to drop/recreate the non-clustered indexes vs leaving them intact.
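If you go that route, a minimal sketch of disabling a non-clustered index around the load and rebuilding it afterwards might look like this (the index name is hypothetical; repeat per index you want out of the way):
-- Disable a non-clustered index before the big load (index name is an assumption)
ALTER INDEX IX_DimUser_Example ON dbo.DimUser DISABLE;
-- ... run the INSERT / UPDATE (or MERGE) here ...
-- REBUILD both rebuilds and re-enables the disabled index
ALTER INDEX IX_DimUser_Example ON dbo.DimUser REBUILD;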
If you can post the plan, it would really help.
thanks!
May 18, 2012 at 1:36 am
120 million rows is quite a huge number. Have you ever thought of partitioning the table?
Say, for example, you have around 8 partitions. Then you could start, say, 8 sessions, with each session inserting roughly 900K rows into its own partition based on the partition key.
Even if you drop indexes to make the load faster, recreating indexes on partitioned tables should be fast.
But again, check whether partitioning is suited to your case and provides the performance you are looking for.
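Just as a rough sketch of that idea (the boundary values and filegroup mapping below are made up; in practice you would choose boundaries that split your actual UserID range evenly and map partitions to real filegroups):
-- Rough sketch: range-partition DimUser on UserID into 8 partitions
CREATE PARTITION FUNCTION pfUserId (int)
AS RANGE LEFT FOR VALUES (20000000, 40000000, 60000000, 80000000, 100000000, 120000000, 140000000);
CREATE PARTITION SCHEME psUserId
AS PARTITION pfUserId ALL TO ([PRIMARY]);
-- The clustered index would then be created on the partition scheme, e.g.:
-- CREATE CLUSTERED INDEX CIX_DimUser_UserID ON dbo.DimUser (UserID) ON psUserId (UserID);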
GulliMeel
Finding top n Worst Performing queries
Improve the performance of Merge Join (special case)
How to Post Performance Problem - Gail Shaw
May 18, 2012 at 5:49 am
You'd need to show the actual execution plan in order to make any guesses at what's needed to tune the query. One suggestion: see if there are ways to limit the initial data set to only the data that has been touched over the preceding 24 hours. Then the MERGE will be working with less data, and that'll help. Otherwise, no guesses without some idea of the structure and how the query is executing.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
May 18, 2012 at 7:53 am
The problem very likely has nothing to do with the number of rows being compared. It's probably the number of rows being updated.
I don't have a definitive test to show you right now, but I did one in the past. On my test box, it took scant seconds (about 4) to update a million rows in a much larger table. When I doubled the number of rows, the time only doubled, just as expected. But when I tripled the number of rows, instead of it only taking 3 times longer (12 seconds in this case), it took more than 20 minutes.
I call these things "tipping points" and every server has them. The size may vary, but they all have them.
My recommendation would be to break up the MERGE so that you're only working with a million rows at a time. Seven updates of a million rows each, in a total of maybe 28 seconds (a guess, for sure), is going to be a lot better than your hour-long run.
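One rough way to sketch that batching is to slice the day's changes by ranges of UserID, so each individual MERGE touches a bounded number of rows (the slice size is just a starting point to tune, and the xxxxxx / XXXXX placeholders follow the original post, so this is only an outline, not a tested implementation):
DECLARE @SliceSize int = 1000000; -- roughly a million UserIDs per slice
DECLARE @LastID int, @MaxID int;
SELECT @LastID = MIN(UserID) - 1, @MaxID = MAX(UserID) FROM OdsUsers;
WHILE @LastID < @MaxID
BEGIN
    MERGE ZooskDataWarehouse.dbo.DimUser AS target
    USING (SELECT * FROM OdsUsers
           WHERE UserID > @LastID AND UserID <= @LastID + @SliceSize
             AND LastUpdated > dateadd(dd, -1, getdate())) AS source
    ON target.UserID = source.UserID
    WHEN MATCHED AND target.LastUpdated < source.LastUpdated THEN
        UPDATE SET xxxxxx
    WHEN NOT MATCHED BY TARGET THEN
        INSERT XXXXX ;
    SET @LastID = @LastID + @SliceSize;
END;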
--Jeff Moden
Change is inevitable... Change for the better is not.
May 18, 2012 at 9:13 am
You say ETL... are you using SSIS for the ETL?
If so, you can sometimes outperform merge by using the constructs within SSIS to mimic a merge.
Use a Lookup to see if the record already exists; if it doesn't, do an insert.
Do a second lookup to see if the record has changed; if it has, do an update;
otherwise just toss out the record.
With my larger DW tables (250 million rows) that seems to give better performance than the MERGE. I suspect memory pressure, but our DW server is always under memory pressure these days (looking forward to our nice new upgrade).
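In T-SQL terms, that data flow does something roughly like the following (in SSIS it's Lookup transformations feeding an insert destination and an update branch; c1 / c2 are placeholder columns, not the real ones):
-- 1) Lookup #1: row doesn't exist in the dimension yet -> insert it
INSERT dbo.DimUser (UserID, c1, c2, LastUpdated)
SELECT ou.UserID, ou.c1, ou.c2, ou.LastUpdated
FROM OdsUsers ou
WHERE NOT EXISTS (SELECT 1 FROM dbo.DimUser du WHERE du.UserID = ou.UserID);
-- 2) Lookup #2: row exists but has changed -> update it
UPDATE du
SET du.c1 = ou.c1, du.c2 = ou.c2, du.LastUpdated = ou.LastUpdated
FROM dbo.DimUser du
INNER JOIN OdsUsers ou ON ou.UserID = du.UserID
WHERE du.LastUpdated < ou.LastUpdated;
-- 3) Rows that exist and haven't changed are simply discarded (no work done).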
May 18, 2012 at 9:43 am
mtassin (5/18/2012)
With my larger DW tables (250 Million rows) that seems to give better performance than the merge...
A race would be interesting.
--Jeff Moden
Change is inevitable... Change for the better is not.