June 9, 2008 at 8:03 am
I have 2 identical Hardware servers. They were bot purchased and set up the same exact way. They are HP DL 580 G5, 64-bit EM64t/x64 with 16 CPU's running at 2.9GHz with 64 GB RAM. Both have SQL Server 2005 installed. SQL Server has a max memory setting of 52GB. Both servers have a database called OLAP which were created identical and have been loaded with the same backup copy of a SQL Server 2000 Database.
I am executing a not so well written query on each server. I am not looking at tuning this query, it is software generated. But on one server the query returns in 1 second. On the second server it return in 10 minutes and 30 seconds. The execution plan is different. I am at a total loss here.
This is the Query
SELECT F.DATASOURCE_NUM_ID, F.INTEGRATION_ID, F.ROW_WIDFROM WC_EPCR_BOOKING_DETAIL_F F (NOLOCK) JOIN WC_EPCR_PARAM_G P (NOLOCK) ON 1=1 WHERE F.ORDER_DATE_WID >= P.THIRTY_DAYS_AGO_START_WID AND NOT EXISTS ( SELECT FS.CTRLNUM FROM WC_EPCR_BOOKING_DETAIL_SEATS_FS FS (NOLOCK) WHERE F.INTEGRATION_ID = FS.SOURCE+UPPER(FS.CTRLNUM)+CONVERT(VARCHAR(5), FS.LINE_NO))
The server that executes this in one second has an execution plan of:
Select ----- Nested Loops ------ Nested Loops ------------ Nested Loops ------------ Table Scan 23%
| | |
| | Nested Loops ---------- Compute Scalar ------- Constant Scan
| RID Lookup 23% |
| Index Seek 23%
Top ------- Table Scan 32%
The server that executes this in 10 min 30 seconda has an execution plan of:
Select ------------ Nested Loops 11% ---------------------- Nested Loops 11% ------------------ Table Scan
Left Anti Semi Join Inner Join
| |
| Nested Loops 0% ------------- Compute Scaler 0% --------------- Constant Scan 0%
| |
| Index Seek 19%
|
Top 21% ----- Table Scan 39%
The three tables have:
WC_EPCR_BOOKING_DETAIL_F - 2,108,463 rows
WC_BOOKING_DETAIL_SEATS_FS - 37,008 rows
WC_EPCR_PARAM_G - 1 row (Yes, that is ONE row)
I realize there is probably not enough infor here to give me a 100% reason why. But if someone can tell me what to look for, what to lookat, etc. to start figuring out there is such a major difference, I would appreciate it.
All the SQL Server configuration parameters are EXACTLY the same.
I did notice that the one thing different is the Statistics are very differetn on each of the 3 tables. I have run Update statistic severla times with FULL SCAN on both servers.
Disk drives are all in a SAN and have been configured exactly the same.
Separate LUNS for TEMP Database, Data Files, Log Files, Operating system, SQL Server Installation and binaries, pagefile, backups
C: Operating system RAID 1
D: SQL Server Install - RAID 1
M: TEMP Database - RAID 10
L: Logs - RAID 1
S: Data - RAID 10
Z: Backups - RAID 5
P: and F: Pagefile - RAID 0
Database is roughly 300GB
June 9, 2008 at 8:36 am
SQL Server queries are not "deterministic" in terms of their execution plans.
First thing I'd do is rewrite that query a bit, then I'd worry about different execution plans AFTER I'd rewritten it.
Maybe something like:
SELECT F.Datasource_Num_ID, F.Integration_ID, F.Row_WID
FROM WC_EPCR_Booking_Detail_F F with(nolock)
INNER JOIN WC_EPCR_Param_G P with(nolock)
ON F.Order_Date_WID >= P.Thirty_Days_Ago_Start_WID
LEFT OUTER JOIN WC_EPCR_Booking_Detail_Seats_FS FS with(nolock)
ON F.Integration_ID = FS.Source + UPPER(FS.CtrlNum)+CONVERT(varchar(5), FS.Line_No)
WHERE FS.CtrlNum is null
You might also want to see about creating a calculated column in WC_EPCR_Booking_Detail_Seats_FS that has Source + UPPER(CtrlNum) + Convert(varchar(5), Line_No) in it, and index that column and CtrlNum. Then you can change that part of the join statement to the calculated column, and you'll get a much faster query. (Another option is an indexed view of the calculated column and CtrlNum.)
After making changes like that, then I'd check the execution plans again.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
June 9, 2008 at 10:36 am
I hate replying to postings that were cross-posted, but in this case, I am going to.
Hardware is not the only thing SQL Server uses to determine execution plans. You have restored the same database to both servers - has data changed on either of them differently in any way? Are there other databases? What else is running on these servers taking resources?
But, before any of that is looked at, you are getting a left anti-semi join in a place in which a loop join is more efficient. That screams statistics.
But, as previously posted - why do you care that the execution plans of a query that you know are inefficient don't match? That's a bit like caring that the tire is flat on your car after it gets hit by a truck at 60 mph.
June 9, 2008 at 11:03 am
I appreciate your comments more than you know. We just purchesed these two servers. We made them Identical. Restored each databse from teh same backup file. Databases are exactly the same. Same number of rows, same size, same options, and same data.
NOTHING else is running on these servers excpet SQL Server 2005.
Before we put them into production, I will be tuning some of these queries. But right now I expect both servers to behave exactly the same, otherwise, something is different.
That is where I am having a hard time. Figuring out what could possibly be different. Same data, same query, same hardware, same CPU Speed, same LUN speed, same EMC disks, no network involved.
Obviously something is different. Wish I knew what it was.
I could see maybe a few seconds different here and there, but 1 second to 10 minutes and 10 secs??? Both queries return NOTHING. There are no matches. Doing a lot of work to return nothing.
I do expect no matches. No new data has been entered so when this query is looking for changes, it better find nothing.
But like I said, 1 second on one server and 10 minutes and 30 seconds on the other along with a different execution plan. WHY?????
June 9, 2008 at 11:15 am
Flush your procedure cache. Update all of your statistics. Like I said, the statistics are pretty likely to be your issue.
You need to update statistics after upgrading a SQL 2000 database anyway.
As soon as you turn on the servers, the differences start building up. If you run a SELECT statement on one server and not on the other, you could have cached a plan and changed the memory pool slightly.
Are you positive that the database restores did not fragment a little and they are not stored differently on disk (making some pages harder to get to)? You have no control over this and it did happen.
It is pretty much impossible to compare servers as you are trying to do. The fact that there is a huge difference is actually good news. It means that there is something that the optimizer does not know on the slow server that can instantly help performance. This should help you narrow it down pretty quickly. However, the best plan is to re-write the query in a way that the optimizer will always have a much better chance of getting a really good plan. Since you know something you have little control over can impact your performance greatly, you should be focusing on making sure the query itself is optimized and the database engine will typically be ok.
June 9, 2008 at 11:19 am
The query you sent me runs in 1 second on both servers. Very nice and I thank you very much for that.
Also, the execution plans are the same.
But as you know, I am still wondering why that other query acts differently.
Thank you again
June 9, 2008 at 11:28 am
As mentioned, there is no guarantee that two instances of SQL Server, even running on the exact same server, will return the same execution plan for the same query. There's so much going on in query plan building that a badly written query can result in a million different execution plans, and all but one of them will perform horribly.
Well-written queries are much more predictable. After all, "well-written" means "gives the server the best solution very easily". The whole point of a well-written query is that it clearly communicates to the server what to do, and it takes advantage of the way the server engine works to get good results.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
June 9, 2008 at 12:11 pm
I did update usage for the entire databse, did an update statistics with full scan, and also did a rebuild of all teh indexes.
It did help. That "BAD" query now, instead of running in 10 minutes and 30 seconds, now runs in 10 minutes and 12 seconds.
Okay, so I guess that wasn't it.
At this point I think I am just going to tune the query and leave it at that and try and accept the fact that I will never determine why there is such a difference with this bad query from one server to another. Especially since the timing of about 150 other queries are exactly the same.
I do appreciate everyone's comments and suggestions. I especially appreciate the re-written SQL code so it was more efficient. I was amazed at the performance.
If you really feel up to a challenge, here a nightmare of a query: I cannot seem to get this query to run in less than an hour. Now if you can tune this beast so it runs in less than 15 minutes, just let me know where to send teh check and for what amount LOL
SELECT
CHF.CONTACT_WID , COUNT(DISTINCT CHF.SOURCE_WID) , COUNT(DISTINCT CASE WHEN POH.OFFER_WID IS NULL THEN 0 ELSE POH.OFFER_WID END) - SUM (CASE WHEN (POH.OFFER_WID = 0 OR POH.OFFER_WID IS NULL) THEN 1 ELSE 0 END)
FROM
W_CAMP_HIST_F CHF LEFT OUTER JOIN W_SRC_OFFR_H POH ON CHF.SOURCE_WID = POH.SOURCE_WID ,
W_PARAM_G PARAM
WHERE
CHF.CONTACT_WID > 0 AND CHF.SOURCE_WID> 0 AND (CHF.ETL_PROC_WID = PARAM.ETL_PROC_WID OR POH.ETL_PROC_WID = PARAM.ETL_PROC_WID) GROUP BY CONTACT_WID
June 9, 2008 at 12:33 pm
First problem with your query is mixing join styles. You use the ANSI Standard for the LEFT OUTER JOIN, but keep the old style join for your INNER JOIN. Makes the code difficult to read.
Second, you have provided no tabble DDL, sample data, expected output, current execution plan. Not much to work on if you are serious about this query being a "killer".
😎
June 9, 2008 at 1:12 pm
My first question is why is W_PARAM_G even in that query? Let's ignore the mixed join criteria for a moment. It has no columns in the Select or Group By, and does nothing except join to another table. Is it merely in there to multiple the rows for some reason?
If you can take that out, it should improve performance by a bit.
Next, try replacing:
COUNT(DISTINCT CASE WHEN POH.OFFER_WID IS NULL THEN 0 ELSE POH.OFFER_WID END)
With:
COUNT(DISTINCT isnull(POH.OFFER_WID, 0))
After that, what is the desired end result of this part:
- SUM (CASE WHEN (POH.OFFER_WID = 0 OR POH.OFFER_WID IS NULL) THEN 1 ELSE 0 END)
It seems to be in there to eliminate the count where OFFER_WID is either null or 0. In which case, wouldn't it be better to just get rid of those rows in the Where clause?
If those assumptions are correct, try this:
SELECT CHF.CONTACT_WID ,
COUNT(DISTINCT CHF.SOURCE_WID) ,
COUNT(DISTINCT POH.OFFER_WID)
FROM W_CAMP_HIST_F CHF
LEFT OUTER JOIN W_SRC_OFFR_H POH
ON CHF.SOURCE_WID = POH.SOURCE_WID
WHERE CHF.CONTACT_WID > 0
AND CHF.SOURCE_WID> 0
and POH.OFFER_WID != 0
GROUP BY CONTACT_WID
If you can, index Contact_WID, Source_WID in W_CAMP_HIST_F, and Offer_WID, Source_WID in W_SRC_OFFR_H. That'll depend on transaction volume and other indexes on those tables, of course. (I'm assuming these are tables, not views.)
If these are views, you'll need to make sure you aren't selecting from views that use tables you don't need in this query.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
June 9, 2008 at 3:18 pm
This pretty good, it returns in 16 minutes, real nice improvement. You gave it a good shot, but the results are wrong because I need it to use the W_PARAM_G.ETL_PROC_WID column to get teh correct results
June 9, 2008 at 3:26 pm
This pretty good, it returns in 16 minutes, real nice improvement. You gave it a good shot, but the results are wrong because I need it to use the W_PARAM_G.ETL_PROC_WID column to get teh correct results
June 9, 2008 at 4:00 pm
The WHERE clause uses the PARAM table to compare the ETL_PROC_WID is in one table or the other to see if it is a change to an order.
If a customer purchased more seats or added more to their existing order, we need to know if it is a change.
Hope that makes sense
I can give you anything you need if it will help.
I really do appreciate all this help. I am amazed how you can do this just by looking at the query.
I do thank you very very much
June 9, 2008 at 4:37 pm
You'd be really amazed what could be done if you provided the DDL of the tables, sample data, expected results based on the sample data. The more information provided, the better the answer you will receive in return.
😎
June 9, 2008 at 7:37 pm
I saved the execution plan into a file, but I do not see any way of including it in here.
The following queries return the counts shown below:
select count(*) from W_SRC_OFFR_H;
select count(*) from W_CAMP_HIST_F;
select count(*) from W_PARAM_G;
11928
63294928
123
EXECUTION PLAN IS:
select -- Compute -- Compute -- Merge Join -- Compute -- Stream --- 2
0% Scaler 0% Scaler 0% Inner Join 0% Scaler 0% Aggreg 0%
|
|
4
2 ---- Sort distinct 0% -- Table Spool -- Nested Loop -- Compute ---- 3
Eager 0% Inner 0% Scaler 0%
|
|
5
3 ---- Hash Match 17% -- Table Scan 0%
Outer Join W_SRC_OFFER_H
|
+------------ Table Scan 64%
W_CAMP_HIST
4 ------ Merge -- Compute --- Stream --- Sort -- Table Spool
Join 0% Scaler 0% Aggrg 0% 0% Eager 33%
|
|
+--- Compute -- Compute -- Stream ---- Dist -- Table Spool 33%
Scaler 0% Scaler 0% Aggreg 0% Sort 0% Eager *
5 -------- Table Scan 3%
W_PARAM_G
* Eager Spool - W_CAMP_HIST_F for CONTACT_WIDE and SOURCE_WID
I hope this helps. If you need more info, please let me know
Thank you all again
Viewing 15 posts - 1 through 15 (of 16 total)
You must be logged in to reply to this topic. Login to reply