Performance problems on Live server

  • We have some very strange performance problems and we are struggling to get a handle on what might be wrong.

    We took a backup of the live databases and restored them on our test server. We developed some new code and ran it against the databases on the test server with no problems. We then deployed the code to Live and started having problems. Examples include:

    - steps which take a few minutes on Test take hours on Live (when we have enough patience to let them complete - we gave up on one step after it had been running for 10 hours, compared with a few minutes on Test)

    - rerunning a step which took a few seconds to complete one day took 3 minutes to complete a day later.

    Both the Live and Test servers have a very similar setup: RAID 5 array for the data; 2 GB RAM on Test vs 6 GB on Live (AWE was enabled on Live but we have turned it off for the moment to see if that was causing any problems); etc.

    The Live server gets to a step and we see the CPU usage max out and stay there. Using sp_who2 we see the process gobbling up CPU but little or no I/O activity.

    I accept that we could tune our queries some more, but the fact remains that the queries run fine (within acceptable time frames) on Test (which is a restore of Live) but don't run well on the Live server. I don't believe that the code or the amount of data (a 20-million-row table joined with two other tables of 1 million rows each) is the problem, as it runs fine on the test server.

    What should we be looking at?

    Jez

  • Update.

    I have split a proc with 6 steps into individual procs and am still getting similar problems. However, if I copy the SQL statement out of the offending proc and execute it in a Query Analyzer window, it runs in 2 seconds.

    Every time I go back and execute the proc containing this statement it takes a lot longer (I normally kill it after 5 minutes).

    Is there anything different about running a query in QA rather than within an SP?

    Jez

  • I've seen similar issues and struggled with them. For some reason the query optimizer runs things differently in a proc at times.

    If possible, it helps to try and clear the plan that's cached for the sproc. A WITH RECOMPILE might help, but then you're compiling every time.
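    For example (the proc name below is hypothetical - substitute the one that's misbehaving):

    -- flush the cached plan for one proc so the next call compiles fresh
    exec sp_recompile 'dbo.usp_process_revenue'

    -- or the heavy-handed version: clear the entire server's plan cache
    dbcc freeproccache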

    Can you post a little code or let us know what the bad proc does?

    Also, are you sure the indexes are up to date, as well as the stats? I'm assuming the data sizes are the same in test and prod? Do you send in a wide variety of parameters to the proc?

  • This is the code that was causing us problems this morning:

    insert into account_revenue
    (
        Account_id,
        balance_date,
        revenue_type_code,
        no_of_trans,
        discount_value,
        revenue_at_standard_tariff,
        actual_revenue,
        ytd_actual_revenue,
        revenue_ccy_id,
        ytd_discount_value,
        ytd_revenue_at_standard_tariff,
        ytd_revenue_ccy_id,
        revenue_type_def_id,
        input_batchid
    )
    select
        previous_month.uni_account_id,
        @current_processing_date,
        previous_month.uni_revenue_type,
        0 as no_of_trans,
        0 as discount_value,
        0 as revenue_at_standard_tariff,
        0 as actual_revenue,
        case when datepart(month, @current_processing_date) = 1 -- if January then YTD revenue = 0
            then 0
            else previous_month.YTD_Actual_Revenue
        end as YTD_Actual_Revenue,
        previous_month.uni_act_rev_ccy,
        0 as ytd_discount_value,
        0 as ytd_revenue_at_standard_tariff,
        previous_month.Uni_YTD_Act_Rev_CCy, -- v2.0
        15, -- not sure what this should be?!
        previous_month.batchid
    from wealth_lk..mer_sbi_batch_revenue previous_month
    join account account
        on account.account_id = previous_month.uni_account_id
        and account.location_id = @uni_location_id
        and account.account_id < 30000000 -- exclude adjustments
    left join wealth_lk..mer_sbi_batch_revenue this_month
        on this_month.uni_account_id = previous_month.uni_account_id
        and this_month.uni_revenue_type = previous_month.uni_revenue_type
        and this_month.month_date = @current_processing_date
    where previous_month.month_date = @previous_month_end
        and this_month.uni_account_id is null

    This is part of the execution plan:

    StmtText

    declare @current_processing_date datetime, @uni_location_id int, @previous_month_end datetime
    set @current_processing_date = '2007-10-31'
    set @uni_location_id = 104
    set @previous_month_end = '2007-09-30'

    insert into account_revenue ( Account_id, balance_date, revenue_type_code, no_of_trans, discount_value, revenue_at_standard_tariff, actual_revenue, ytd_actual_revenue, revenue_ccy_id, ytd_discount_value, ytd_revenue_at_standard_tariff, ytd_revenue_ccy_id, revenue_type_def_id, input_batchid ) select previous_month.uni_account_id, @current_processing_date, previous_month.uni_revenue_type, 0 as no_of_trans, 0 as discount_value, 0 as revenue_at_standard_tariff, 0 as actual_revenue, case when datepart(month,@current_processing_date)= 1 --If JANUARY then YTD Revenue = 0 then 0 else previous_month.YTD_Actual_Revenue end as YTD_Actual_Revenue, previous_month.uni_act_rev_ccy, 0 as ytd_discount_value, 0 as ytd_revenue_at_standard_tariff, previous_month.Uni_YTD_Act_Rev_CCy, --v2.0 15, --not sure what this should be?! previous_month.batchid from wealth_lk..mer_sbi_batch_revenue previous_month join acco

    |--Assert(WHERE:(If (NOT([Pass1027]) AND ([Expr1026] IS NULL)) then 0 else If (NOT([Pass1029]) AND ([Expr1028] IS NULL)) then 1 else If (NOT([Pass1031]) AND ([Expr1030] IS NULL)) then 2 else If (NOT([Pass1033]) AND ([Expr1032] IS NULL)) then 3 else NULL))
    |--Nested Loops(Left Semi Join, WHERE:(([previous_month].[Uni_Account_Id] IS NULL) OR ([@current_processing_date] IS NULL))OUTER REFERENCES:([previous_month].[Uni_Account_Id]), DEFINE:([Expr1032] = [PROBE VALUE]))
    |--Nested Loops(Left Semi Join, WHERE:(([previous_month].[Uni_Revenue_Type] IS NULL) OR (15 IS NULL))OUTER REFERENCES:([previous_month].[Uni_Revenue_Type], 15), DEFINE:([Expr1030] = [PROBE VALUE]))
    | |--Nested Loops(Left Semi Join, WHERE:([previous_month].[Uni_YTD_Act_Rev_CCy] IS NULL)OUTER REFERENCES:([previous_month].[Uni_YTD_Act_Rev_CCy]), DEFINE:([Expr1028] = [PROBE VALUE]))
    | | |--Nested Loops(Left Semi Join, WHERE:([previous_month].[Uni_Act_Rev_Ccy] IS NULL)OUTER REFERENCES:([previous_month].[Uni_Act_Rev_Ccy]), DEFINE:([Expr1026] = [PROBE VALUE]))
    | | | |--Table Insert(OBJECT:([WEALTH_DW].[dbo].[Account_Revenue]), SET:([Account_Revenue].[YTD_Revenue_at_Standard_Tariff]=RaiseIfNull(0.00), [Account_Revenue].[timestamp]=[Expr1013], [Account_Revenue].[Userid]=RaiseIfNull(user_id(NULL)), [Account_Revenue].[Last_Changed_Date]=RaiseIfNull(getdate()), [Account_Revenue].[Input_batchID]=[previous_month].[BatchId], [Account_Revenue].[Revenue_Type_Def_Id]=RaiseIfNull(15), [Account_Revenue].[YTD_Revenue_Ccy_Id]=[previous_month].[Uni_YTD_Act_Rev_CCy], [Account_Revenue].[YTD_Discount_Value]=0.00, [Account_Revenue].[Revenue_Ccy_Id]=RaiseIfNull([previous_month].[Uni_Act_Rev_Ccy]), [Account_Revenue].[YTD_Actual_Revenue]=[Expr1007], [Account_Revenue].[Actual_Revenue]=0.00, [Account_Revenue].[Revenue_at_Standard_Tariff]=0.00, [Account_Revenue].[Discount_Value]=0.00, [Account_Revenue].[No_Of_Trans]=0, [Account_Revenue].[Revenue_Type_Code]=[previous_month].[Uni_Revenue_Type], [Account_Revenue].[Balance_Date]=RaiseIfNull([@current_processing_date]), [Acc
    | | | | |--Top(ROWCOUNT est 0)
    | | | | |--Compute Scalar(DEFINE:([Expr1007]=If (datepart(month, [@current_processing_date])=1) then 0.00 else [previous_month].[YTD_Actual_Revenue], [Expr1022]=getdate(), [Expr1024]=user_id(NULL), [Expr1013]=gettimestamp(9)))
    | | | | |--Nested Loops(Inner Join, OUTER REFERENCES:([previous_month].[Uni_Account_Id]))
    | | | | |--Filter(WHERE:([this_month].[Uni_Account_Id]=NULL))
    | | | | | |--Merge Join(Right Outer Join, MERGE:([this_month].[Uni_Account_Id], [this_month].[Uni_Revenue_Type])=([previous_month].[Uni_Account_Id], [previous_month].[Uni_Revenue_Type]), RESIDUAL:([previous_month].[Uni_Account_Id]=[this_month].[Uni_Account_Id] AND [previous_month].[Uni_Revenue_Type]=[this_month].[Uni_Revenue_Type]))
    | | | | | |--Clustered Index Seek(OBJECT:([WEALTH_LK].[dbo].[Mer_Sbi_Batch_Revenue].[PK_Mer_Sbi_Batch_Revenue] AS [this_month]), SEEK:([this_month].[Month_Date]=[@current_processing_date]) ORDERED FORWARD)
    | | | | | |--Clustered Index Seek(OBJECT:([WEALTH_LK].[dbo].[Mer_Sbi_Batch_Revenue].[PK_Mer_Sbi_Batch_Revenue] AS [previous_month]), SEEK:([previous_month].[Month_Date]=[@previous_month_end]) ORDERED FORWARD)
    | | | | |--Index Seek(OBJECT:([WEALTH_DW].[dbo].[Account].[IK_Account_Location_Id] AS [account]), SEEK:([account].[Location_Id]=[@uni_location_id] AND [account].[Account_Id]=[previous_month].[Uni_Account_Id]), WHERE:([account].[Account_Id]<30000000) ORDERED FORWARD)
    | | | |--Row Count Spool
    | | | |--Clustered Index Seek(OBJECT:([WEALTH_DW].[dbo].[Currency].[PK_Currency]), SEEK:([Currency].[Currency_Id]=[previous_month].[Uni_Act_Rev_Ccy]) ORDERED FORWARD)
    | | |--Row Count Spool
    | | |--Clustered Index Seek(OBJECT:([WEALTH_DW].[dbo].[Currency].[PK_Currency]), SEEK:([Currency].[Currency_Id]=[previous_month].[Uni_YTD_Act_Rev_CCy]) ORDERED FORWARD)
    | |--Row Count Spool
    | |--Index Seek(OBJECT:([WEALTH_DW].[dbo].[Code_Table].[IK_Code_Table_Code_Def_Id]), SEEK:([Code_Table].[Code_Def_Id]=15 AND [Code_Table].[Code_Id]=[previous_month].[Uni_Revenue_Type]) ORDERED FORWARD)
    |--Row Count Spool
    |--Index Seek(OBJECT:([WEALTH_DW].[dbo].[Account_Balance].[PK_Account_Balance]), SEEK:([Account_Balance].[Account_Id]=[previous_month].[Uni_Account_Id] AND [Account_Balance].[Balance_Date]=[@current_processing_date]) ORDERED FORWARD)

    (and a bit more)

    PhysicalOp             LogicalOp
    --------------------   --------------------
    NULL                   NULL
    NULL                   NULL
    NULL                   NULL
    NULL                   NULL
    Assert                 Assert
    Nested Loops           Left Semi Join
    Nested Loops           Left Semi Join
    Nested Loops           Left Semi Join
    Nested Loops           Left Semi Join
    Table Insert           Insert
    Top                    Top
    Compute Scalar         Compute Scalar
    Nested Loops           Inner Join
    Filter                 Filter
    Merge Join             Right Outer Join
    Clustered Index Seek   Clustered Index Seek
    Clustered Index Seek   Clustered Index Seek
    Index Seek             Index Seek
    Row Count Spool        Lazy Spool
    Clustered Index Seek   Clustered Index Seek
    Row Count Spool        Lazy Spool
    Clustered Index Seek   Clustered Index Seek
    Row Count Spool        Lazy Spool
    Index Seek             Index Seek
    Row Count Spool        Lazy Spool
    Index Seek             Index Seek

    We are processing for different source systems (@uni_location_id) for the current/previous month (@current_processing_date/@previous_month_end) in our warehouse and this procedure processes the data for a specific source system (specified by a parameter coming into the proc). There are about a dozen different source systems.

    We have WITH RECOMPILE in the proc header already - as it is a warehouse, the overhead of recompiling the proc each time is insignificant. The proc might take between 1 and 10 minutes to run depending on which source system we are processing (some have a much bigger universe of data than others), and a second to recompile is not worth worrying about. The size of the data varies from a few thousand rows to half a million depending on the source system, and SQL Server might choose a different execution plan for different sizes of data - hence the recompile, I guess.

    We haven't rebuilt the indexes on the tables in this particular query. There was another query that was causing us problems yesterday and we rebuilt its index using DBCC DBREINDEX, but this didn't have any effect. We can update the stats and see what happens.
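    For example, something along these lines (the WEALTH_DW database name is taken from the plan above; the fill factor of 90 is just an example):

    -- refresh distribution statistics on the tables in the problem query
    use wealth_lk
    update statistics mer_sbi_batch_revenue with fullscan

    use wealth_dw
    update statistics account with fullscan
    update statistics account_revenue with fullscan

    -- full index rebuild, as we tried yesterday ('' means all indexes on the table)
    dbcc dbreindex ('account_revenue', '', 90)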

    Currently, I have extracted all the T-SQL from the proc and pasted it into QA, and it runs fine.

    We have tried restoring the database onto a different raid array (mirrored rather than RAID 5) to see if there was a hardware problem with the array but it made no difference.

    What is so frustrating is that there seems to be no consistency in where it goes awry. There are 3 procs, each with 7 or 8 steps, and there is no single point at which it 'fails'. It also works 'fine' on the test server, which was a restore of the live databases from the previous month - so full size, same indexes, same statistics, etc.

    Any more ideas?

    Jez

  • OK, just a suggestion, but have you taken a look at the patch levels of the RAID controller, motherboard bus, NICs, etc.?

    I'm not a server engineer so all of that is over my head. I cannot believe that there is something wrong with SQL Server, as we have the same setup on the test server (SP4), so I can only imagine it is either the hardware on the prod server or a Windows issue. Either way I don't know where to look. We have had a look at the array controllers and they report everything is OK (no failed disk, battery cache working, etc.) but apart from that I am lost.

    Jez

  • As you say, it is not consistently happening in the same place, so I would get the engineers to compare the test and live hardware.

    You can't stuff a large sock down a small pipe quickly, and that may be your issue here. The only other test you can do is to put your Test system under load and then run the proc against it, better simulating the live system.

    Also try the trial edition of Spotlight (Quest Software) to give you a visual indicator of what is going on at both the SQL and Windows level; I find it very useful at times.

  • Hi,

    Have you tried running SQL Profiler on the server while the process is running, to confirm the statements you are expecting to run are in fact running?

    Have you also run sp_who2 while the application is running and checked for locks? Could the application somehow be locking itself by not committing transactions correctly?
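    For example, on SQL Server 2000 something like this shows any blocked sessions directly:

    -- sessions currently blocked by another spid;
    -- 'blocked' holds the spid of the blocker
    select spid, blocked as blocking_spid, waittime, lastwaittype, cmd
    from master..sysprocesses
    where blocked <> 0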

    Hope the above helps!

    Pete

    Peter Gadsby
    Business Intelligence Consultant
    www.hgconsult.co.uk

  • We've tried running Profiler but not had much success (no real experience as to what to record).

    We have run sp_who2 and it was the only process running on the box at the time, with no locks. The process did split into parallel processes, and for some of these the CPU time kept going up with no DiskIO change. I cannot think of what kind of processing (from within the procedure) would cause the CPU usage to hit the roof. It's almost as if it has entered an infinite loop.

    Jez

  • I would try the maxdop (maximum degree of parallelism) option on your select query and see if that helps. It limits the query to a specific number of processors; I would test with 1 to start. I've seen that make a huge difference with the type of problem you are describing. Sometimes a query runs faster when it's not running in parallel - I'm not sure why, but it's worth a shot. If I remember correctly, you put the hint at the end of the query: "option (maxdop 1)". See the "OPTION clause" in BOL. The degree of parallelism can also be set at the server level, but that affects every query, so be careful changing it.
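    A sketch of both forms, using a table name from the query posted above (the count(*) is just a stand-in for the real select):

    -- statement-level hint: run this one query on a single CPU
    select count(*)
    from wealth_lk..mer_sbi_batch_revenue
    option (maxdop 1)

    -- server-wide equivalent - affects every query, so change with care
    exec sp_configure 'show advanced options', 1
    reconfigure
    exec sp_configure 'max degree of parallelism', 1
    reconfigure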

  • So, double check one more time...

    All indexes (including primary keys) on the test box are also available on the production box.

    All tables have the same row count on both boxes.

    All tables have the same, if any, triggers.

    Both boxes have the same amount of memory available.

    Both boxes have the same amount of TempDB available.

    You're 100% sure the proc is identical on both boxes.

    Default fill factor on both boxes is identical.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    The database on the test server is a restore of live so the tables will have the same rowcount, the indexes are identical and the triggers (one to update the timestamp column and one to record details into an archive table) are the same.

    The production box was using AWE (~3 GB of SQL RAM) but the test box is using normal RAM (~2 GB) - we have currently disabled AWE on the prod box and it is now using 2 GB of RAM, so essentially the same.
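    (For reference, this is roughly how we toggled it - AWE is an advanced option and only takes effect after a service restart:)

    exec sp_configure 'show advanced options', 1
    reconfigure
    -- 0 = off; this was 1 while AWE was enabled
    exec sp_configure 'awe enabled', 0
    reconfigure
    -- restart the SQL Server service for the change to apply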

    Tempdb availability is similar on both boxes - unlimited growth - but in Test the tempdb is on the same RAID array as the database (RAID 5), whereas in Live it is on a separate array (RAID 0+1).

    What is strange is that if I run the code from the SP in Query Analyzer on the prod box it works fine - it completed in less than a minute, compared with the SP which was still running after a couple of hours (it is reading 20-million-row tables and producing a 500,000-row result set, so I never expect a sub-second response). I had to do this so that we could complete the month-end processing. When running the SP and using sp_who2, we see the CPUTime column go up and up and up but there is no DiskIO activity. This is what is confusing us, because the code seems to be fine when submitted through QA but 'fails' when executed via a procedure.

    Jez

  • Are you being hit by "parameter sniffing"? That's where SQL Server compiles the proc's plan using the parameter values from one call and then reuses that plan for later calls where those values make it a poor fit. Search here for a description of typical problems attributed to this 'feature' and simple solutions to same.

  • Andrew has hit the nail on the head.

    I have amended the procs so that they use temporary (#) tables and local variables rather than the parameters passed into the proc, and they all run much better.
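    Roughly this pattern, with illustrative names (the real procs take more parameters and do real work):

    -- copy each parameter into a local variable; the optimizer cannot sniff
    -- locals, so it builds a generic plan instead of one tailored to
    -- (and possibly terrible outside of) the first call's values
    create procedure dbo.usp_process_revenue
        @p_uni_location_id int,
        @p_current_processing_date datetime
    as
    begin
        declare @uni_location_id int, @current_processing_date datetime
        set @uni_location_id = @p_uni_location_id
        set @current_processing_date = @p_current_processing_date

        -- queries below reference only the local variables
        select count(*)
        from wealth_lk..mer_sbi_batch_revenue
        where month_date = @current_processing_date
    end
    go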

    It still doesn't answer the question of why we didn't experience these problems on the test server - maybe the sequence in which we executed the procedures for the different platforms (and different numbers of rows affected) was different and generated different execution plans.

    It doesn't really matter, as we have an answer to our performance problems. Parameter sniffing is definitely one to keep in mind in future.

    Thanks everyone for your ideas.

    Jez

  • As an alternative to local variables, I would suggest that you change the procs back to using parameters and mark each proc WITH RECOMPILE.

    With variables you'll get average performance, never bad, but also not the best you could have. If you recompile on every execution, you're guaranteed to get the best plan for each exec.
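    For example (names illustrative, matching the sketch above):

    -- compile a fresh plan on every call, so each parameter set
    -- gets a plan suited to its own data volume
    create procedure dbo.usp_process_revenue
        @uni_location_id int,
        @current_processing_date datetime
    with recompile
    as
        select count(*)
        from wealth_lk..mer_sbi_batch_revenue
        where month_date = @current_processing_date
    go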

    There will be a CPU overhead for the recompiles, but it may be acceptable. Try it and see.

    p.s. Param sniffing is only a problem if different calls to the procs have different execution paths or affect wildly differing numbers of rows.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
