December 16, 2010 at 3:08 am
Hi, everybody. I have a problem optimizing my stored procedure, and after a lot of research on cursors vs. the set-based approach, I'm still not sure how to solve it.
Here's the task: I have a table with connection logs. There are 3 types of records: start (type=2), interim (type=3) and stop (type=4), and I have to create a single record by joining the start leg with each interim and stop leg.
We receive the logs in a text file, which we import into a table, and then we run a procedure over it.
I've created a query that gives me a result set with the start and stop parts of the record:
select * from (select * from log where msgtype=2) A
inner join (select * from log where msgtype>2) B on A.SessionID=B.SessionID
where A.FileID=XY or B.FileID=XY
The procedure looks like this:
declare c cursor for QUERY_ABOVE
open c
fetch next from c into ...
while @@FETCH_STATUS = 0
begin
    -- do the processing and format the output record
    -- insert the output record into the result table
    -- update: mark the original records as joined
    fetch next from c into ...
end
The problem is that the files we import are getting bigger and bigger; currently the procedure needs 15 minutes to process 130,000 records, so I've started looking for solutions. After some quick debugging, I realized that stepping through the cursor simply takes a lot of time and that I can't improve much on the current logic, and after reading the articles on cursors vs. the set-based approach, it became clear that I should rewrite the whole thing. BUT, I just don't see how I could put all the formatting logic from the procedure into a query (which is what I should do, if I'm not wrong)?
The output record consists of 20+ fixed-length fields, and the content of each field is calculated from various columns of the original table in different ways, with different criteria, and often the value of, say, field_15 depends on the value in field_1, etc.
I could make it a UDF and then do something like
select my_udf(sessionid, start, stop,.....) from QUERY_ABOVE
but something tells me that this is the same thing as a cursor: it will still process row by row, not as a set.
So, how do I avoid the cursor?
I hope I've been clear enough. Any clues and help are welcome.
December 16, 2010 at 4:15 am
My default answer would be that it can be translated into a set-based solution, but it depends on what is happening when you say 'do the processing and formatting the output record'. Is that a single simple statement?
Are you able to post the full code + DDL and sample data ?
December 16, 2010 at 5:05 am
Unfortunately, I'm not allowed to post the full code:(
But I have a bunch of similar code for determining the value of each field in the output record:
/* FIELD 2 */
if @Field_1 = 'F5'
    select @field_2 = '63'
else
    if len(@TerminatingIOI) > 1
        if @ConnType = 0
            select @field_2 = '54'
        else
            select @field_2 = '61'
    else
        if @ConnType = 0
            select @field_2 = '56'
        else
            select @field_2 = '60'
/* FIELD 3 */
if @Field_1 = 'F3'
begin
    if @B_RECNUM = 1 and @B_MSGTYPE = 3
        select @field_3 = '2'
    else if @B_RECNUM > 1 and @B_MSGTYPE = 3
        select @field_3 = '3'
    else if @B_RECNUM > 1 and @B_MSGTYPE = 4
        select @field_3 = '4'
end
Of course, these are very simple; there are a few more complicated fields which depend on the result of some stored procedure, like this:
execute checkportnum @POperNum,@field_14,@field_15, @Poper OUTPUT
if @Poper>'' select @field_11 = @Poper + @POperNum
etc etc
In the meantime, I've got an idea: create a temporary table in which to store the results from the query, with additional columns for the output fields. Then, instead of looping through the recordset and doing the formatting row by row, I could use a series of updates based on the logic for each field, as shown above?
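Something like this, maybe (untested; I'm inventing the column names here, they'd come from our log table):
-- stage the joined legs plus empty output columns
select A.SessionID,
       A.Field_1, A.TerminatingIOI, A.ConnType,
       B.msgtype as B_MSGTYPE, B.RecNum as B_RECNUM,
       cast(null as char(2)) as field_2,
       cast(null as char(1)) as field_3
into #work
from (select * from log where msgtype = 2) A
inner join (select * from log where msgtype > 2) B on A.SessionID = B.SessionID
where A.FileID = XY or B.FileID = XY

-- then one set-based update per output field, e.g. field_2 using the logic above
update #work
set field_2 = case when Field_1 = 'F5' then '63'
                   when len(TerminatingIOI) > 1 and ConnType = 0 then '54'
                   when len(TerminatingIOI) > 1 then '61'
                   when ConnType = 0 then '56'
                   else '60'
              end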
December 16, 2010 at 5:19 am
If I were you, I would start by simplifying the logic in the loop and moving it piecemeal into the cursor's select statement.
i.e....
(Untested)
Create function dbo.CalcField_2(@Field_1 char(2), @TerminatingIOI char(<Something>), @ConnType int)
returns table
as
return
(
    Select Case when @Field_1 = 'F5' then '63'
                else case when len(@TerminatingIOI) > 1
                          then case when @ConnType = 0 then '54' else '61' end
                          else case when @ConnType = 0 then '56' else '60' end
                     end
           end as Field_2
    from (Select 1 as Col) as Row
)
So then you will be able to remove that code and use it like this...
Select * from
<YourSourceTable> Cross apply dbo.CalcField_2(Field_1,TerminatingIOI,ConnType )
After you've done that to all the smaller bits of code, you should be in a better situation to remove the looping altogether.
BTW, notice how I've used an inline table function, not a scalar one.
December 16, 2010 at 5:42 am
If you are going to use functions, use scalars and not table-based ones for the logic. That should allow you to maintain the code and not take too much of a performance hit. As for changing your cursor to a set-based solution, as long as you are not calling stored procedures (logging messages or some such), you should be able to get to a single update statement.
My suggestion is that you investigate using a CTE to simplify your JOINs and then create scalar functions to process the required transformations, which again will simplify writing the SQL. Once you have it working, you can test moving different scalars back into your core procedure until you find the right balance.
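Roughly like this (untested; here CalcField_2 and CalcField_3 stand for scalar versions of the functions, ResultTable is an assumed target, and the column names are guesses):
;with Matched as
(
    select A.SessionID, A.Field_1, A.TerminatingIOI, A.ConnType,
           B.msgtype as B_MSGTYPE, B.RecNum as B_RECNUM
    from (select * from log where msgtype = 2) A
    inner join (select * from log where msgtype > 2) B on A.SessionID = B.SessionID
    where A.FileID = XY or B.FileID = XY
)
insert into ResultTable (SessionID, field_2, field_3)
select SessionID,
       dbo.CalcField_2(Field_1, TerminatingIOI, ConnType),
       dbo.CalcField_3(Field_1, B_RECNUM, B_MSGTYPE)
from Matched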
--------------------------------------------------------------------------
When you realize you've dug yourself into a hole....Step 1...stop digging.
December 16, 2010 at 6:07 am
Charles.Otten (12/16/2010)
If you are going to use functions, use scalars and not table-based ones for the logic. That should allow you to maintain the code and not take too much of a performance hit.
I'm afraid I can't agree with that statement at all.
Scalar functions are nothing but a 'performance sink'. Very rarely (twice or so this year) do I need one.
Even an empty scalar function uses more resources, by orders of magnitude.
for example
create function dbo.Scalar(@id integer)
returns int
as
begin
    return 1
end
go
create function dbo.Inline(@id integer)
returns table
as
return
(
    Select 1 as one
    from (Select null as n) as x
)
go
-- scalar version: the function is called once per row
select top(100000) 1, dbo.Scalar(1)
from sys.columns a cross join sys.columns b
go
-- inline version: the function is expanded into the query plan
select top(100000) 1, inl.one
from sys.columns a cross join sys.columns b
cross apply dbo.Inline(1) inl
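If you want numbers rather than my word for it, switch the timings on and run the two selects above; the CPU and elapsed times show up on the Messages tab:
set statistics time on
-- ...run the scalar select, then the inline select, and compare...
set statistics time off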
The other danger with scalars is that someone else may decide to 'enhance' the function with more functionality and add in a few SELECTs. 🙂
December 16, 2010 at 6:12 am
I agree that the scalar functionality you are describing is bad. Virtually any scalar using a SELECT is just going to be awful. The scalars I'm talking about would be passed the values that you want the function to "calculate". Also, note I suggested using scalars to simplify the SQL development and then removing them by moving the functionality back into the source SQL.
A scalar function (that does not touch any table objects) usually performs decently, and it does simplify coding. I do, however, strongly suggest they get moved back into the source SQL.
--------------------------------------------------------------------------
When you realize you've dug yourself into a hole....Step 1...stop digging.
December 16, 2010 at 7:24 am
Dave Ballantyne (12/16/2010)
After you've done that to all the smaller bits of code, you should be in a better situation to remove the looping altogether.
BTW, notice how I've used an inline table function, not a scalar one.
So, in the end I'd have something like this?
Select * from
<YourSourceTable> Cross apply dbo.CalcField_1(param1,param2,param3,...)
Cross apply dbo.CalcField_2(param1,param2,param3,...)
...
Cross apply dbo.CalcField_20(param1,param2,param3,...)
It looks pretty messy and uninviting, I must admit:(
What about my idea with temp table and series of updates? Is it worth trying?
December 16, 2010 at 7:53 am
KennethParker (12/16/2010)
So, in the end I'd have something like this?
Select * from
<YourSourceTable> Cross apply dbo.CalcField_1(param1,param2,param3,...)
Cross apply dbo.CalcField_2(param1,param2,param3,...)
...
Cross apply dbo.CalcField_20(param1,param2,param3,...)
It looks pretty messy and uninviting, I must admit:(
What about my idea with temp table and series of updates? Is it worth trying?
Messy and uninviting, possibly 🙂, but now this is giving you more options: you could create a view, calculated columns, etc.
In any case, if I had to choose between 'messy' code and slow code, I know what I would choose 😉
The temp table idea has legs, but it may not be the most performant option due to the constant updating; it could create a lot of logging data. I would treat that as a secondary concern.
For now, concentrate on getting as much logic as possible into the primary select.
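i.e. once each CalcField_N exists, the whole thing collapses into one insert...select. Something of this shape (untested; ResultTable and CalcField_3 are assumed, the latter built the same way as CalcField_2):
insert into ResultTable (SessionID, field_2, field_3 /* , ... */)
select src.SessionID, f2.Field_2, f3.Field_3
from <YourSourceTable> src
cross apply dbo.CalcField_2(src.Field_1, src.TerminatingIOI, src.ConnType) f2
cross apply dbo.CalcField_3(src.Field_1, src.B_RECNUM, src.B_MSGTYPE) f3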
December 16, 2010 at 7:54 am
Charles.Otten (12/16/2010)
I agree that the scalar functionality you are describing is bad. Virtually any scalar using a SELECT is just going to be awful. The scalars I'm talking about would be passed the values that you want the function to "calculate". Also, note I suggested using scalars to simplify the SQL development and then removing them by moving the functionality back into the source SQL. A scalar function (that does not touch any table objects) usually performs decently, and it does simplify coding. I do, however, strongly suggest they get moved back into the source SQL.
Unfortunately, in some cases I must use stored procedures or some SELECTs to determine the field value, so scalar functions wouldn't throw out the cursor completely.
December 16, 2010 at 8:06 am
Dave Ballantyne (12/16/2010)
KennethParker (12/16/2010)
So, in the end I'd have something like this?
Select * from
<YourSourceTable> Cross apply dbo.CalcField_1(param1,param2,param3,...)
Cross apply dbo.CalcField_2(param1,param2,param3,...)
...
Cross apply dbo.CalcField_20(param1,param2,param3,...)
It looks pretty messy and uninviting, I must admit:(
What about my idea with temp table and series of updates? Is it worth trying?
Messy and uninviting, possibly 🙂, but now this is giving you more options: you could create a view, calculated columns, etc.
In any case, if I had to choose between 'messy' code and slow code, I know what I would choose 😉
The temp table idea has legs, but it may not be the most performant option due to the constant updating; it could create a lot of logging data. I would treat that as a secondary concern.
For now, concentrate on getting as much logic as possible into the primary select.
I would agree here. Messy code is better than slow code.
I would look heavily into the temp table route. Many times a complex process will run faster when you break it into a series of temp table inserts/updates/deletes/selects.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
December 16, 2010 at 2:05 pm
Perhaps I am restating the obvious, but at a minimum you can use set-based operations to set values for all the columns that don't require a stored procedure. Given the set of operations you've described, I don't even see why you'd need temp tables, just a sequence of operations which also update a status flag in your source and/or target log tables. (You've stated you already have the former.) And then use a cursor for a much smaller set of variables and operations truly requiring sprocs.
The process would look something like this. First:
INSERT INTO <YourTargetTable> (a, b, c, etc)
SELECT * from -- your SessionID-matching join here
Hopefully you could also replace or refactor away all of those stored procedures you are calling with more set-based logic; this would probably require adding more joins to your original SessionID-matching join, but depending on the logic in your sprocs, it could be doable, particularly if those sprocs aren't themselves infected with more cursor logic (or UPDATEs affecting multiple tables) and are SELECT-oriented.
Then, to indicate that you have processed the old records, even partially, you can do an UPDATE to your old log tables based on a very similar SELECT join clause. (You know that you can do an update to a single table based on joined tables, right?)
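For example, something like this (assuming a Joined flag on the log table, per your "mark original records as joined" step, and ResultTable as the target):
update L
set L.Joined = 1
from log as L
inner join ResultTable as R on R.SessionID = L.SessionID
where L.FileID = XY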
Then, if you hadn't eliminated all the sprocs, you could have your minimal cursor logic that iterates through a smaller set of fields based on either your earlier SessionID-matching join and/or your new target table. And some operation in the cursor would do an UPDATE on the needed rows.
And then (assuming you hadn't eliminated/refactored all the sprocs), you could re-update (again based on your SessionID-matching join condition) your update-status flag in the original log tables to indicate the update is fully (not just partially) complete.
(I'd also mention that if it eases the implementation logic, or if you want to refactor the sprocs away but they might in turn involve multiple UPDATEs, it might help to have a similar flag in your new combined table that tracks the update status. That flag effectively tracks the SessionIDs which still need to be operated on by your intermediate set operations, rather than repeating your big SessionID-matching join each time. You insert those records early on, so afterwards you can just operate on and update the set of them that you just inserted, once you have the flag you need.)
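A rough sketch of that flag idea (Processed is a hypothetical flag column on the combined table; the field_11 logic is borrowed from your earlier post and assumes Poper/POperNum were persisted as columns):
-- each intermediate set operation touches only the rows still in flight
update ResultTable
set field_11 = Poper + POperNum
where Processed = 0 and Poper > ''

-- ...more per-field updates against Processed = 0...

update ResultTable
set Processed = 1
where Processed = 0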
December 17, 2010 at 1:48 am
Thank you for your advice; you've all been very helpful. Now I have a clearer picture of what to do and which way to go. I'll let you know about the outcome of the whole project, hopefully it won't be such a nightmare after all:)