July 10, 2008 at 5:26 pm
I'm trying to create a LAG function, this will be of the form LAG(column,x) and will return the value from "column" x rows before the current row. I know it can be done in TSQL by joining the data to it'self on row_number and row_number -x but I want to see about doing it using a UDF and CLR.
here is my code thus far.
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Partial Public Class UserDefinedFunctions
_
Public Shared Function Lag(ByVal value As Object, ByVal iLag As Integer) As Object
Dim oReturnedValue As Object = ""
Static Dim Values As ArrayList
Try
If IsNothing(Values) Then
Values = New ArrayList
End If
Values.Add(value)
If iLag < Values.Count Then
oReturnedValue = Values(Values.Count - iLag - 1)
Values.Remove(0)
Else
oReturnedValue = ""
End If
Catch ex As Exception
' catch error
End Try
Return oReturnedValue
End Function
End Class
Now this seems to work fine in VS2008 debug mode using the following test script
with data as
(
select 1 as col
union all
select 2 as col
union all
select 3 as col
union all
select 4 as col
union all
select 5 as col
)
selectcol,
dbo.Lag(col,2) as lag2
from data
giving the following results
col lag2
1
2
3 1
4 2
5 3
however if I include an additional column like
with data as
(
select 1 as col
union all
select 2 as col
union all
select 3 as col
union all
select 4 as col
union all
select 5 as col
)
selectcol,
dbo.Lag(col,2) as lag2
dbo.Lag(col,3) as lag3
from data
this gives a result of
col lag2 lag3
1
2 1 1
3 2 2
4 3 3
5 4 4
Which is a bit strange, I'm guessing that it's got somethign to do with the static arraylist being retained or used by both calls. Does anyone out there have any other ideas on this.
July 10, 2008 at 9:09 pm
This isn't something I would use CLR for. I am curious why you want to do this in CLR like this. A few things that I see is that you aren't using the SQL native types for your output value or inputs. You also are storing data in a static field which means this has to be an UNSAFE assembly. This is against recommended best practices for implementing CLR inside SQL Server.
You can get the same output using just TSQL and save the expense that CLR is going to cost here:
with data as
(
select 1 as col
union all
select 2 as col
union all
select 3 as col
union all
select 4 as col
union all
select 5 as col
)
select col,
CASE WHEN (col-2) <= 0 then '' else convert(varchar,(col-2)) end as lag2,
CASE WHEN (col-3) <= 0 then '' else convert(varchar,(col-3)) end as lag3
from data
Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
My Blog | Twitter | MVP Profile
Training | Consulting | Become a SQLskills Insider
Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]
July 10, 2008 at 9:29 pm
I am not a VB.NET guru, I do everything in C#, so I had to run this through a converter, but the following should do what you want while maintaining SAFE CAS permissions:
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Public Partial Class UserDefinedFunctions
_
Public Shared Function Lag(ByVal value As SqlInt32, ByVal lag As SqlInt32) As SqlInt32
If value - lag < 0 OrElse value - lag = 0 Then
Return New SqlInt32()
End If
Return New SqlInt32(CInt(value) - CInt(lag))
End Function
End Class
I am really confused why you would want to do this in CLR though, and why you chose to implement a ArrayList for it. The rule of thumb is that if you can do it with TSQL, you don't put it in CLR. In reality, and I am sure Jeff or Matt will back this statement up, there is precious little that you probably should be implementing in CLR inside of SQL. I have written alot of CLR objects for SQL Server, but very little has ever made it past just playing around with it. There are a few shortcuts that I have done for myself as a DBA to make my own life easier, but most things that I would try in SQL using CLR I probably want outside of my databases instead.
There are a lot of reasons for this, memory being one of them, segregation of workloads and environments being another.
Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
My Blog | Twitter | MVP Profile
Training | Consulting | Become a SQLskills Insider
Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]
July 11, 2008 at 12:38 am
Thanks for your feedback guys but I think you have both missed the point. I'm not doing simple arithmetic with the values. The lag function will return a value from a row other than the row that is currently being processed. so LAG(col1,5) will return the value of col1 from the row 5 rows before the current row.
And yes, while I know I can do this in SQL by assiging a row number using the ROW_NUMBER function then joining the data to itself on rownumber = rownumber - lagvalue or by using a select statement to pick the row based on the current row. I'd like to try it in CLR.
Adrian.
July 11, 2008 at 8:23 am
Having a static member variable isn't going to allow you to ever guarantee results. This is because until the appdomain unloads, your variable contains the last values set to it. This is why code that has static variables that aren't readonly requires unsafe access. What you want to do is like reinventing the wheel for no other reason than to try it, which is fine, and I hope you learn something from it, I know I have done that numerous times with CLR in SQL. Even if you do create a function that does what you want, which I don't think you will, I am pretty sure that if you test it against a row_number() joined solution, you are going to find that it is won't perform as good as the TSQL.
Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
My Blog | Twitter | MVP Profile
Training | Consulting | Become a SQLskills Insider
Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]
July 13, 2008 at 11:50 pm
Thanks Jonathon,
Thats what I was thinking. However I did think that a CLR routine might be quicker than SQL as it wouldn't need to need to read the data more than once as it can just store the past few previous values and pop them out when required. Anyway, the inability to know when the SQL statement has finished so I could clear the static variables pretty much stops me doing what I wanted to do this way.. so it'll be an SQL solution.
Thanks again.
July 19, 2008 at 10:16 pm
FYI, although I would not advise it, this can be done.
What you would need to do is to add a "handle" argument to your function, and then find a way to generate a unique handle value to pass in (I can think of several). Now you VB SQLFunction can create a Static List of Arraylist and use the handle as the index into the List to get the ArrayList specific to you particular context.
Of course, now you have Clean-Up issues and you still have the "Unsafe" problem.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
July 20, 2008 at 11:34 pm
Yep, agree with RBarryYoung on this one. CLR is definitely the wrong tool for this task - even with his indexer method, in your example best case you're going to have a copy of your working data set per nested function call in CLR memory space. Not something you really want happening on your DB server...
As you point out the fastest approach algorithmically is a single pass through the data set. Have a look at Jeff's article here[/url] on using the "quirky update" method to calc running totals and related values. I'm sure you could adapt that technique to do your lag calculations - although it may be less than graceful looking in the end (no function, lots of accumulator variables, etc.) it will blow the doors off of any other method performance-wise.
Regards,
Jacob
February 26, 2011 at 5:54 am
Very interesting post, which I found while doing a search for static members on sqlclr functions in order to solve the same problem! I had previously just thought that it wasn't going to be possible to have them.
I have to say that I think this isn't as bad an idea as it is being made out to be. Getting the previous row value for a column can be essential when trying to calculate things such as deltas, which is an example of where the "quirky" update method wouldn't help.
It also has applications where you need to transform some data, setting values based on the last value you saw, and you don't want to have to resort to cursors, or self-joins.
However, as has been said, an implementation using statics is surrounded by pitfalls, however, it's not impossible to do (though not ideal and somewhat unsafe) and it does performs better than the equivalent row_number() based self join.
Basically, there a number of problems to solve, namely: (i) memory footprint, (ii) cleanup, as statics don't clear until an app domain unload, so even other sessions can see that static state, (iii) safety.
For (i) & (ii), as RBarryYoung points out you could use some kind of unique ID to identify which lag you need to return values for. However, if you use an ID that's too unique, like a GUID, you'll keep allocating memory without cleaning it up: not a good idea.
However, what you can do is limit the amount of memory you use to a pretty much fixed amount. As long as this is a fairly small amount, you can get away without doing cleanup, or letting app unload take care it.
How do you do this? Well, firstly you don't store an array for the values that you want to return. You can convert that code to store only 1 value no matter how many values you want to lag by, based on the number of values that you have "seen". I'll leave this as a exercise for the reader.
Secondly, you allocate an ID based on the session, the time the query started and the instance of lag being used in the query. This is enough to isolate the instance of lag from any other. However, and this is where it's not so nice: you'll need the user to give you these values as parameters. You could look at ways of "hiding" this (like a wrapper function).
So, your function call in sql would be something like:
select a, b, dbo.lag(@@spid, getdate(), 1, a, 1) from ...
to lag by 1.
Now, inside the sqlclr function you keep a dictionary of spid's mapped to a control block containing the datetime, the lag value, etc.. If the map doesn't contain an entry for a session, create one. If it does, and the datetime passed in doesn't match the stored one, clear out the control block, and store the new datetime.
You would also do some validation on the session id passed in, like limiting it's range, like to say 1000. By doing this, you are saying that until an appdomain unload, you are going to store a maximum of 1000 control blocks. Each control block will have a different lag value, but it would be only 1 value, and you could also constrain the value's size.
This way, if you have a sensible memory configuration on your server, you don't have to worry about cleanup.
If the user gets some of these "special" parameters wrong, it doesn't affect memory footprint, unless they repeatedly get it wrong for the session id, however, only up to a maximum amount. It does mean that they will screw up their results, and maybe somebody elses! However, this would only happen if the wrongly enter session id was actually connected and was running a lag query at the same time.
If you did write a wrapper function that took spid out of the parameters, you would get safety, but at the cost of the extra function call, which isn't completely cheap. Safety vs Speed, the choice is yours.
On that note, as for number (iii), there is more than just assembly safety. As just mentioned, incorrect use could potentially screw up results in more than the current session, and taking the other issues into account, I wouldn't put this into Production, though maybe into dev/test for ad hoc queries, and even then probably only for myself.
As for assembly safety, you could just create a special database for your unsafe code, and put the assembly in there. This function is kind of semi-unsafe in my opinion. There's a small chance you might get the wrong results (through incorrect use), but hopefully you're not going to crash the server.
Also, for thread-safety, use a synchronized block or equivalent around access to the dictionary.
There is also ordering to think about, and this one concerns me the most. From my tests, the execution plans have shown Compute Scalar always happening after the Sort, which is what you want. However, the optimizer is allowed to break that, which would give you the wrong result.
One final note on safety is parallelism, which, if the optimizer decided to use it, could totally screw up results, so ensure you use maxdop 1.
Finally, as for performance, doing this:
-- 175616 rows
select o.*
into tmp
from sysobjects o, sysobjects o2, sysobjects
GO
and then:
with o as (
select *, row_number() over (order by id) as rownum
from tmp
)
select * from o
inner join o o2 on o.rownum - 1 = o2.rownum
GO
gave me an average (over 3 runs, starting from a cold cache) runtime of 8453 ms.
Doing this:
create clustered index IX on tmp (id)
GO
then gave an average runtime of 7223 ms.
Both times, the execution plan showed a merge join.
Comparing this to:
select *, dbo.lag(@@spid, getdate(), 1, id, 1)
from tmp
order by id
without the clustered index gave an average runtime of 6837 ms, and with it 5959 ms.
This is pretty much a 20% improvement in performance. I was sure this would be the case from the outset, and couldn't understand why it was being said that it wasn't the case, however, I'll admit, I thought it might have been a bit faster still.
The other major benefit of this is that even though it has those special parameters at the being, it is much easier to write than the equivalent self-join, especially when you need to apply it to a query based on more than one table.
So, even though not completely ideal, something that could be done. Really, this is something that Microsoft should be providing, like other vendors do. But until then ...
Muhmud.
February 27, 2011 at 3:40 am
Vote for SQLCLR user-defined analytic functions here:
On a separate note, and while I agree that LAG would be problematic for large LAG values (memory requirement), the interested person could do worse that look at how Adam Machanic's Query Parallelizer uses Thread Statics and other trickery:
http://sqlblog.com/files/folders/beta/entry29021.aspx
Certainly not for the amateur, or faint-of-heart.
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply