February 6, 2014 at 1:41 am
odeddror (2/5/2014)
Hi there, I'm using SQL Server 2012 x64 and VS 2013.
I followed your code and
the T-SQL returns -0.99988192786007
and the CLR returns 0.999766931939805 (a positive number instead of a negative one).
Why are they not the same?
Thanks,
Oded Dror
I'd need to see your input data to debug that.
February 6, 2014 at 6:54 am
T-SQL code - I created a stored proc and a func
/* http://www.sqlservercentral.com/articles/SQLCLR/71942/ */
--linear_regression
USE Development
GO
CREATE TABLE dbo.test_table(
col1 FLOAT NOT NULL ,
col2 FLOAT NOT NULL
)
GO
INSERT INTO dbo.test_table
SELECT TOP 30000 ROW_NUMBER() OVER(ORDER BY s1.id DESC),ROW_NUMBER() OVER(ORDER BY s1.id ASC)
FROM sysobjects,sysobjects s1
GO
Select * from dbo.test_table
GO
--Create function that return table
CREATE FUNCTION udf_linear_regression_X ()
Returns table
AS
Return
(
SELECT ( COUNT(*) * SUM(col1*col2) - ( SUM(col1) * SUM(col2)))/( COUNT(*)* SUM(col1*col1)-( SUM(col1)* SUM(col1))) AS LinearX
FROM dbo.test_table
--select * from dbo.udf_linear_regression_X()
)
--Create proc that return one value
GO
CREATE PROC usp_linear_regression_X
AS
Begin
declare @x FLOAT
declare @y FLOAT
declare @xy FLOAT
declare @x2 FLOAT
declare @n FLOAT
SELECT
@n = COUNT(*),
@x = SUM(col1),
@y = SUM(col2),
@xy = SUM(col1 * col2),
@x2 = SUM(col1* col1)
FROM dbo.test_table
select ((@n * @xy) - (@x*@y)) / (@n * @x2 - (@x*@x))
-- execute usp_linear_regression_X
End
GO
truncate table dbo.test_table
GO
CREATE ASSEMBLY LinearRegression FROM 'C:\Regression.dll'
GO
CREATE AGGREGATE Slope(@x FLOAT, @y FLOAT)
RETURNS FLOAT
EXTERNAL NAME LinearRegression.Slope;
GO
SELECT dbo.Slope(col1,col2)
FROM dbo.test_table
0.999766931939805 --CLR
-0.99988192786007 --SQL sp
-0.99988192786007 --SQL Func
GO
create VIEW v_Linear_regression_X
WITH SCHEMABINDING
as
SELECT --( COUNT_big(*) * SUM(col1*col2) - ( SUM(col1) * SUM(col2)))/( COUNT_big(*)* SUM(col1*col1)-( SUM(col1)* SUM(col1))) AGG
COUNT_big(*) cnt
,SUM(col1*col2) col1col2
,SUM(col1) col1
,SUM(col1*col1) col1sq
,SUM(col2) col2
FROM dbo.test_table
GO
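The commented-out formula can then be applied on top of the schemabound view in one short statement; a minimal sketch, using only the columns defined in the view above:
SELECT (cnt * col1col2 - col1 * col2) / (cnt * col1sq - col1 * col1) AS LinearX
FROM dbo.v_Linear_regression_X;
GO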
Code for Regression.cs:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
[Serializable]
[SqlUserDefinedAggregate(
Format.Native, //use clr serialization to serialize the intermediate result
IsInvariantToNulls = false, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = true //optimizer property
)]
public struct Slope
{
private long count;
private SqlDouble SumY;
private SqlDouble SumXY;
private SqlDouble SumX2;
private SqlDouble SumY2;
private SqlDouble SumX;
static void Main(string[] args)
{
}
public void Init()
{
count = 0;
SumX = SumY = SumXY = SumX2 = SumY2 = 0;
}
public void Accumulate(SqlDouble x, SqlDouble y)
{
if (!x.IsNull && !y.IsNull)
{
count++;
SumX += (SqlDouble)x;
SumY += (SqlDouble)y;
SumXY += (SqlDouble)x * (SqlDouble)y;
SumX2 += (SqlDouble)x * (SqlDouble)x;
SumY2 += (SqlDouble)y * (SqlDouble)y;
}
}
public void Merge(Slope other)
{
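// (left empty here: partial aggregates produced by parallel plans are not combined)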
}
public SqlDouble Terminate()
{
if (count > 0)
{
SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
return value * value;
}
else
{
return 0;
}
}
}
February 7, 2014 at 12:59 pm
@odeddror I think the answer is in the Terminate function
public SqlDouble Terminate()
{
if (count > 0)
{
SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
return value * value;
}
else
{
return 0;
}
}
value * value is the equivalent of squaring the result of the preceding formula, and the square of any number, whether positive or negative, is always positive.
I've been hunting around for my old backup CDs as I replaced my hard drive 18 months ago and don't have my original VS2008 solution.
It takes a sharp eye to peer review some of the stats functions and the code that drives them, so I suspect I may have mixed up the slope and r-squared formulas.
The slope should be negative. It looks like you are seeing the squared slope value.
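For what it's worth, both quantities can be computed side by side in plain T-SQL, which makes it easy to check which one the CLR aggregate is returning. A quick sketch, reusing dbo.test_table from earlier in the thread (nothing else assumed):
SELECT
Slope = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx),
RSquared = SQUARE(n * Sxy - Sx * Sy) / ((n * Sxx - Sx * Sx) * (n * Syy - Sy * Sy))
FROM
(
SELECT
n = COUNT(*) * 1.0,
Sx = SUM(col1),
Sy = SUM(col2),
Sxy = SUM(col1 * col2),
Sxx = SUM(col1 * col1),
Syy = SUM(col2 * col2)
FROM dbo.test_table
) s;
The Slope column should come back negative for your data, while RSquared will always be non-negative.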
February 7, 2014 at 3:58 pm
Hi there,
It's still a positive number.
Thanks,
Ed Dror
December 13, 2019 at 2:25 pm
Hi,
I'm a bit late to this discussion, but we had a requirement to work out median values in SQL this week, so I was looking for a method to achieve this without having to use windowed functions, as I'm always looking for ways to keep things simple for our analysts. Using the examples in this post I was able to put together the following SQL-CLR function. I tested it with a dataset of 1,000,000+ rows and it worked really well.
Steve.
December 13, 2019 at 2:26 pm
It wouldn't let me upload a file, so here is the code:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;
using System.Collections.Generic;
using System.IO;
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
Format.UserDefined, //Binary Serialization because of List<SqlDouble>
IsInvariantToNulls = false, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = true, //optimizer property
MaxByteSize = -1
)]
public struct MEDIAN : IBinarySerialize
{
private List<SqlDouble> lstValues;
public void Init()
{
lstValues = new List<SqlDouble>();
}
public void Accumulate(SqlDouble x) //this is the input parameter from SQL
{
if (!x.IsNull)
{
lstValues.Add(x);
}
}
public void Merge (MEDIAN Group)
{
lstValues.AddRange(Group.lstValues); //this merges the accumulated values if SQL decides to break this into threads
}
public SqlDouble Terminate () //this works out the result to return to SQL
{
if (lstValues.Count > 0)
{
lstValues.Sort(); //sort the list
int iMiddleValue = (int)Math.Ceiling(lstValues.Count / 2.0) - 1; //will round up if odd valued list and take off 1 as lists are zero based
SqlDouble dblMedian;
if (lstValues.Count % 2 > 0) //odd valued list
{
dblMedian = lstValues[iMiddleValue]; //take the middle value
}
else //even valued list
{
dblMedian = (lstValues[iMiddleValue] + lstValues[iMiddleValue+1]) / 2; //take the mean of the two middle values
}
return dblMedian;
}
else
{
return SqlDouble.Null;
}
}
public void Read(BinaryReader r)
{
if (r == null) throw new ArgumentNullException("r");
//SqlDouble value = r.ReadDouble();
lstValues = new List<SqlDouble>();
int pos = 0;
int length = r.ReadInt32();
while (pos < length)
{
lstValues.Add(r.ReadDouble());
pos++;
}
}
public void Write(BinaryWriter w)
{
if (w == null) throw new ArgumentNullException("w");
w.Write(lstValues.Count);
foreach (SqlDouble dbl in lstValues)
{
w.Write(dbl.Value);
}
}
}
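In case it helps anyone wanting to try it, deployment follows the same pattern as the Slope aggregate earlier in the thread. The assembly name and dll path here are just placeholders for wherever you build the project to:
CREATE ASSEMBLY MedianAggregate FROM 'C:\MedianAggregate.dll' WITH PERMISSION_SET = SAFE;
GO
CREATE AGGREGATE dbo.MEDIAN(@value FLOAT)
RETURNS FLOAT
EXTERNAL NAME MedianAggregate.MEDIAN;
GO
-- example call
SELECT dbo.MEDIAN(col1) FROM dbo.test_table;
GO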
December 13, 2019 at 2:59 pm
Hi,
I'm a bit late to this discussion, but we had a requirement to work out median values in SQL this week, so I was looking for a method to achieve this without having to use windowed functions, as I'm always looking for ways to keep things simple for our analysts. Using the examples in this post I was able to put together the following SQL-CLR function. I tested it with a dataset of 1,000,000+ rows and it worked really well.
Steve.
That's cool and thank you for posting the code but what do you mean by "it worked really well"? I'm sure that includes that it worked correctly but I'm curious as to how long it took.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 13, 2019 at 4:51 pm
Hi Jeff,
I took a four-column data set containing waiting time measurements with 1,006,934 rows and then used the median function to calculate the median wait. The query execution took 48 seconds and returned 47 rows. I haven't had time to compare with SQL-only methods, but I think this is a pretty reasonable response time and the code looks really simple (please see below).
DECLARE
@StartDate AS DATE = '01/11/2012'
,@EndDate AS DATE = '31/10/2019';
SELECT
dcu.CommunityUnitLocalDescription
,tsrta.Division
,tsrta.Unit
,MedianWaitTime = Utilities.dbo.MEDIAN(DATEDIFF(DAY, dd2.ActualDate, dd.ActualDate))
FROM
dbo.Fact_Community_Activity fca
LEFT JOIN dbo.Dim_Community_Unit dcu
ON dcu.pkCommunityUnitKey = fca.CommunityUnitKey
LEFT JOIN dbo.Dim_Date dd
ON dd.pkDateKey = fca.CareContactDateKey
LEFT JOIN dbo.Fact_Referral fr
ON fr.pkReferralKey = fca.ReferralKey
LEFT JOIN dbo.Dim_Date dd2
ON dd2.pkDateKey = fr.ReferralReceivedDateKey
OUTER APPLY
(
SELECT TOP(1)
Division
,Unit
FROM
Reporting.dbo.tblSRRevisedTeamActivity
WHERE
Unit = dcu.CommunityUnitLocalDescription
) tsrta
WHERE
dd.ActualDate BETWEEN @StartDate AND @EndDate
AND fca.IsInitialConsultation = 1
GROUP BY
dcu.CommunityUnitLocalDescription
,tsrta.Division
,tsrta.Unit;
Overall I'm quite impressed with SQL-CLR performance and wanted to experiment with the aggregate functions, which I'd not had time to look into before. By the way, I ran this on a SQL Server 2016 Developer Edition virtual machine with 64 GB RAM and four cores.
Steve
December 13, 2019 at 6:28 pm
Awesome. Thanks for the feedback, Steve. I really appreciate it.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2019 at 5:49 pm
>>I should like to see Microsoft consider adding more in-built statistical functions given the way that the BI world is heading. I should also like to see some focusing on integration with external stats packages such as SPSS and SAS. <<
I'm inclined to disagree with this. I would love to see SPSS, SAS and other stat packages be able to have embedded SQL, however. Statistical packages have to be able to handle some really weird floating-point math and other corrections under the covers. The whole idea of SQL was to manipulate data, not to do fancy calculations. You either have to put up with bad statistics from a product designed for data or incredibly poor performance from a computational product that's constantly adjusting data. There is a reason that we have a saw and a screwdriver in carpentry and we don't try to saw wood with a screwdriver or drive screws with a saw 🙂
Please post DDL and follow ANSI/ISO standards when asking for help.
December 15, 2019 at 10:46 pm
>>I should like to see Microsoft consider adding more in-built statistical functions given the way that the BI world is heading. I should also like to see some focusing on integration with external stats packages such as SPSS and SAS. <<
I'm inclined to disagree with this. I would love to see SPSS, SAS and other stat packages be able to have embedded SQL, however. Statistical packages have to be able to handle some really weird floating-point math and other corrections under the covers. The whole idea of SQL was to manipulate data, not to do fancy calculations. You either have to put up with bad statistics from a product designed for data or incredibly poor performance from a computational product that's constantly adjusting data. There is a reason that we have a saw and a screwdriver in carpentry and we don't try to saw wood with a screwdriver or drive screws with a saw 🙂
Heh... I dunno Joe... I have a power screwdriver attachment on my power saw and a Dozuki blade attachment on my replaceable head screwdriver. 😀
Shifting gears a bit, I mostly agree with not fettering T-SQL with the types of things you speak of. There is some merit, though, in being able to do some of those things FROM T-SQL. That's why they've made it possible to interface "R" and Python from T-SQL.
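Just as a frame of reference, calling R from T-SQL looks roughly like this; the table and column names are made up for illustration, and 'external scripts enabled' has to be turned on for the instance:
EXEC sp_execute_external_script
@language = N'R',
@script = N'OutputDataSet <- data.frame(MedianWait = median(InputDataSet$WaitDays))',
@input_data_1 = N'SELECT DATEDIFF(DAY, ReferralDate, ContactDate) AS WaitDays FROM dbo.SomeWaitTimes'
WITH RESULT SETS ((MedianWait FLOAT));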
Now, if we could just get them to create a Decimal/Numeric datatype that works as well as Granny's 4 function calculator without it occasionally automatically rounding to 6 decimal places. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
December 15, 2019 at 11:52 pm
Overall I'm quite impressed with SQLCLR performance and wanted to experiment with the aggregate functions which I'd not had time to look into before.
Hi Steve. If you want more examples of SQLCLR functionality to test / play with, there are plenty to choose from in the Free version of SQL#, a SQLCLR library that I wrote, including a few User-Defined Aggregates (one of which is Median 😺):
Take care,
Solomon...
SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures )
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
December 16, 2019 at 6:03 am
Hi Jeff,
I took a four-column data set containing waiting time measurements with 1,006,934 rows and then used the median function to calculate the median wait. The query execution took 48 seconds and returned 47 rows. I haven't had time to compare with SQL-only methods, but I think this is a pretty reasonable response time and the code looks really simple (please see below).
Overall I'm quite impressed with SQL-CLR performance and wanted to experiment with the aggregate functions which I'd not had time to look into before. By the way, I ran this on a SQL Server 2016 Developer edition virtual machine with 64Gb RAM and four cores.
Steve
Hi Steve
Looking at your code, you have a LEFT JOIN to Dim_Date combined with a WHERE clause that filters on the right-hand table (dd.ActualDate), which effectively turns it into an INNER JOIN.
It is my understanding that logical query processing can give better performance from an INNER JOIN with the filter placed in the join condition, so rows are filtered out sooner. I would be curious to see if the following makes a difference.
Replace
FROM dbo.Fact_Community_Activity fca
LEFT JOIN dbo.Dim_Date dd
ON dd.pkDateKey = fca.CareContactDateKey
WHERE dd.ActualDate BETWEEN @StartDate AND @EndDate
AND fca.IsInitialConsultation = 1
with
FROM dbo.Fact_Community_Activity fca
INNER JOIN dbo.Dim_Date dd -- Add the INNER JOIN immediately after the FROM to filter out sooner
ON dd.pkDateKey = fca.CareContactDateKey
AND dd.ActualDate BETWEEN @StartDate AND @EndDate
AND fca.IsInitialConsultation = 1
Also the following left join appears to serve no purpose other than to give SQL some extra work, so I would remove it
LEFT JOIN dbo.Fact_Referral fr
ON fr.pkReferralKey = fca.ReferralKey
December 16, 2019 at 11:39 am
Hi Des,
Sorry, I got the left join to Fact_Referral in the wrong place in the code I posted. I've changed the order in my post.
I tend to use left joins a lot as I find they work faster for me. I'm no DBA, so I don't fully understand the reasons behind this. In the case of this query I was curious, so I tried your INNER JOIN and got exactly the same results and the same query plan as the original code. Not sure if this is always the case though. I would be interested to do more research into this topic as quite often I have to try to speed up code written by others.
Solomon - thanks for the link. I'll definitely download this and have a look. It could save me a lot of time in future!
Steve
December 16, 2019 at 6:21 pm
SAS makes much of its ability to push models down to Teradata. From what I have seen of SAS, you are best off doing as much as possible in the DB to get the performance you will need.
In terms of "should a DB platform have heavyweight statistical functions", Vertica is a column store that has a number of stats and ML functions. I think we are in the age where we take the code to the data and not the data to the code. It is probably easier to make a DB engine support stats than to make a stats package play nicely with multi-concurrent data access.