February 6, 2014 at 1:41 am
odeddror (2/5/2014)
Hi there, I'm using SQL Server 2012 x64 and VS 2013.
I followed your code and
the T-SQL returns -0.99988192786007
and the CLR returns 0.999766931939805 (a positive number instead of a negative one).
Why are they not the same?
Thanks,
Oded Dror
I'd need to see your input data to debug that.
February 6, 2014 at 6:54 am
T-SQL code - I created a stored proc and a func
/* http://www.sqlservercentral.com/articles/SQLCLR/71942/ */
--linear_regression
USE Development
GO
CREATE TABLE dbo.test_table(
col1 FLOAT NOT NULL ,
col2 FLOAT NOT NULL
)
GO
INSERT INTO dbo.test_table
SELECT TOP 30000 ROW_NUMBER() OVER(ORDER BY s1.id DESC),ROW_NUMBER() OVER(ORDER BY s1.id ASC)
FROM sysobjects,sysobjects s1
GO
Select * from dbo.test_table
GO
--Create function that return table
CREATE FUNCTION udf_linear_regression_X ()
Returns table
AS
Return
(
SELECT ( COUNT(*) * SUM(col1*col2) - ( SUM(col1) * SUM(col2)))/( COUNT(*)* SUM(col1*col1)-( SUM(col1)* SUM(col1))) AS LinearX
FROM dbo.test_table
--select * from dbo.udf_linear_regression_X()
)
--Create proc that return one value
GO
CREATE PROC usp_linear_regression_X
AS
Begin
declare @x FLOAT
declare @y FLOAT
declare @xy FLOAT
declare @x2 FLOAT
declare @n FLOAT
SELECT
@n = COUNT(*),
@x = SUM(col1),
@y = SUM(col2),
@xy = SUM(col1 * col2),
@x2 = SUM(col1* col1)
FROM dbo.test_table
select ((@n * @xy) - (@x*@y)) / (@n * @x2 - (@x*@x))
-- execute usp_linear_regression_X
End
GO
truncate table dbo.test_table
GO
CREATE ASSEMBLY LinearRegression FROM 'C:\Regression.dll'
GO
CREATE AGGREGATE Slope(@x FLOAT, @y FLOAT)
RETURNS FLOAT
EXTERNAL NAME LinearRegression.Slope;
GO
SELECT dbo.Slope(col1,col2)
FROM dbo.test_table
0.999766931939805 --CLR
-0.99988192786007 --SQL sp
-0.99988192786007 --SQL Func
GO
create VIEW v_Linear_regression_X
WITH SCHEMABINDING
as
SELECT --( COUNT_big(*) * SUM(col1*col2) - ( SUM(col1) * SUM(col2)))/( COUNT_big(*)* SUM(col1*col1)-( SUM(col1)* SUM(col1))) AGG
COUNT_big(*) cnt
,SUM(col1*col2) col1col2
,SUM(col1) col1
,SUM(col1*col1) col1sq
,SUM(col2) col2
FROM dbo.test_table
GO
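The commented-out formula can then be applied on top of the schemabound view in one short statement; a minimal sketch, using only the columns defined in the view above:
SELECT (cnt * col1col2 - col1 * col2) / (cnt * col1sq - col1 * col1) AS LinearX
FROM dbo.v_Linear_regression_X;
GO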
Code for Regression.cs:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
[Serializable]
[SqlUserDefinedAggregate(
Format.Native, //use clr serialization to serialize the intermediate result
IsInvariantToNulls = false, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = true //optimizer property
)]
public struct Slope
{
private long count;
private SqlDouble SumY;
private SqlDouble SumXY;
private SqlDouble SumX2;
private SqlDouble SumY2;
private SqlDouble SumX;
static void Main(string[] args)
{
}
public void Init()
{
count = 0;
SumX = SumY = SumXY = SumX2 = SumY2 = 0;
}
public void Accumulate(SqlDouble x, SqlDouble y)
{
if (!x.IsNull && !y.IsNull)
{
count++;
SumX += (SqlDouble)x;
SumY += (SqlDouble)y;
SumXY += (SqlDouble)x * (SqlDouble)y;
SumX2 += (SqlDouble)x * (SqlDouble)x;
SumY2 += (SqlDouble)y * (SqlDouble)y;
}
}
public void Merge(Slope other)
{
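// (left empty here: partial aggregates produced by parallel plans are not combined)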
}
public SqlDouble Terminate()
{
if (count > 0)
{
SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
return value * value;
}
else
{
return 0;
}
}
}
February 7, 2014 at 12:59 pm
@odeddror I think the answer is in the Terminate function
public SqlDouble Terminate()
{
if (count > 0)
{
SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
return value * value;
}
else
{
return 0;
}
}
value * value is the equivalent of squaring the result of the preceding formula, and the square of any number, whether positive or negative, is always positive.
I've been hunting around for my old backup CDs as I replaced my hard drive 18 months ago and don't have my original VS2008 solution.
It takes a sharp eye to peer review some of the stats functions and the code that drives them, so I suspect I may have mixed up the slope and r-squared formulas.
The slope should be negative. It looks like you are seeing the squared slope value.
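For what it's worth, both quantities can be computed side by side in plain T-SQL, which makes it easy to check which one the CLR aggregate is returning. A quick sketch, reusing dbo.test_table from earlier in the thread (nothing else assumed):
SELECT
Slope = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx),
RSquared = SQUARE(n * Sxy - Sx * Sy) / ((n * Sxx - Sx * Sx) * (n * Syy - Sy * Sy))
FROM
(
SELECT
n = COUNT(*) * 1.0,
Sx = SUM(col1),
Sy = SUM(col2),
Sxy = SUM(col1 * col2),
Sxx = SUM(col1 * col1),
Syy = SUM(col2 * col2)
FROM dbo.test_table
) s;
The Slope column should come back negative for your data, while RSquared will always be non-negative.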
February 7, 2014 at 3:58 pm
Hi there,
It's still a positive number.
Thanks,
Ed Dror
December 13, 2019 at 2:25 pm
Hi,
I'm a bit late to this discussion, but we had a requirement to work out median values in SQL this week, so I was looking for a method to achieve this without having to use windowed functions, as I'm always looking for ways to keep things simple for our analysts. Using the examples in this post I was able to put together the following SQL-CLR function. I tested it with a dataset of 1,000,000+ rows and it worked really well.
Steve.
December 13, 2019 at 2:26 pm
It wouldn't let me upload a file, so here is the code:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;
using System.Collections.Generic;
using System.IO;
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
Format.UserDefined, //Binary Serialization because of List<SqlDouble>
IsInvariantToNulls = false, //optimizer property
IsInvariantToDuplicates = false, //optimizer property
IsInvariantToOrder = true, //optimizer property
MaxByteSize = -1
)]
public struct MEDIAN : IBinarySerialize
{
private List<SqlDouble> lstValues;
public void Init()
{
lstValues = new List<SqlDouble>();
}
public void Accumulate(SqlDouble x) //this is the input parameter from SQL
{
if (!x.IsNull)
{
lstValues.Add(x);
}
}
public void Merge (MEDIAN Group)
{
lstValues.AddRange(Group.lstValues); //this merges the accumulated values if SQL decides to break this into threads
}
public SqlDouble Terminate () //this works out the result to return to SQL
{
if (lstValues.Count > 0)
{
lstValues.Sort(); //sort the list
int iMiddleValue = (int)Math.Ceiling(lstValues.Count / 2.0) - 1; //will round up if odd valued list and take off 1 as lists are zero based
SqlDouble dblMedian;
if (lstValues.Count % 2 > 0) //odd valued list
{
dblMedian = lstValues[iMiddleValue]; //take the middle value
}
else //even valued list
{
dblMedian = (lstValues[iMiddleValue] + lstValues[iMiddleValue+1]) / 2; //take the mean of the two middle values
}
return dblMedian;
}
else
{
return SqlDouble.Null;
}
}
public void Read(BinaryReader r)
{
if (r == null) throw new ArgumentNullException("r");
//SqlDouble value = r.ReadDouble();
lstValues = new List<SqlDouble>();
int pos = 0;
int length = r.ReadInt32();
while (pos < length)
{
lstValues.Add(r.ReadDouble());
pos++;
}
}
public void Write(BinaryWriter w)
{
if (w == null) throw new ArgumentNullException("w");
w.Write(lstValues.Count);
foreach (SqlDouble dbl in lstValues)
{
w.Write(dbl.Value);
}
}
}
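In case it helps anyone wanting to try it, deployment follows the same pattern as the Slope aggregate earlier in the thread. The assembly name and dll path here are just placeholders for wherever you build the project to:
CREATE ASSEMBLY MedianAggregate FROM 'C:\MedianAggregate.dll' WITH PERMISSION_SET = SAFE;
GO
CREATE AGGREGATE dbo.MEDIAN(@value FLOAT)
RETURNS FLOAT
EXTERNAL NAME MedianAggregate.MEDIAN;
GO
-- example call
SELECT dbo.MEDIAN(col1) FROM dbo.test_table;
GO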
December 13, 2019 at 2:59 pm
Hi,
I'm a bit late to this discussion, but we had a requirement to work out median values in SQL this week, so I was looking for a method to achieve this without having to use windowed functions, as I'm always looking for ways to keep things simple for our analysts. Using the examples in this post I was able to put together the following SQL-CLR function. I tested it with a dataset of 1,000,000+ rows and it worked really well.
Steve.
That's cool and thank you for posting the code but what do you mean by "it worked really well"? I'm sure that includes that it worked correctly but I'm curious as to how long it took.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 13, 2019 at 4:51 pm
Hi Jeff,
I took a four-column data set containing waiting time measurements with 1,006,934 rows and then used the median function to calculate the median wait. The query execution took 48 seconds and returned 47 rows. I haven't had time to compare with SQL-only methods, but I think this is a pretty reasonable response time and the code looks really simple (please see below).
DECLARE
@StartDate AS DATE = '01/11/2012'
,@EndDate AS DATE = '31/10/2019';
SELECT
dcu.CommunityUnitLocalDescription
,tsrta.Division
,tsrta.Unit
,MedianWaitTime = Utilities.dbo.MEDIAN(DATEDIFF(DAY, dd2.ActualDate, dd.ActualDate))
FROM
dbo.Fact_Community_Activity fca
LEFT JOIN dbo.Dim_Community_Unit dcu
ON dcu.pkCommunityUnitKey = fca.CommunityUnitKey
LEFT JOIN dbo.Dim_Date dd
ON dd.pkDateKey = fca.CareContactDateKey
LEFT JOIN dbo.Fact_Referral fr
ON fr.pkReferralKey = fca.ReferralKey
LEFT JOIN dbo.Dim_Date dd2
ON dd2.pkDateKey = fr.ReferralReceivedDateKey
OUTER APPLY
(
SELECT TOP(1)
Division
,Unit
FROM
Reporting.dbo.tblSRRevisedTeamActivity
WHERE
Unit = dcu.CommunityUnitLocalDescription
) tsrta
WHERE
dd.ActualDate BETWEEN @StartDate AND @EndDate
AND fca.IsInitialConsultation = 1
GROUP BY
dcu.CommunityUnitLocalDescription
,tsrta.Division
,tsrta.Unit;
Overall I'm quite impressed with SQL-CLR performance and wanted to experiment with the aggregate functions, which I'd not had time to look into before. By the way, I ran this on a SQL Server 2016 Developer Edition virtual machine with 64 GB RAM and four cores.
Steve
December 13, 2019 at 6:28 pm
Awesome. Thanks for the feedback, Steve. I really appreciate it.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 14, 2019 at 5:49 pm
>>I should like to see Microsoft consider adding more in-built statistical functions given the way that the BI world is heading. I should also like to see some focusing on integration with external stats packages such as SPSS and SAS. <<
I'm inclined to disagree with this. I would love to see SPSS, SAS and other stat packages be able to have embedded SQL, however. Statistical packages have to be able to handle some really weird floating-point math and other corrections under the covers. The whole idea of SQL was to manipulate data, not to do fancy calculations. You either have to put up with bad statistics from a product designed for data or incredibly poor performance from a computational product that's constantly adjusting data. There is a reason that we have a saw and a screwdriver in carpentry and we don't try to saw wood with a screwdriver or drive screws with a saw 🙂
Please post DDL and follow ANSI/ISO standards when asking for help.
December 15, 2019 at 10:46 pm
>>I should like to see Microsoft consider adding more in-built statistical functions given the way that the BI world is heading. I should also like to see some focusing on integration with external stats packages such as SPSS and SAS. <<
I'm inclined to disagree with this. I would love to see SPSS, SAS and other stat packages be able to have embedded SQL, however. Statistical packages have to be able to handle some really weird floating-point math and other corrections under the covers. The whole idea of SQL was to manipulate data, not to do fancy calculations. You either have to put up with bad statistics from a product designed for data or incredibly poor performance from a computational product that's constantly adjusting data. There is a reason that we have a saw and a screwdriver in carpentry and we don't try to saw wood with a screwdriver or drive screws with a saw 🙂
Heh... I dunno Joe... I have a power screwdriver attachment on my power saw and a Dozuki blade attachment on my replaceable head screwdriver. 😀
Shifting gears a bit, I mostly agree with not fettering T-SQL with the types of things you speak of. There is some merit, though, in being able to do some of those things FROM T-SQL. That's why they've made it possible to interface "R" and Python from T-SQL.
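Just as a frame of reference, calling R from T-SQL looks roughly like this; the table and column names are made up for illustration, and 'external scripts enabled' has to be turned on for the instance:
EXEC sp_execute_external_script
@language = N'R',
@script = N'OutputDataSet <- data.frame(MedianWait = median(InputDataSet$WaitDays))',
@input_data_1 = N'SELECT DATEDIFF(DAY, ReferralDate, ContactDate) AS WaitDays FROM dbo.SomeWaitTimes'
WITH RESULT SETS ((MedianWait FLOAT));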
Now, if we could just get them to create a Decimal/Numeric datatype that works as well as Granny's 4 function calculator without it occasionally automatically rounding to 6 decimal places. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
December 15, 2019 at 11:52 pm
Overall I'm quite impressed with SQLCLR performance and wanted to experiment with the aggregate functions which I'd not had time to look into before.
Hi Steve. If you want more examples of SQLCLR functionality to test / play with, there are plenty to choose from in the Free version of SQL#, a SQLCLR library that I wrote, including a few User-Defined Aggregates (one of which is Median 😺):
Take care,
Solomon...
SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures )
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
December 16, 2019 at 6:03 am
Hi Jeff,
I took a four-column data set containing waiting time measurements with 1,006,934 rows and then used the median function to calculate the median wait. The query execution took 48 seconds and returned 47 rows. I haven't had time to compare with SQL-only methods, but I think this is a pretty reasonable response time and the code looks really simple (please see below).
Overall I'm quite impressed with SQL-CLR performance and wanted to experiment with the aggregate functions which I'd not had time to look into before. By the way, I ran this on a SQL Server 2016 Developer edition virtual machine with 64Gb RAM and four cores.
Steve
Hi Steve
Looking at your code, you have a LEFT JOIN to Dim_Date combined with a WHERE clause that filters on the right-hand table (dd.ActualDate), which effectively turns it into an INNER JOIN.
It is my understanding that logical query processing can give better performance from an INNER JOIN with the filter placed in the join condition, so rows are filtered out sooner. I would be curious to see if the following makes a difference.
Replace
FROM dbo.Fact_Community_Activity fca
LEFT JOIN dbo.Dim_Date dd
ON dd.pkDateKey = fca.CareContactDateKey
WHERE dd.ActualDate BETWEEN @StartDate AND @EndDate
AND fca.IsInitialConsultation = 1
with
FROM dbo.Fact_Community_Activity fca
INNER JOIN dbo.Dim_Date dd -- Add the INNER JOIN immediately after the FROM to filter out sooner
ON dd.pkDateKey = fca.CareContactDateKey
AND dd.ActualDate BETWEEN @StartDate AND @EndDate
AND fca.IsInitialConsultation = 1
Also the following left join appears to serve no purpose other than to give SQL some extra work, so I would remove it
LEFT JOIN dbo.Fact_Referral fr
ON fr.pkReferralKey = fca.ReferralKey
December 16, 2019 at 11:39 am
Hi Des,
Sorry, I got the left join to Fact_Referral in the wrong place in the code I posted. I've changed the order in my post.
I tend to use left joins a lot as I find they work faster for me. I'm no DBA, so I don't fully understand the reasons behind this. In the case of this query I was curious, so I tried your INNER JOIN and got exactly the same results and the same query plan as the original code. Not sure if this is always the case though. I would be interested to do more research into this topic as quite often I have to try to speed up code written by others.
Solomon - thanks for the link. I'll definitely download this and have a look. It could save me a lot of time in future!
Steve
December 16, 2019 at 6:21 pm
SAS makes much of its ability to push models down to Teradata. From what I have seen of SAS, you are best off doing as much as possible in the DB to get the performance you will need.
In terms of "should a DB platform have heavyweight statistical functions", Vertica is a column store that has a number of stats and ML functions. I think we are in the age where we take the code to the data and not the data to the code. It is probably easier to make a DB engine support stats than to make a stats package play nicely with multi-concurrent data access.