December 29, 2010 at 3:23 pm
Have you tried using SqlSingle instead of SqlDouble in the CLR?
December 29, 2010 at 3:38 pm
David.Poole (12/29/2010)
Thanks Solomon. What I was getting at with the Median is that, to find the middle value, you have to keep track of all the values, and my worry (as yet unsubstantiated) is that you could run an aggregate over such a large recordset that you consume all the RAM on your system and destabilize your server.
Hello David. I am not sure this is a real issue since in SQL Server 2005 the max memory that the UDA can take up is 8k, and even with the -1 MaxByteSize available starting in SQL Server 2008 you still only get up to 2 GB. And yes, 2 GB is a lot, but it is a constraint that is presumably less than the total amount of memory available to SQL Server. Also, when using the compression, the memory usage will be even less.
Of course, as Adam pointed out, it is possible to do this (Median) in straight T-SQL, but I have also included "Geometric Average" and "Root Mean Square" Aggregates in my SQL# library, and neither one needs to keep the entire set in memory, so it doesn't matter how large the dataset is. And to your final point, I also think that Microsoft should include at least those two ("Geometric Average" and "Root Mean Square"), if not all three, as built-in Aggregates.
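For reference, the reason those two do not need the full set in memory is that each reduces to a running total. A rough pure T-SQL equivalent of both (the table and column names here are just placeholders, and the Geometric Average form assumes all values are positive):
-- Geometric Average: EXP of the average of the natural LOGs (requires every value > 0)
-- Root Mean Square: square root of the average of the squared values
SELECT EXP(AVG(LOG(SomeValue))) AS GeometricAverage,
       SQRT(AVG(SomeValue * SomeValue)) AS RootMeanSquare
FROM dbo.SomeTable;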
SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
December 29, 2010 at 3:59 pm
Bradley Deem (12/29/2010)
I think part of this limitation is due to the lack of ability to use an ORDER BY clause with an aggregate. Assuming I've implemented it correctly, this could produce output such as
firstRowValue,secondRowValue,thirdRowValue
or maybe parallelism kicks in (on a large set) and produces output where the order changes, i.e.
secondRowValue,firstRowValue,thirdRowValue
Now, if we had the ability to use an ORDER BY clause with the OVER clause, i.e.
SELECT dbo.List(myStringColumn,',') OVER(ORDER BY myStringColumn)
FROM myTable
Then we could guarantee the correct (expected) order of output.
Hello Bradley. I am not sure that I agree that this cannot be done in the Aggregate (hence requiring the ORDER BY). Maybe I am thinking of something that is not accurate, but I think this can be handled in the "Terminate()" method. The parallelism issue that you mention is handled via the "Merge()" method but the output still needs to go through the "Terminate()" method so it is there that you could properly order the string. Would that work?
SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
December 29, 2010 at 4:25 pm
Solomon Rutzky (12/29/2010)
Bradley Deem (12/29/2010)
I think part of this limitation is due to the lack of ability to use an ORDER BY clause with an aggregate. Assuming I've implemented it correctly, this could produce output such as
firstRowValue,secondRowValue,thirdRowValue
or maybe parallelism kicks in (on a large set) and produces output where the order changes, i.e.
secondRowValue,firstRowValue,thirdRowValue
Now, if we had the ability to use an ORDER BY clause with the OVER clause, i.e.
SELECT dbo.List(myStringColumn,',') OVER(ORDER BY myStringColumn)
FROM myTable
Then we could guarantee the correct (expected) order of output.
Hello Bradley. I am not sure that I agree that this cannot be done in the Aggregate (hence requiring the ORDER BY). Maybe I am thinking of something that is not accurate, but I think this can be handled in the "Terminate()" method. The parallelism issue that you mention is handled via the "Merge()" method but the output still needs to go through the "Terminate()" method so it is there that you could properly order the string. Would that work?
You are correct: you could handle this in the Terminate() method, but who wants to work out the fastest CLR way to sort strings (or some other input) when we have SQL at our fingertips?
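For example, an ordered, comma-delimited list can be built in pure T-SQL (2005 and up) with FOR XML PATH; a minimal sketch using the myTable / myStringColumn names from the example above:
SELECT STUFF(
       (SELECT ',' + myStringColumn
        FROM myTable
        ORDER BY myStringColumn
        FOR XML PATH(''), TYPE
       ).value('.', 'NVARCHAR(MAX)'),
       1, 1, '') AS OrderedList;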
January 3, 2011 at 9:12 am
CELKO (12/29/2010)
rdw-769613 (12/29/2010)
Another good example of where to use CLR is GMT/Local/Unix time handling. Trying this in T-SQL is nearly impossible, while in CLR it's simple and faster.
I always did this by mimicking the ANSI/ISO Standard system tables for the WITH TIME ZONE option in the DDL. Then everything is done with VIEWs. No need for a CLR.
Joe, do you have an article on this that you could link, or is this in one of your books? I've had my own issues with this over time 😀 and would like to see your implementation.
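In the meantime, here is purely my own guess at what a VIEW-based approach might look like, ignoring DST transitions (dbo.TimeZoneOffsets and dbo.Events are hypothetical stand-ins for the ANSI/ISO-style system tables and the base data):
-- Hypothetical lookup table standing in for the ANSI/ISO time zone system tables
CREATE TABLE dbo.TimeZoneOffsets
(
    TimeZoneCode  VARCHAR(10) NOT NULL PRIMARY KEY,
    OffsetMinutes INT NOT NULL
);
GO
-- Base data is stored in UTC; the VIEW exposes a local time per zone
CREATE VIEW dbo.EventLocalTimes
AS
SELECT e.EventId,
       tz.TimeZoneCode,
       DATEADD(MINUTE, tz.OffsetMinutes, e.EventTimeUTC) AS EventTimeLocal
FROM dbo.Events AS e
CROSS JOIN dbo.TimeZoneOffsets AS tz;
GO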
Beer's Law: Absolutum obsoletum
"if it works it's out-of-date"
January 4, 2011 at 4:58 am
Solomon Rutzky (12/29/2010)
Adam Machanic (12/29/2010)
Have you compared the performance of doing a median calculation w/compression against performing the same calculation using Joe Celko's ROW_NUMBER technique? http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx
Compression is far from cheap, and I suspect that the ROW_NUMBER solution will scale much better. But that's just a suspicion and I don't have time to test myself at the moment.
Hi Adam. Thanks for mentioning that article as I had not seen it. I just tested this against the AdventureWorks2008 DB on SQL Server 2008 SP2, using the Celko method, the Ben-Gan method, and my CLR UDA. The CLR method has the same number of reads as the Celko method but slightly higher CPU and Elapsed times compared to both the Celko and Ben-Gan methods.
So yes, the pure T-SQL methods do appear to be slightly more efficient. However, for the record I did not say that compression was cheap. In fact, I did say that it would be slightly less efficient to do that but that it might prove to be a worthwhile trade-off depending on the circumstances.
My testing is as follows (the first two queries are copied directly from your blog that you noted above):
------------------------------------------------------------------------
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT
CustomerId,
AVG(TotalDue)
FROM
(
SELECT
CustomerId,
TotalDue,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
FROM Sales.SalesOrderHeader SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
-- logical reads 686
-- CPU time = 47 ms, elapsed time = 1087 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
-- CPU time = 62 ms, elapsed time = 235 ms. (when cached)
------------------------------------------------------------------------
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT
CustomerId,
AVG(TotalDue)
FROM
(
SELECT
CustomerId,
TotalDue,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue) AS RowNum,
COUNT(*) OVER (
PARTITION BY CustomerId) AS RowCnt
FROM Sales.SalesOrderHeader
) x
WHERE
RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)
GROUP BY CustomerId
ORDER BY CustomerId;
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
-- logical reads 686 + 139407 for Worktable = 140093 total
-- CPU time = 344 ms, elapsed time = 1085 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
-- CPU time = 328 ms, elapsed time = 374 ms. (when cached)
------------------------------------------------------------------------
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT CustomerId,
SQL#.SQL#.Agg_Median(TotalDue)
FROM Sales.SalesOrderHeader SOH
GROUP BY CustomerId
ORDER BY CustomerId;
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
-- logical reads 686
-- CPU time = 1812 ms, elapsed time = 2707 ms. (when DROPCLEANBUFFERS and FREEPROCCACHE are run)
-- CPU time = 1906 ms, elapsed time = 1948 ms. (when cached)
------------------------------------------------------------------------
I don't think the CELKO method works correctly in all circumstances. The problem occurs when there are duplicates and the ASC and DESC ROW_NUMBERs don't match up. The data below shows the problem: the result should be 15 (the average of the two middle values, (10 + 20) / 2 = 15); the Ben-Gan method is correct here.
INSERT INTO SalesOrderHeader(CustomerId, TotalDue, SalesOrderId)
SELECT 100,10,10 UNION ALL
SELECT 100,10,10 UNION ALL
SELECT 100,20,10 UNION ALL
SELECT 100,30,10
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
January 4, 2011 at 9:55 am
Mark-101232 (1/4/2011)
I don't think the CELKO method works correctly in all circumstances. The problem occurs when there are duplicates and the ASC and DESC ROW_NUMBERs don't match up. The data below shows the problem: the result should be 15; the Ben-Gan method is correct here.
INSERT INTO SalesOrderHeader(CustomerId, TotalDue, SalesOrderId)
SELECT 100,10,10 UNION ALL
SELECT 100,10,10 UNION ALL
SELECT 100,20,10 UNION ALL
SELECT 100,30,10
Great catch, Mark!! I tested again using your data and my code (posted above), this time loading just this data into a new table in a new DB:
USE [Test]
GO
CREATE TABLE dbo.SalesOrderHeader (CustomerId INT, TotalDue INT, SalesOrderId INT)
INSERT INTO dbo.SalesOrderHeader(CustomerId, TotalDue, SalesOrderId)
SELECT 100,10,10 UNION ALL
SELECT 100,10,10 UNION ALL
SELECT 100,20,10 UNION ALL
SELECT 100,30,10
And then of course change the "Sales" Schema references to "dbo" in the previously noted sample code.
It should be noted that the CLR (via SQL#) Median Aggregate (UDA) also worked correctly. And this is one reason why I definitely like CLR for things like this: it encapsulates complex logic. The harder code is to read, the more likely it is to get messed up by someone, either at the beginning or when making changes. The CLR Aggregate gives a very simple and accurate way of doing an otherwise complex task and leaves the SQL developer to worry about the rest of the query without the complicating factor of whatever was needed to get the Median working. So while it is slightly less efficient, I do see definite benefits to it :-).
SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
January 4, 2011 at 10:05 am
Mark-101232 (1/4/2011)
I don't think the CELKO method works correctly in all circumstances, the problem occurs when there are duplicates and the ASC and DESC ROW_NUMBERs don't match up.
Garbage in, garbage out. If you pass in nonsensical data that lacks a key, you'll get back a nonsense answer. Break the ties and everything works fine.
--
Adam Machanic
whoisactive
January 4, 2011 at 10:26 am
Adam Machanic (1/4/2011)
Garbage in, garbage out. If you pass in nonsensical data that lacks a key, you'll get back a nonsense answer. Break the ties and everything works fine.
D'oh! 😉 I just retested with unique SalesOrderId values and the Celko method did work correctly.
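For reference, the retest data was something along these lines (Mark's values, just with the SalesOrderId values made unique):
TRUNCATE TABLE dbo.SalesOrderHeader;
INSERT INTO dbo.SalesOrderHeader(CustomerId, TotalDue, SalesOrderId)
SELECT 100,10,10 UNION ALL
SELECT 100,10,11 UNION ALL
SELECT 100,20,12 UNION ALL
SELECT 100,30,13;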
I still do like the simplicity of the UDA, though.
SQL# — https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR
June 28, 2013 at 12:16 pm
I found your code useful.
http://databasebuddha.blogspot.com/2013/06/better-linear-regression-forecasting.html
February 5, 2014 at 7:07 am
The code examples are not formatted correctly in some cases, at least in Chrome. Those need to be tweaked in the post.
I would gently and humbly offer that the CLR provides somewhat better testability. SQL is notoriously hard to write unit tests for. However, I may be completely off base.
February 5, 2014 at 7:53 am
From my point of view, the more SQL aggregates, the better. Most of the SQL computations I do involve things like Rolling Averages, Rolling Standard Deviations, Rolling Minimums, and Rolling Maximums.
Part of the reason we are going from SQL 2005 to SQL 2012 is the ability to compute Rolling Percentiles.
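For anyone already on 2012, a minimal sketch of what those rolling versions look like with window frames (dbo.TagReadings and its columns are made-up names; the 20-row window is just an example, and STDEV works the same way):
SELECT TagId,
       ReadingTime,
       ReadingValue,
       AVG(ReadingValue) OVER (PARTITION BY TagId ORDER BY ReadingTime
                               ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) AS RollingAvg20,
       MIN(ReadingValue) OVER (PARTITION BY TagId ORDER BY ReadingTime
                               ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) AS RollingMin20,
       MAX(ReadingValue) OVER (PARTITION BY TagId ORDER BY ReadingTime
                               ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) AS RollingMax20
FROM dbo.TagReadings;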
February 5, 2014 at 7:57 am
For those of you who are interested in this topic, here's a link to a presentation I did at last year's TechEd show:
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2013/DBI-B404
This presentation discusses a specific CLR aggregate-based technique that -- at least in my humble opinion -- completely opens the door with regard to functionality.
Enjoy 🙂
--
Adam Machanic
whoisactive
February 5, 2014 at 8:24 am
Here is a problem that SQL Aggregates are good at solving:
I work at a Waste Water Treatment plant and have 30,000 tags/data points that I am monitoring. These tags tell me things like how fast the sludge is flowing, how fast the pumps are moving, tank levels, etc.
The problem I have is that when a tag gets dirty, it can send me the same value over and over again.
The solution is to get an alert when the Rolling Standard Deviation approaches zero.
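On 2012 that alert can be sketched with a windowed STDEV (again using the hypothetical dbo.TagReadings table from the earlier sketch; the 20-row window and the near-zero threshold are just examples):
WITH Rolling AS
(
    SELECT TagId,
           ReadingTime,
           STDEV(ReadingValue) OVER (PARTITION BY TagId ORDER BY ReadingTime
                                     ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) AS RollingStDev
    FROM dbo.TagReadings
)
SELECT TagId, ReadingTime, RollingStDev
FROM Rolling
WHERE RollingStDev < 0.0001; -- effectively flat: the tag is likely repeating the same value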
February 5, 2014 at 6:22 pm
Hi there,
I'm using SQL Server 2012 x64 and VS 2013.
I followed your code, and:
The T-SQL returns -0.99988192786007
The CLR returns 0.999766931939805 (a positive number instead of a negative one)
Why are they not the same?
Thanks,
Oded Dror