November 14, 2007 at 6:46 am
Matt Miller (11/14/2007)
Giants? Nah - I aint that tall....More like - boys with their toys:)
You said it. 😉
What I'll do is grab one of the string functions that I have in my main database, rewrite it in as efficient T-SQL as I can manage, then hack up a C# version, and see how they compare. Doubt it'll be before friday though. Considering that my sQL is way better than my C#, it won't be really fair
Oh, I will add in an additional challenge if you two feel like it.
Take data of the following form, with a lot, lot more data (say 100 000 rows and 250 categories)
CREATE TABLE #Test
Category INT,
Code VARCHAR(5)
)
INSERT INTO #Test VALUES (1,'a')
INSERT INTO #Test VALUES (1,'t')
INSERT INTO #Test VALUES (1,'e')
INSERT INTO #Test VALUES (1,'s')
INSERT INTO #Test VALUES (1,'u')
INSERT INTO #Test VALUES (2,'u')
INSERT INTO #Test VALUES (2,'e')
INSERT INTO #Test VALUES (2,'n')
INSERT INTO #Test VALUES (2,'a')
INSERT INTO #Test VALUES (3,'i')
INSERT INTO #Test VALUES (3,'z')
and output data of the form. (order of the characters isn't important, since there's nothing to order by)
1, 'a,t,e,s,u'
2, 'u,e,n,a'
3, 'i,z'
My gut feel is that a user-defined aggregate will out perform any T-SQL by quite a way.
Or if Jeff wants to produce the T-SQL, I can give the C# a try, but it won't be for a couple days.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 14, 2007 at 8:59 am
All right - here's the CLR solution for the Comedy-Separated Value field challenge.
First - the CLR function:
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports System.Runtime.InteropServices
Partial Public Class UserDefinedFunctions
Private Class SplitInfo
Private _val As Object
Private _pos As Integer
Property value() As Object
Get
Return _val
End Get
Set(ByVal value As Object)
_val = value
End Set
End Property
Property position() As Integer
Get
Return _pos
End Get
Set(ByVal value As Integer)
_pos = value
End Set
End Property
End Class
<Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName:="FillRowSplitExp", IsDeterministic:=True, IsPrecise:=True, TableDefinition:="val nvarchar(max),pos integer")> _
Public Shared Function RegexSplitExp(ByVal input As SqlString, ByVal pattern As SqlString) As IEnumerable
Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern.Value)
Dim c As New Collection
Dim s As SplitInfo
Dim i As Integer
Dim T As String()
Dim m As System.Text.RegularExpressions.MatchCollection
If input.IsNull Then
s = New SplitInfo
s.value = System.DBNull.Value
s.position = -1
c.Add(s)
Else
If rex.IsMatch(input.Value) = False Then
s = New SplitInfo
s.position = 0
s.value = ""
c.Add(s)
Else
T = rex.Split(input.Value)
m = rex.Matches(input.Value)
For i = 0 To T.GetUpperBound(0)
s = New SplitInfo
s.value = T(i).ToString
If i = 0 Then
s.position = 0
Else
s.position = m(i - 1).Index + 1
End If
'If s.value.Equals("") Then
' s.value = " "
'End If
c.Add(s)
Next
End If
End If
Return c
End Function
Public Shared Sub FillRowSplitExp(ByVal obj As Object, ByRef pos As Integer)
Dim s As SplitInfo
s = CType(obj, SplitInfo)
Val = s.value.ToString
pos = s.position
End Sub
End Class
the SQL code is next
use testing
go
dbcc freeproccache
dbcc dropcleanbuffers
--set things up
declare @StartTime datetime
IF OBJECT_ID('TempDB..#mattcomedyTest','U') IS NOT NULL
drop table #mattcomedyTest
PRINT REPLICATE('=',78)
PRINT SPACE(12)+'Matt''s CLR Solution for Indexed Parsing.'
PRINT REPLICATE('-',78)
--the clock is ticking...
set @StartTime=getdate()
--build the table
select someid,
case when s.pos=-1 then null else s.val end as val, --handle the NULL values
s.pos+1 as idx
into #mattcomedyTest
from jbmtest
cross apply dbo.regexsplitExp(somecsv,',') s
--the unique index - not clustered since you're not doing clustered either :)
CREATE UNIQUE INDEX IX_mattcomedyTest_SomeID_Idx
ON #mattcomedyTest (SomeID,[Idx])
--===== Display the duration
PRINT CONVERT(CHAR(12),GETDATE()-@StartTime,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--The 100
select top 100 * from #mattcomedyTest order by SomeID,Idx
As expected - the T-SQL wins thanks to the CROSS Apply in the CLR version.
results:
=============================================================================
Matt's CLR Solution for Indexed Parsing.
------------------------------------------------------------------------------
(999956 row(s) affected)
(1 row(s) affected)
00:00:13:383 Duration (hh:mi:ss:mmm)
=============================================================================
vs
=============================================================================
Jeff Moden's Tally Table Solution for Indexed Parsing.
------------------------------------------------------------------------------
(999956 row(s) affected)
00:00:08:927 Duration (hh:mi:ss:mmm)
=============================================================================
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 14, 2007 at 9:29 am
Phone Test phase 1 - CLR version....
the SQL
use testing
go
dbcc freeproccache
dbcc dropcleanbuffers
--set things up
declare @StartTime datetime
IF OBJECT_ID('TempDB..#mattPhoneExtTest','U') IS NOT NULL
drop table #mattPhoneExtTest
PRINT REPLICATE('=',78)
PRINT SPACE(12)+'Matt''s CLR Solution for Extracting Phone Numbers'
PRINT REPLICATE('-',78)
--the clock is ticking...
set @StartTime=getdate()
--create the table
select
rid,
case when r.pos=-1 then null else r.val end as val,
r.pos+1 as idx
into #mattPhoneExtTest
from MattTestText
cross apply dbo.regexmatches(doc,'\([0-9]{3}\)[ ]{0,1}[0-9]{3}\-[0-9]{4} ') r
--the unique index - not clustered since you're not doing clustered either :)
CREATE UNIQUE INDEX IX_mattPhoneExtTestt_RID_Idx
ON #mattPhoneExtTest (rID,[Idx])
--===== Display the duration
PRINT CONVERT(CHAR(12),GETDATE()-@StartTime,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--The 100
select top 100 * from #mattPhoneExtTest order by rID,Idx
The CLR function
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports System.Runtime.InteropServices
Partial Public Class UserDefinedFunctions
Private Class MatchInfo
private _rank as integer
Private _val As Object
Private _pos As Integer
Private _len As Integer
Public Property rank() As Integer
Get
Return _rank
End Get
Set(ByVal value As Integer)
_rank = value
End Set
End Property
Public Property value() As Object
Get
Return _val
End Get
Set(ByVal val As Object)
_val = val
End Set
End Property
Public Property position() As Integer
Get
Return _pos
End Get
Set(ByVal value As Integer)
_pos = value
End Set
End Property
Public Property length() As Integer
Get
Return _len
End Get
Set(ByVal value As Integer)
_len = value
End Set
End Property
End Class
<Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName:="FillRowMatches", IsDeterministic:=True, IsPrecise:=True, TableDefinition:="Occurrence int,val nvarchar(100),POS int,length int")> _
Public Shared Function RegexMatches(ByVal input As SqlChars, ByVal pattern As SqlString) As IEnumerable
Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern.Value, optionS)
Dim t As System.Text.RegularExpressions.MatchCollection
Dim matchlist As New Collection
Dim m_info As MatchInfo
Dim i As Integer
If input.IsNull Then
m_info = New MatchInfo
m_info.value = System.DBNull.Value
m_info.position = -1
m_info.length = -1
matchlist.Add(m_info)
Else
If rex.IsMatch(input.Value) = False Then
m_info = New MatchInfo
m_info.position = 0
m_info.length = 0
m_info.value = ""
matchlist.Add(m_info)
Else
t = rex.Matches(input.Value)
For i = 0 To t.Count - 1
m_info = New MatchInfo
m_info.rank = i
m_info.value = t(i).Value
m_info.position = t(i).Index
m_info.length = t(i).Length
matchlist.Add(m_info)
Next
End If
End If
Return matchlist
rex = Nothing
matchlist = Nothing
End Function
Public Shared Sub FillRowMatches(ByVal obj As Object, <Out()> ByRef Occurrence As SqlInt32, <Out()> ByRef Val As SqlString, <Out()> ByRef Pos As SqlInt32, <Out()> ByRef Length As SqlInt32)
Val = CType(CType(obj, MatchInfo).value, SqlString)
Pos = CType(CType(obj, MatchInfo).position, SqlInt16)
Length = CType(CType(obj, MatchInfo).Length, SqlInt16)
Occurrence = CType(CType(obj, MatchInfo).rank, SqlInt16)
End Sub
End Class
The results
==============================================================================
Matt's CLR Solution for Extracting Phone Numbers
------------------------------------------------------------------------------
(4040205 row(s) affected)
00:02:02:923 Duration (hh:mi:ss:mmm)
==============================================================================
UPDATE - this time with NO other disk activity slowing things down
==============================================================================
Matt's CLR Solution for Extracting Phone Numbers
------------------------------------------------------------------------------
(4040205 row(s) affected)
00:01:46:083 Duration (hh:mi:ss:mmm)
==============================================================================
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 14, 2007 at 1:12 pm
Phone test - Phase 2 (redaction)
The CLR code is:
Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports Microsoft.SqlServer.Server
Imports System.Text.RegularExpressions
Partial Public Class UserDefinedFunctions
_
Public Shared Function RegexReplace(ByVal input As SqlChars, ByVal pattern As SqlString, ByVal rep As SqlString) As SqlString
' Add your code here
Dim rex As System.Text.RegularExpressions.Regex = New System.Text.RegularExpressions.Regex(pattern.Value)
If input.IsNull Then
Return New SqlString(SqlString.Null.Value)
Else
Return New SqlString(rex.Replace(New String(input.Value), rep.Value))
End If
End Function
End Class
Here's the SQL
use testing
go
dbcc freeproccache
dbcc dropcleanbuffers
--clean out the column
update mattTestText
set docred=null
go
dbcc freeproccache
dbcc dropcleanbuffers
go
--set things up
declare @StartTime datetime
PRINT REPLICATE('=',78)
PRINT SPACE(12)+'Matt''s CLR Solution for Phone Redaction'
PRINT REPLICATE('-',78)
--the clock is ticking...
set @StartTime=getdate()
update matttesttext
set docred=dbo.regexreplace(doc,'\([0-9]{3}\)(? [0-9]{4}) ','(XXX)${1}YYY-${2} ')
from matttesttext
--===== Display the duration
PRINT CONVERT(CHAR(12),GETDATE()-@StartTime,114) + ' Duration (hh:mi:ss:mmm)'
PRINT REPLICATE('=',78)
--The 100
select top 100 rid,doc,docred, len(doc),len(docred) from matttesttext
No cross apply this time, so time flies when you're having fun...
Results -
==============================================================================
Matt's CLR Solution for Phone Redaction
------------------------------------------------------------------------------
(500000 row(s) affected) -- also known as 4040205 phone numbers redacted
00:01:04:063 Duration (hh:mi:ss:mmm)
==============================================================================
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 14, 2007 at 5:55 pm
Shaun McGuile (11/14/2007)
Good idea Gail, I'm not gonna get in the way of battling giants!:D
Matt said it right... "boys with their toys"... just having a bit of fun... c'mon in... water's fine here...
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2007 at 6:00 pm
GilaMonster (11/14/2007)
Or if Jeff wants to produce the T-SQL, I can give the C# a try, but it won't be for a couple days.
Heck ya... great idea for a test, Gail...
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2007 at 6:18 pm
--the unique index - not clustered since you're not doing clustered either
Boy! Talk about a typo on my part :sick:
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2007 at 7:09 pm
GilaMonster (11/14/2007)
Matt Miller (11/14/2007)
Giants? Nah - I aint that tall....More like - boys with their toys:)
You said it. 😉
What I'll do is grab one of the string functions that I have in my main database, rewrite it in as efficient T-SQL as I can manage, then hack up a C# version, and see how they compare. Doubt it'll be before friday though. Considering that my sQL is way better than my C#, it won't be really fair
Oh, I will add in an additional challenge if you two feel like it.
Take data of the following form, with a lot, lot more data (say 100 000 rows and 250 categories)
CREATE TABLE #Test
Category INT,
Code VARCHAR(5)
)
INSERT INTO #Test VALUES (1,'a')
INSERT INTO #Test VALUES (1,'t')
INSERT INTO #Test VALUES (1,'e')
INSERT INTO #Test VALUES (1,'s')
INSERT INTO #Test VALUES (1,'u')
INSERT INTO #Test VALUES (2,'u')
INSERT INTO #Test VALUES (2,'e')
INSERT INTO #Test VALUES (2,'n')
INSERT INTO #Test VALUES (2,'a')
INSERT INTO #Test VALUES (3,'i')
INSERT INTO #Test VALUES (3,'z')
and output data of the form. (order of the characters isn't important, since there's nothing to order by)
1, 'a,t,e,s,u'
2, 'u,e,n,a'
3, 'i,z'
My gut feel is that a user-defined aggregate will out perform any T-SQL by quite a way.
Or if Jeff wants to produce the T-SQL, I can give the C# a try, but it won't be for a couple days.
There's a XML trick that might be worth throwing into the mix for this one. it's VERY slick, but I've been reading mixed reviews as to its speed on long strings. or rather, concatenation of LOTS of rows.
Of course - it's your test - you make up the rules 🙂
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 14, 2007 at 7:11 pm
Jeff Moden (11/14/2007)
--the unique index - not clustered since you're not doing clustered either
Boy! Talk about a typo on my part :sick:
It's alright - at least we're comparing apples to apples this way. It wouldn't have built it if it weren't unique, so it proved the point.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 14, 2007 at 7:47 pm
==============================================================================
Matt's CLR Solution for Extracting Phone Numbers
------------------------------------------------------------------------------
(4040205 row(s) affected)
00:01:46:083 Duration (hh:mi:ss:mmm)
==============================================================================
Ok... I might have just missed in in your code and because of the rowcount you have (I know it's random but I think it should be closer)... did you remember to NOT include the 3rd telephone number in a row?
And, I've been up for almost 40 hours... Haven't had much of a chance to work on the 2nd part of the phone number thing and would probably drool on myself a bit if I tried right now... I'll take a whack at it in 8 or 10 hours :hehe:
--Jeff Moden
Change is inevitable... Change for the better is not.
November 14, 2007 at 7:55 pm
Jeff Moden (11/14/2007)
==============================================================================
Matt's CLR Solution for Extracting Phone Numbers
------------------------------------------------------------------------------
(4040205 row(s) affected)
00:01:46:083 Duration (hh:mi:ss:mmm)
==============================================================================
Ok... I might have just missed in in your code and because of the rowcount you have (I know it's random but I think it should be closer)... did you remember to NOT include the 3rd telephone number in a row?
And, I've been up for almost 40 hours... Haven't had much of a chance to work on the 2nd part of the phone number thing and would probably drool on myself a bit if I tried right now... I'll take a whack at it in 8 or 10 hours :hehe:
:Whistling:
umh...sure...it's there - it's just the ranger eyes 🙂
Oh all right - I will rerun it!
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 14, 2007 at 11:01 pm
Matt Miller (11/14/2007)
There's a XML trick that might be worth throwing into the mix for this one. it's VERY slick, but I've been reading mixed reviews as to its speed on long strings. or rather, concatenation of LOTS of rows.Of course - it's your test - you make up the rules 🙂
I'd like to see that trick. At the moment the ways I know to do this require either a cursor or a UDA
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
November 15, 2007 at 5:47 am
Me too! Anyone know where there might be a posting for it? Or, maybe post it here so we can do some testing on it?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 15, 2007 at 8:23 am
New results with the CORRECT query for the phone extraction challenge. The "no 3rd" took a chunk out of it (as expected), although the runs just seem slower today anyway.
I actually ran it two different ways. One was with the restriction in the CLR, and the other in the T-SQL. The two versions:
CLR has the restriction: 00:02:29:497 Duration (hh:mi:ss:mmm)
SQL has the restriction: 00:02:36:077 Duration (hh:mi:ss:mmm)
which kind of makes sense to me. 500,000 rows fewer received from CLR, meaning less effort in the cross apply, and fewer rows to have to throw out.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 15, 2007 at 8:27 am
Gail - as to your test - I will try to dig out and use the XML data() trick, we'll let Jeff come up with the "pure" T-SQL trick, and you can handle the CLR?
Let us know what the test data looks like, and set up whatever other rules you wish. We can have a three-way cage match (unless someone else has another version they'd like).
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Viewing 15 posts - 31 through 45 (of 62 total)
You must be logged in to reply to this topic. Login to reply