May 21, 2009 at 6:53 am
Florian Reischl (5/20/2009)
@J-F:I stopped your function at two minutes.
Greets
Flo
Ouch... And I thought it would perform better... Do you have the exec plan Flo, just for reference.
Thanks,
Cheers,
J-F
May 21, 2009 at 7:20 am
First of all, thanks every one for your help so far. Its clear that I need to learn a few more things about performance of functions.
Let me post some findings after using the SET STATISTICS. I simplified the problem to a small query calling a ID_field and phone from a table that is 170mln records (and yes, we receive data extracts at 6-8 week intervals requiring me to redo all the validations on data such as phone numbers etc.).
See queries at bottom of message
Time stats:
Query 1: CPU time = 655 ms, elapsed time = 406 ms.
Query 2 (Scaler): CPU time = 11562 ms, elapsed time = 22861 ms.
Query 3 (CLR, C#): CPU time = 5281 ms, elapsed time = 7296 ms.
IO stats:
Query 1: Scan count 1, logical reads 0, physical reads 0, read-ahead reads 23786, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Query 2: Scan count 1, logical reads 16824, physical reads 0, read-ahead reads 21784, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Query 3: Scan count 1, logical reads 0, physical reads 0, read-ahead reads 22378, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Discussion:
The data indicates that functions carry a substantial performance hit. The CLR appears most efficient but it will still take ~18x longer processing time. In the literature, I found that the processing time scale linearly with the number of records. My queries without any functions routinely take about 20 hours, this wold indicate that adding a function would extent the processing time from 20 to 360 hours (that is 15 days)! Any thoughts about that assumption??
One more tidbit, when invoking a scalar function and just pass the input to the output, the CPU processing time increased the query ~12x (data not shown), that was quite an eye opener.
I have not tried the tables-valued function that was presented because I need to learn about the m first, just never looked into it and will do now.
Thanks for your help guys, looking forward to your comments.
Queries
--SET STATISTICS IO on
SET STATISTICS time on
--Query 1; baseline (no function)
SELECT TOP 100000
mita_orig_id,
isnull(PHONE,'') as new_phone
FROM dbo.input_table
WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0
dbcc freeproccache
dbcc dropcleanbuffers
-- Query 2; Scalar-valued function
SELECT TOP 100000
mita_orig_id,
PDW_DL_DATA.DBO.VALID_PHONE(isnull(PHONE,'')) as new_phone
FROM dbo.input_table
WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0
dbcc freeproccache
dbcc dropcleanbuffers
-- Query 3; CLR function (C#)
SELECT TOP 100000
mita_orig_id,
PDW_DL_DATA.DBO.UDF_VALID_PHONE(isnull(PHONE,'')) as new_phone
FROM dbo.input_table
WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0
SET STATISTICS IO off
SET STATISTICS time off
May 21, 2009 at 8:05 am
J-F Bergeron (5/21/2009)
Florian Reischl (5/20/2009)
@J-F:I stopped your function at two minutes.
Greets
Flo
Ouch... And I thought it would perform better... Do you have the exec plan Flo, just for reference.
Thanks,
I just tried again on my home system which performs much better than my box at work. Now the inline function has almost same duration as the scalar function but the CPU usage is much more.
Here my duration results:
---====== SQL scalar ===================
SQL Server Execution Times:
CPU time = 19890 ms, elapsed time = 21106 ms.
---====== SQL inline ===================
SQL Server Execution Times:
CPU time = 65084 ms, elapsed time = 21283 ms.
---====== CLR scalar ===================
SQL Server Execution Times:
CPU time = 16926 ms, elapsed time = 5782 ms.
The execution plans are attached.
I will post the CLR function and the test environment in a separate post.
May 21, 2009 at 8:15 am
Since you didn't answer my question for C# or VB.Net I use C# because this is my preference 😀
Ensure that you use "Release" compiled assembly. The "Debug" build is about 30% slower.
The CLR function:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
private static readonly Regex _phoneRegEx = new Regex("[(]|[)]|[-]|[ ]", RegexOptions.Compiled);
[Microsoft.SqlServer.Server.SqlFunction()]
public static SqlString ufn_clr_ValidPhone(
[SqlFacet(MaxSize = 4000)]
SqlString strIn
)
{
if (strIn.IsNull)
return new SqlString("");
string str = _phoneRegEx.Replace(strIn.Value, "");
string item;
if (str.Length == 11)
{
if (str[0] == '1')
str = str.Substring(1);
else
return new SqlString("");
}
if (str.Length != 10)
return new SqlString("");
if (str[0] == '0' || str[0] == '1')
return new SqlString("");
if (str[1] == '9')
return new SqlString("");
if (str.Substring(1, 2) == "11")
return new SqlString("");
if (str[3] == '0' || str[3] == '1')
return new SqlString("");
if (str.Substring(4, 2) == "11")
return new SqlString("");
item = str.Substring(0, 3);
if (item == "456" || item == "500" || item == "555" || item == "600" || item == "700" || item == "710" || item == "800" || item == "822" || item == "833" || item == "844" || item == "855" || item == "866" || item == "877" || item == "888" || item == "898" || item == "900" || item == "976" || item == "999")
return new SqlString("");
item = str.Substring(3);
if (item == "2222222" || item == "3333333" || item == "4444444" || item == "5555555" || item == "6666666" || item == "7777777" || item == "8888888" || item == "9999999")
return new SqlString("");
return new SqlString(str);
}
};
I currently use a RegEx just to remove the special characters. I will try to cover the complete function within a reg-ex. I'm no great RegEx guy but I'll try. Are there any Perl pros? :laugh:
To reduce the query execution time to the real phone validation I moved the functions to the WHERE clause.
[font="Courier New"]SET NOCOUNT ON
GO
---======================================
-- -> Create some test data for phone number validation
--DROP TABLE PhoneNumbers
IF (OBJECT_ID('dbo.PhoneNumbers') IS NULL)
BEGIN
CREATE TABLE dbo.PhoneNumbers
(
Id INT NOT NULL IDENTITY,
Phone VARCHAR(1000),
PRIMARY KEY CLUSTERED
(Id)
)
--TRUNCATE TABLE PhoneNumbers
INSERT INTO PhoneNumbers
SELECT TOP(1000000)
CASE WHEN ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 10 = 0 THEN '0' ELSE '1' END +
'(' + LEFT(CONVERT(VARCHAR(20), ABS(CHECKSUM(NEWID()))), 3) + ') ' +
LEFT(CONVERT(VARCHAR(20), ABS(CHECKSUM(NEWID()))), 3) + '-' +
LEFT(CONVERT(VARCHAR(20), ABS(CHECKSUM(NEWID()))), 4)
FROM MASTER.sys.all_columns c1
CROSS JOIN MASTER.sys.all_columns c2
ALTER INDEX ALL ON PhoneNumbers REBUILD
END
-- <- Create some test data for phone number validation
---======================================
GO
DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS
GO
---====== SQL scalar ===================
PRINT '---====== SQL scalar ==================='
DECLARE @count INT
SET STATISTICS TIME ON
SELECT
@count = COUNT(*)
FROM PhoneNumbers
WHERE
dbo.Valid_Phone(Phone) != ''
SET STATISTICS TIME OFF
PRINT ''
PRINT ''
GO
DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS
GO
---====== SQL inline ===================
PRINT '---====== SQL inline ==================='
DECLARE @count INT
SET STATISTICS TIME ON
SELECT
@count = COUNT(*)
FROM PhoneNumbers
CROSS APPLY dbo.Valid_Phone_Inline(Phone) l
WHERE
l.OutputPhone != ''
SET STATISTICS TIME OFF
PRINT ''
PRINT ''
GO
DBCC FREEPROCCACHE WITH NO_INFOMSGS
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS
GO
---====== CLR scalar ===================
PRINT '---====== CLR scalar ==================='
DECLARE @count INT
SET STATISTICS TIME ON
SELECT
@count = COUNT(*)
FROM PhoneNumbers
WHERE
dbo.ufn_clr_ValidPhone(Phone) != ''
SET STATISTICS TIME OFF
PRINT ''
PRINT ''
[/font]
@J-F: Thanks a lot for the phone number samples since I had no idea about US phone numbers!
Greets
Flo
May 21, 2009 at 8:22 am
Hi Pieter
Pieter (5/21/2009)
First of all, thanks every one for your help so far. Its clear that I need to learn a few more things about performance of functions.
Always welcome. Those kind of threads are the most interesting in my opinion 🙂
Time stats:
Query 1: CPU time = 655 ms, elapsed time = 406 ms.
Query 2 (Scaler): CPU time = 11562 ms, elapsed time = 22861 ms.
Query 3 (CLR, C#): CPU time = 5281 ms, elapsed time = 7296 ms.
Could you please try my CLR function?
Queries
--SET STATISTICS IO on
SET STATISTICS time on
--Query 1; baseline (no function)
SELECT TOP 100000
mita_orig_id,
isnull(PHONE,'') as new_phone
FROM dbo.input_table
WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0
dbcc freeproccache
dbcc dropcleanbuffers
-- Query 2; Scalar-valued function
SELECT TOP 100000
mita_orig_id,
PDW_DL_DATA.DBO.VALID_PHONE(isnull(PHONE,'')) as new_phone
FROM dbo.input_table
WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0
dbcc freeproccache
dbcc dropcleanbuffers
-- Query 3; CLR function (C#)
SELECT TOP 100000
mita_orig_id,
PDW_DL_DATA.DBO.UDF_VALID_PHONE(isnull(PHONE,'')) as new_phone
FROM dbo.input_table
WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0
SET STATISTICS IO off
SET STATISTICS time off
Maybe there are some further issues with your statement. Just to be sure could you please try to replace your "LEN(x) 0" with "x ''" statements and try to call the functions without the ISNULL()?
Thanks
Flo
May 21, 2009 at 8:35 am
For some info on scalar functions, including the difficulting in seeing how they perform
http://sqlinthewild.co.za/index.php/2009/04/29/functions-io-statistics-and-the-execution-plan/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
May 21, 2009 at 8:53 am
GilaMonster (5/21/2009)
For some info on scalar functions, including the difficulting in seeing how they performhttp://sqlinthewild.co.za/index.php/2009/04/29/functions-io-statistics-and-the-execution-plan/
Gail, generally you are correct. But there is a huge difference between Pieters scalar function and the one you use in your blog. It does not query any data.
Maybe the performance could be optimized with a direct CTE which removes the special characters and a does the SUBSTRING checks. I did not yet try this.
I just did not yet get an answer why Pieter doesn't add a new column to his table which contains the already formatted phone numbers.
Greets
Flo
May 21, 2009 at 8:58 am
Florian Reischl (5/21/2009)
Gail, generally you are correct. But there is a huge difference between Pieters scalar function and the one you use in your blog. It does not query any data.
True but, as far as I know, it will still be getting called on each execution (use profiler to check) and there's still overhead on each call. No where near as significant as when there's data queried, but still there.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
May 21, 2009 at 9:04 am
GilaMonster (5/21/2009)
Florian Reischl (5/21/2009)
Gail, generally you are correct. But there is a huge difference between Pieters scalar function and the one you use in your blog. It does not query any data.True but, as far as I know, it will still be getting called on each execution (use profiler to check) and there's still overhead on each call. No where near as significant as when there's data queried, but still there.
Completely agreed.
May 21, 2009 at 9:05 am
I've tried using scalar functions that don't access data and Gail is correct that most of the time they perform poorly.
I've tried to break this query down for the mil rows supplied further up as test data.
I'm trying to split out all the substring lefts etc and do comparisons at the end on a CTE and it performs very badly at the moment but still working on it 🙂
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
May 21, 2009 at 9:11 am
IS this query only supposed to return the correct phone numbers?
If this is case I think I will work on a short circuit case statement in the where clause to do that might make it faster than having to do a function ...
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices[/url]
Numbers / Tally Tables[/url]
May 21, 2009 at 11:51 am
J-F,
Although I haven't worked with table-value functions, I am not sure that they work for my scenario. The validated phone number needs to be linked up to the original data, therefore at minimum I need a result giving me ID_field and validated_phone.
May 21, 2009 at 12:21 pm
Florian,
thanks for submitting the CLR (C# is fine) I compared it against mine and a baseline query without the function and found the following, see below. Looks like we use similar CPU times and you have more read-ahead reads. I am still trying to figure what that means.
I will however have to continue this tomorrow, but I look forward to the comments.
Thanks you all for your patience.
Pieter
-- Baseline query (no functions)
CPU time = 388 ms, elapsed time = 487 ms.
-- Piet's CLR
CPU time = 5341 ms, elapsed time = 6771 ms.
-- Florian Reischl CLR
CPU time = 4978 ms, elapsed time = 6887 ms.
-- Baseline query (no functions)
Scan count 1,
logical reads 0,
physical reads 0,
read-ahead reads 1879,
lob logical reads 0,
lob physical reads 0,
lob read-ahead reads 0.
-- Piet's CLR
Scan count 1,
logical reads 0,
physical reads 0,
read-ahead reads 3090,
lob logical reads 0,
lob physical reads 0,
lob read-ahead reads 0.
-- Florian Reischl CLR
Scan count 1,
logical reads 0,
physical reads 0,
read-ahead reads 22506,
lob logical reads 0,
lob physical reads 0,
lob read-ahead reads 0.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
// This function will validate phone numbers and correct minor problems.
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlString UDF_VALID_PHONE(String pInput)
{
String vInput; //Place holder for the input
String vOutput; //Place holder for the output
String AreaCode;
String PhoneNumber;
// Assign input variable
vInput = pInput;
// Remove punctuations
vInput = vInput.Replace("(", "");
vInput = vInput.Replace(")", "");
vInput = vInput.Replace("-", "");
vInput = vInput.Replace(".", "");
vInput = vInput.Replace(" ", "");
//Validate the input string, remove 1 for long distance
if (vInput.Length == 11)
{
// If input length is 11 and start with 1 then remove the first byte
vInput = vInput.Substring(1, 10);
}
// further evaluate the string. If its too long or too short make it blank
if (vInput.Length > 11)
{
vInput = "";
}
else if (vInput.Length < 10)
{
vInput = "";
}
//Validate the improved string that is 10 bytes
else if ((vInput.Length == 10) && (Regex.IsMatch(vInput, @"[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]") == false))
{
// if input length = 10 but not numeric, discard the input
vInput = "";
}
////////////////////////////////////////////////////////////////////////////////////////
// validate whether the new string has valid area code (first 3 bytes)
else if (vInput.Length == 10)
{
// validate whether the new string has valid area code (first 3 bytes)
AreaCode = vInput.Substring(0, 3);
//Validate secod part of phone number
PhoneNumber = vInput.Substring(3, 7);
if ((AreaCode == "456")
|| (AreaCode == "500")
|| (AreaCode == "555")
|| (AreaCode == "600")
|| (AreaCode == "700")
|| (AreaCode == "710")
|| (AreaCode == "800")
|| (AreaCode == "822")
|| (AreaCode == "833")
|| (AreaCode == "844")
|| (AreaCode == "855")
|| (AreaCode == "866")
|| (AreaCode == "877")
|| (AreaCode == "888")
|| (AreaCode == "898")
|| (AreaCode == "900")
|| (AreaCode == "976")
|| (AreaCode == "999")
|| (PhoneNumber == "2222222")
|| (PhoneNumber == "3333333")
|| (PhoneNumber == "4444444")
|| (PhoneNumber == "5555555")
|| (PhoneNumber == "6666666")
|| (PhoneNumber == "7777777")
|| (PhoneNumber == "8888888")
//|| (PhoneNumber == "9999999")
|| (vInput.Substring(0, 1) == "0") // check first byte
|| (vInput.Substring(0, 1) == "1") // check first byte
|| (vInput.Substring(1, 1) == "9") // check second byte
|| (vInput.Substring(1, 2) == "11") // check first byte
|| (vInput.Substring(3, 1) == "0") // check third byte
|| (vInput.Substring(3, 1) == "1") // check first byte
|| (vInput.Substring(4, 2) == "11") // check first byte
)
{
vInput = "";
}
}
//Assign modified vInput to vOutput
vOutput = vInput;
// Put your code here
return vOutput ;
}
};
May 21, 2009 at 12:48 pm
Hi Piet
I just figured out. The problem of my (and your) CLR was the RegEx usage. It's much slower than your simple string.Replace. I tried your function and it was much faster than mine. After I removed the RegEx my function seems to be 3 times faster than yours. This depends on your RegEx.Match.
Replace the RegEx.Replace line of my code with this:
//string str = _phoneRegEx.Replace(strIn.Value, "");
string str = strIn.Value;
str = str.Replace("(", "");
str = str.Replace(")", "");
str = str.Replace("-", "");
str = str.Replace(" ", "");
I forgot the check if everything is numeric. Place the following code after the check for "str.Length != 10":
if (str.Length != 10)
return new SqlString("");
for (int i = 0; i < str.Length; i++)
{
if (str < '0' || str > '9')
return new SqlString("");
}
... I tried a long.Parse but it seems to be slower. Don't use a foreach enumerator because this is also slower than a index based for-loop.
An additional think I figured out in some other SQL CLR tests:
Use the SqlString instead of the Syste.String. SQL Server seems to perform much better with this.
Greets
Flo
May 21, 2009 at 1:39 pm
Pieter (5/21/2009)
J-F,Although I haven't worked with table-value functions, I am not sure that they work for my scenario. The validated phone number needs to be linked up to the original data, therefore at minimum I need a result giving me ID_field and validated_phone.
Pieter, the reason you are not able to link it to your data, is probably because you need to use the "Outer Apply" function, to get your new phone results in your query. You can also use the "Cross Apply", but in a simple select, you don't need to ensure the data is correct, just get it, so Outer apply will work perfectly.
You can see Outer Apply as a left join, and Cross Apply as an Inner Join, for your information.
Now, I think Florian has tested my function, and it does not seem to perform well, sometimes, Inline functions are more effective then Scalar, but since there is no data-access on your function, that might not be the case.
Anyway, you can still test it, you can do something like this:
select Phone from PhoneNumbers outer apply dbo.[Valid_Phone_Inline](Phone)
Thanks,
Cheers,
J-F
Viewing 15 posts - 16 through 30 (of 41 total)
You must be logged in to reply to this topic. Login to reply