July 16, 2010 at 1:17 pm
Jeff Moden (7/16/2010)
Julie Breutzmann (7/16/2010)
I've modified Lowell's function and performed a time-comparison test. I haven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used an input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 minutes using the function. However, the function DID find 69 matches that the simple comparison did not.
That's what I was afraid of. There is a way to do this same thing using an iTVF (inline Table Valued Function) and a CROSS APPLY that should speed things up by (possibly) an order of magnitude. The problem is that I won't be at a machine with SQL Server on it until Monday, and any code that I would write would be untested until then.
This is not a rush project. I'd appreciate anything you can come up with when you're able to get to it.
Julie
July 16, 2010 at 1:42 pm
I'll try to remember it on Monday, then, Julie. Thanks for your patience.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 19, 2010 at 8:37 am
Jeff Moden (7/15/2010)
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
Any examples? To "some" degree it depends on the developer's proficiency in C#...
But again, as per "Never say never", I should keep to "Never say always", you might be right...;-)
July 19, 2010 at 4:08 pm
Eugene Elutin (7/19/2010)
Jeff Moden (7/15/2010)
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
Any examples? To "some" degree it depends on the developer's proficiency in C#...
But again, as per "Never say never", I should keep to "Never say always", you might be right...;-)
Unfortunately, no. Matt Miller and I went round and round a long time ago and I actually beat some "Regex" CLRs, but I can't put my hands on those particular threads anymore. I'm not sure Matt can either, but I believe he'll vouch for my claims.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 19, 2010 at 5:35 pm
Jeff Moden (7/19/2010)
Eugene Elutin (7/19/2010)
Jeff Moden (7/15/2010)
Eugene Elutin (7/14/2010)
Julie Breutzmann (7/14/2010)
...What would be an efficient way to code this?
...
If you need high-performance, heavy string manipulation functions, write them as CLR functions in C#. A T-SQL UDF will always be slower...
Maybe in this case... maybe not. I've beaten certain CLR functions.
Any examples? To "some" degree it depends on the developer's proficiency in C#...
But again, as per "Never say never", I should keep to "Never say always", you might be right...;-)
Unfortunately, no. Matt Miller and I went round and round a long time ago and I actually beat some "Regex" CLRs, but I can't put my hands on those particular threads anymore. I'm not sure Matt can either, but I believe he'll vouch for my claims.
I believe you. However, Regex is not the fastest way to do things in C#: short and elegant, yes; fast, not really, the library is too heavy. The fastest string manipulation would be byte by byte. Not so elegant, a lot of manual coding, but very fast. Let's say this: it's not easy to beat a proper implementation in C#. It may be possible and I have heard such claims, but I have not seen a good example yet...
One more thing, just as an example. In C#, the Replace function is case sensitive. To do a case-insensitive replace, Regex is often used. However, the faster option is to do such a replacement char by char (byte by byte). You will be surprised: this will beat the built-in case-sensitive Replace function as well, in all cases except when there is nothing to replace:
http://www.codeproject.com/KB/string/fastestcscaseinsstringrep.aspx
It would be interesting to compare it with SQL's REPLACE...
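For comparison, the common Regex approach that the char-by-char version is meant to beat looks something like this (a minimal sketch; the input strings are just illustrative):

```csharp
// Sketch of the usual Regex-based case-insensitive replace.
// Short and elegant, but it pays for the Regex machinery on every call.
using System;
using System.Text.RegularExpressions;

class RegexReplaceDemo
{
    static void Main()
    {
        string input = "Hello HELLO hello";
        // Regex.Escape guards against metacharacters in the search string
        string result = Regex.Replace(input, Regex.Escape("hello"), "hi",
                                      RegexOptions.IgnoreCase);
        Console.WriteLine(result); // hi hi hi
    }
}
```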
July 20, 2010 at 9:20 pm
Eugene Elutin (7/19/2010)
The fastest string manipulation would be byte by byte. Not so elegant, a lot of manual coding, but very fast.
Now, I'll definitely agree with THAT! Done correctly, it will be very fast, indeed! I also agree that it would be interesting to have a simple race between the built in REPLACE (T-SQL) and a CLR.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 21, 2010 at 1:40 pm
I agree that this "may" perform better as CLR; however, if I were stuck doing this in T-SQL, I would create a blacklist table that housed the characters I wanted to scrub and their replacement values. This gives you a lot more flexibility and makes the process easier to maintain. You could potentially do something like this. I have not tested large-scale performance, but this should be better than a scalar function.
*I am assuming that you have a numbers table*
USE [tempdb]
GO
SET NOCOUNT ON;
GO
CREATE TABLE dbo.Blacklist(ExlChar CHAR(1),RepChar CHAR(1));
INSERT INTO Blacklist VALUES ('.','')
INSERT INTO Blacklist VALUES (' ','')
INSERT INTO Blacklist VALUES ('-','')
INSERT INTO Blacklist VALUES ('_','')
INSERT INTO Blacklist VALUES ('~','')
INSERT INTO Blacklist VALUES ('ž','z')
INSERT INTO Blacklist VALUES ('ó','o')
INSERT INTO Blacklist VALUES ('š','s')
INSERT INTO Blacklist VALUES ('í','i')
INSERT INTO Blacklist VALUES ('é','e')
--Create sample table
CREATE TABLE dbo.T(Id INT IDENTITY,col VARCHAR(50));
INSERT INTO t VALUES ('Haínes')
INSERT INTO t VALUES ('Haines')
INSERT INTO t VALUES (' -Haines')
INSERT INTO t VALUES ('-Haines ')
GO
--Create an Inline TVF function
CREATE FUNCTION dbo.fn_CleanStr()
RETURNS TABLE
AS
RETURN(
SELECT
    t1.Id,
    t1.col,
    (
        SELECT LTRIM(RTRIM(COALESCE(b.[RepChar], SUBSTRING(LTRIM(RTRIM(t2.col)), n.n, 1))))
        FROM dbo.T t2
        INNER JOIN dbo.Numbers n ON n.n <= LEN(LTRIM(t2.col))
        LEFT JOIN dbo.Blacklist b ON b.[ExlChar] = SUBSTRING(LTRIM(RTRIM(t2.col)), n.n, 1)
        WHERE t2.Id = t1.Id
        ORDER BY t2.Id, t2.col, n.n
        FOR XML PATH(''), TYPE
    ).value('.', 'varchar(max)') AS CleanStr
FROM dbo.T t1
)
GO
--Begin solution
SELECT fn.*
FROM dbo.fn_CleanStr() AS fn
WHERE fn.CleanStr = 'Haines'
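For reference, the dbo.Numbers tally table assumed above can be built with a cross join; a minimal sketch (8000 rows, far more than needed for VARCHAR(50) data):

```sql
-- Sketch: build the dbo.Numbers table the iTVF assumes.
SELECT TOP (8000)
       IDENTITY(INT, 1, 1) AS n
INTO   dbo.Numbers
FROM   sys.all_columns ac1
CROSS JOIN sys.all_columns ac2;

-- A clustered primary key keeps the join in the function efficient.
ALTER TABLE dbo.Numbers
  ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (n);
```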
July 22, 2010 at 5:51 am
Jeff Moden (7/20/2010)
...it would be interesting to have a simple race between the built in REPLACE (T-SQL) and a CLR.
In a simple race, a single T-SQL REPLACE outperforms the simple byte-by-byte CLR version: it is 3 times faster. Which is not surprising, as a CLR call does not come cheap...
However, for multiple replacements on the same string, a dedicated CLR replace function that handles multiple replacements in a single call catches up with the T-SQL one. Once the number of replacements reaches 15, the CLR starts to consistently outperform REPLACE.
Also, I believe that the CLR function (based on the code from the link supplied in my previous post) can be tuned further...
using System;
using System.Collections.Generic;
using System.Text;
using System.Data.SqlTypes;

namespace QuickReplaceX
{
    public class ReplaceX
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString Replace(string original, string pattern, string replacement)
        {
            return (SqlString)doReplace(original, pattern, replacement);
        }

        // Replace several patterns (separated by 'separator') with one replacement.
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString ReplaceMltPattern(string original, string patterns, string replacement, char separator)
        {
            string res = original;
            string[] aPatterns = patterns.Split(separator);
            for (int i = 0; i < aPatterns.Length; i++)
            {
                res = doReplace(res, aPatterns[i], replacement);
            }
            return (SqlString)res;
        }

        // Replace several patterns with their matching replacements, pairwise.
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlString ReplaceMltAll(string original, string patterns, string replacements, char separator)
        {
            string res = original;
            string[] aPatterns = patterns.Split(separator);
            string[] aReplacements = replacements.Split(separator);
            if (aPatterns.Length != aReplacements.Length)
            {
                throw new ApplicationException("ReplaceMltAll: number of patterns must match number of replacements");
            }
            for (int i = 0; i < aPatterns.Length; i++)
            {
                res = doReplace(res, aPatterns[i], aReplacements[i]);
            }
            return (SqlString)res;
        }

        // Case-insensitive replace done char by char, without Regex.
        private static string doReplace(string original, string pattern, string replacement)
        {
            int count, position0, position1;
            count = position0 = position1 = 0;
            string upperString = original.ToUpper();
            string upperPattern = pattern.ToUpper();
            int inc = (original.Length / pattern.Length) *
                      (replacement.Length - pattern.Length);
            char[] chars = new char[original.Length + Math.Max(0, inc)];
            while ((position1 = upperString.IndexOf(upperPattern, position0)) != -1)
            {
                for (int i = position0; i < position1; ++i)
                    chars[count++] = original[i];
                for (int i = 0; i < replacement.Length; ++i)
                    chars[count++] = replacement[i];
                position0 = position1 + pattern.Length;
            }
            if (position0 == 0) return original;
            for (int i = position0; i < original.Length; ++i)
                chars[count++] = original[i];
            return new string(chars, 0, count);
        }
    }
}
and to register:
CREATE ASSEMBLY [QuickReplaceX]
AUTHORIZATION [dbo]
FROM '[full path]\QuickReplaceX.dll'
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION QReplace(@original NVARCHAR(MAX), @pattern NVARCHAR(MAX), @replacement NVARCHAR(MAX))
RETURNS NVARCHAR(MAX) AS EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[Replace]
GO
CREATE FUNCTION QReplaceMltPattern(@original NVARCHAR(MAX), @patterns NVARCHAR(MAX), @replacement NVARCHAR(MAX), @separator NCHAR(1))
RETURNS NVARCHAR(MAX) AS EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[ReplaceMltPattern]
GO
CREATE FUNCTION QReplaceMltAll(@original NVARCHAR(MAX), @patterns NVARCHAR(MAX), @replacements NVARCHAR(MAX), @separator NCHAR(1))
RETURNS NVARCHAR(MAX) AS EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[ReplaceMltAll]
GO
Don't forget to enable CLR 🙂
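For completeness, enabling CLR integration at the server level looks like this (requires sysadmin or ALTER SETTINGS permission):

```sql
-- Enable CLR integration so the registered functions can run.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
```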
July 22, 2010 at 1:31 pm
Thank you Adam. The function that I posted is actually using a partial blacklist table, so I'll look at merging our thoughts.
Thanks Eugene for your assistance. I'll try to understand and implement your solution, also.
It may be a while before I have a chance to explore these options. I am suddenly swamped with other things of higher priority. I will post my results back to this forum when I am able to do so.
Julie
July 22, 2010 at 4:23 pm
Eugene Elutin (7/22/2010)
Jeff Moden (7/20/2010)
...it would be interesting to have a simple race between the built in REPLACE (T-SQL) and a CLR.
In a simple race, a single T-SQL REPLACE outperforms the simple byte-by-byte CLR version: it is 3 times faster. Which is not surprising, as a CLR call does not come cheap...
However, for multiple replacements on the same string, a dedicated CLR replace function that handles multiple replacements in a single call catches up with the T-SQL one. Once the number of replacements reaches 15, the CLR starts to consistently outperform REPLACE.
Also, I believe that the CLR function (based on the code from the link supplied in my previous post) can be tuned further...
Thanks for the testing Eugene. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.