November 20, 2012 at 6:03 pm
Eugene Elutin (11/20/2012)
Another one from Lynn Pettis, for some reason he couldn't post it himself...
declare @email varchar(100) = 'Doe.John@CompanyABC.com'
SELECT
SUBSTRING(@email, 1, CHARINDEX('.', @email) - 1),
SUBSTRING(@email, CHARINDEX('.', @email) + 1, CHARINDEX('@', @email) - CHARINDEX('.', @email) - 1),
RIGHT(@email,LEN(@email) - CHARINDEX('@', @email) + 1),
SUBSTRING(@email, CHARINDEX('.', @email) + 1, CHARINDEX('@', @email) - CHARINDEX('.', @email) - 1) + '.' +
SUBSTRING(@email, 1, CHARINDEX('.', @email) - 1) +
RIGHT(@email,LEN(@email) - CHARINDEX('@', @email) + 1)
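For readers following along, here is a rough Python model of what Lynn's expressions compute (variable names are mine, not from the post; T-SQL string functions are 1-based, hence the index arithmetic):

```python
# Model of the T-SQL above: split 'Last.First@domain' into its three pieces
# and recombine them as 'First.Last@domain'.
email = 'Doe.John@CompanyABC.com'

dot = email.index('.') + 1    # CHARINDEX('.', @email), 1-based
at = email.index('@') + 1     # CHARINDEX('@', @email), 1-based

last = email[:dot - 1]                    # SUBSTRING(@email, 1, dot - 1)
first = email[dot:at - 1]                 # SUBSTRING(@email, dot + 1, at - dot - 1)
domain = email[-(len(email) - at + 1):]   # RIGHT(@email, LEN(@email) - at + 1)

swapped = first + '.' + last + domain
print(swapped)  # John.Doe@CompanyABC.com
```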
Kept getting an error while trying to post from my phone. Third time is when I decided to try sending it to you, Eugene. That worked with no problem. Thank you.
November 20, 2012 at 10:35 pm
My version of the code is (changed a little bit):

declare @Email varchar(250)

PRINT 'Eugene Eletin'
SET STATISTICS TIME ON
select @Email = SUBSTRING(email, CHARINDEX('.', email)+1,CHARINDEX('@', email)-CHARINDEX('.', email)-1)
+ '.'
+ SUBSTRING(email, 0, CHARINDEX('.', email))
+ SUBSTRING(email, CHARINDEX('@', email),LEN(email))
FROM email
SET STATISTICS TIME OFF
PRINT 'Dwain.C'
SET STATISTICS TIME ON
SELECT @Email=RIGHT(
STUFF(email
,CHARINDEX('@', email)
,1, '.' + LEFT(email, CHARINDEX('.', email)-1) + '@')
,LEN(email))
FROM email
SET STATISTICS TIME OFF
SET STATISTICS TIME ON
PRINT 'Karthik'
SELECT @Email = SUBSTRING(email,CHARINDEX('.',email)+1,CHARINDEX('@',email)- CHARINDEX('.',email)-1)+ '.' +
REPLACE(email,'.'+ SUBSTRING(email,CHARINDEX('.',email)+1,CHARINDEX('@',email)- CHARINDEX('.',email)-1),'')
FROM email
SET STATISTICS TIME OFF
------------------------ Execute ------------------------
Eugene Eletin
Execution Time 156.
Adaptive Server cpu time: 15600 ms. Adaptive Server elapsed time: 15670 ms.
(3250809 rows affected)
Dwain.C
Execution Time 112.
Adaptive Server cpu time: 11200 ms. Adaptive Server elapsed time: 11143 ms.
(3250809 rows affected)
Karthik
Execution Time 158.
Adaptive Server cpu time: 15800 ms. Adaptive Server elapsed time: 15873 ms.
(3250809 rows affected)
------------------------- Done --------------------------
I used 3250809 rows for this testing.
Dwain's horse (method) is still running in first place in the race.
karthik
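Since Dwain's STUFF approach keeps winning, it may help to sketch its mechanics outside T-SQL. Below is a hedged Python model (the `stuff` helper and `swap_names` are my own names, emulating T-SQL's 1-based STUFF and RIGHT, not part of the thread's code):

```python
def stuff(s, start, length, replacement):
    # Emulates T-SQL STUFF: 1-based start, delete `length` chars, insert `replacement`
    return s[:start - 1] + replacement + s[start - 1 + length:]

def swap_names(email):
    dot = email.index('.') + 1   # CHARINDEX('.', email)
    at = email.index('@') + 1    # CHARINDEX('@', email)
    # Replace the '@' with '.' + <part-before-first-dot> + '@' ...
    stuffed = stuff(email, at, 1, '.' + email[:dot - 1] + '@')
    # ... then RIGHT(..., LEN(email)) trims the now-redundant leading 'Last.' prefix
    return stuffed[-len(email):]

print(swap_names('Doe.John@CompanyABC.com'))  # John.Doe@CompanyABC.com
```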
November 21, 2012 at 1:28 am
Modded version of Dwain's, much quicker:
PRINT 'Dwain.C COLLATE'
SET STATISTICS TIME ON
SELECT @Email=RIGHT(
STUFF(email
,CHARINDEX('@', email COLLATE Latin1_General_BIN2)
,1, '.' + LEFT(email COLLATE Latin1_General_BIN2, CHARINDEX('.', email COLLATE Latin1_General_BIN2)-1) + '@')
,LEN(email))
FROM #Emails
SET STATISTICS TIME OFF
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537
November 21, 2012 at 1:40 am
Mark-101232 (11/21/2012)
Modded version of Dwain's, much quicker:
PRINT 'Dwain.C COLLATE'
SET STATISTICS TIME ON
SELECT @Email=RIGHT(
STUFF(email
,CHARINDEX('@', email COLLATE Latin1_General_BIN2)
,1, '.' + LEFT(email COLLATE Latin1_General_BIN2, CHARINDEX('.', email COLLATE Latin1_General_BIN2)-1) + '@')
,LEN(email))
FROM #Emails
SET STATISTICS TIME OFF
Indeed! Now it is throwing times like this:
(1000000 row(s) affected)
Eugene Eletin
SQL Server Execution Times:
CPU time = 3719 ms, elapsed time = 3740 ms.
Scott Pletcher
SQL Server Execution Times:
CPU time = 4906 ms, elapsed time = 4940 ms.
Dwain.C
SQL Server Execution Times:
CPU time = 2219 ms, elapsed time = 2231 ms.
Dwain.C COLLATE
SQL Server Execution Times:
CPU time = 1875 ms, elapsed time = 1883 ms.
Good on you Mark! I never seem to be able to predict when the BIN COLLATE helps or not. Obviously, almost always on REPLACE. But sometimes on CHARINDEX I see no difference so I didn't think to include it on this one.
Some of the "prettiness" factor was lost though. 😛
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
November 21, 2012 at 2:50 am
Prettiness and speed are apparently not conflicting requirements:
PRINT 'Dwain.C COLLATE + CROSS APPLY'
SET STATISTICS TIME ON
SELECT @Email=RIGHT(
STUFF(a.email
,CHARINDEX('@', a.email)
,1, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1) + '@')
,LEN(a.email))
FROM #Emails
CROSS APPLY (
SELECT email COLLATE Latin1_General_BIN2) a(email)
SET STATISTICS TIME OFF
(1000000 row(s) affected)
Eugene Eletin
SQL Server Execution Times:
CPU time = 3875 ms, elapsed time = 3875 ms.
Scott Pletcher
SQL Server Execution Times:
CPU time = 4594 ms, elapsed time = 4582 ms.
Dwain.C
SQL Server Execution Times:
CPU time = 2266 ms, elapsed time = 2269 ms.
Dwain.C COLLATE
SQL Server Execution Times:
CPU time = 1875 ms, elapsed time = 1876 ms.
Dwain.C COLLATE + CROSS APPLY
SQL Server Execution Times:
CPU time = 1875 ms, elapsed time = 1867 ms.
November 21, 2012 at 3:13 am
PRINT 'Dwain.C COLLATE + CROSS APPLY'
SET STATISTICS TIME ON
SELECT @Email=RIGHT(
STUFF(a.email
...
SELECT email COLLATE Latin1_General_BIN2) a(email)
SET STATISTICS TIME OFF
...
Interesting! Especially because when I was doing my performance testing, I also came up with exactly the same idea of using STUFF! But it was the slowest of all!
Now it looks like the use of COLLATE makes things much faster. I wonder what will happen if the same COLLATE is added to all versions?
Also, if I have time today, I'll create a CLR one...
November 21, 2012 at 3:21 am
Eugene Elutin (11/21/2012)
PRINT 'Dwain.C COLLATE + CROSS APPLY'
SET STATISTICS TIME ON
SELECT @Email=RIGHT(
STUFF(a.email
...
SELECT email COLLATE Latin1_General_BIN2) a(email)
SET STATISTICS TIME OFF
...
Interesting! Especially because when I was doing my performance testing, I also came up with exactly the same idea of using STUFF! But it was the slowest of all!
Now it looks like the use of COLLATE makes things much faster. I wonder what will happen if the same COLLATE is added to all versions?
Also, if I have time today, I'll create a CLR one...
I tried adding the COLLATE to yours and Scott's and it narrows the gap slightly, but my dog is still leading the pack.
I've had good luck using STUFF in the past and I also think LEFT/RIGHT is more efficient than SUBSTRING for some reason.
November 21, 2012 at 5:29 am
Here we go!
First of all, I would like to mention that SET STATISTICS TIME ON and OFF is not the best way to measure execution time (J. Moden pointed this out some time ago, with explanations...).
So, I will just use DATEDIFF.
Here is my CLR, very basic, using Substrings:
[Microsoft.SqlServer.Server.SqlFunction]
public static String ReformatEmail(String str)
{
if (str == null) return null;
int fd = str.IndexOf('.');
int at = str.IndexOf('@');
if (fd < 0 || fd > at) return str;
return str.Substring(fd+1,at-fd-1) + "." +
str.Substring(0,fd) +
str.Substring(at);
}
Test setup (100,000 rows):
--===== Conditionally drop the test table to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#TestTable','U') IS NOT NULL
DROP TABLE #TestTable
;
--===== Create and populate the test table with test data.
SELECT TOP 100000
RowNum = IDENTITY(INT,1,1),
Email = STUFF(
'abcdefghijklmnopqrstuvwxyz@abcdefghijklmnopqrstuvw.com',
ABS(CHECKSUM(NEWID()))%20+3, 1, '.')
INTO #TestTable
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
;
GO
Test itself:
DECLARE @Start DATETIME, @End DATETIME
DECLARE @var varchar(max)
PRINT 'Using SUBSTRING and CHARINDEX'
SET @Start = GETUTCDATE()
select @var = SUBSTRING(Email, P.FD+1,P.AT-P.FD-1)
+ '.' + LEFT(Email, P.FD)
+ SUBSTRING(Email, P.AT,LEN(Email))
FROM #TestTable
CROSS APPLY (SELECT CHARINDEX('.', Email) AS FD, CHARINDEX('@', Email) AS AT ) P
PRINT DATEDIFF(ms,@start,GETUTCDATE())
PRINT 'Using PARSENAME'
SET @Start = GETUTCDATE()
select @var = PARSENAME(left(email, charindex('@', email) - 1), 1) + '.' +
PARSENAME(left(email, charindex('@', email) - 1), 2) +
substring(email, charindex('@', email), len(email))
FROM #TestTable
PRINT DATEDIFF(ms,@start,GETUTCDATE());
PRINT 'Using PARSENAME with CROSS APPLY'
SET @Start = GETUTCDATE()
select @var = PARSENAME(left(email, AT - 1), 1) + '.' +
PARSENAME(left(email, AT - 1), 2) +
SUBSTRING(email, AT, len(email))
FROM #TestTable
CROSS APPLY (SELECT CHARINDEX('@', Email)) P(AT)
PRINT DATEDIFF(ms,@start,GETUTCDATE());
PRINT 'Using STUFF (Dwain)'
SET @Start = GETUTCDATE()
SELECT @var=RIGHT(
STUFF(a.email
,CHARINDEX('@', a.email)
,1, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1) + '@')
,LEN(a.email))
FROM #TestTable
CROSS APPLY (SELECT email COLLATE Latin1_General_BIN2) a(email)
PRINT DATEDIFF(ms,@start,GETUTCDATE());
PRINT 'Using CLR'
SET @Start = GETUTCDATE()
select @var =[dbo].ReformatEmail(email)
FROM #TestTable
PRINT DATEDIFF(ms,@start,GETUTCDATE());
Let's see the results:
Using SUBSTRING and CHARINDEX
1983
Using PARSENAME
1873
Using PARSENAME with CROSS APPLY
1923
Using STUFF (Dwain)
260
Using CLR
343
Heh, CLR is now almost on par with Dwain's solution, but it's not there yet...
But are we comparing the same things?
The answer is NO!
Check this:
SELECT [dbo].ReformatEmail(email)
FROM (VALUES ('A.B@C.D'),('A@B.C'),(''),(NULL)) AS s(email)
SELECT RIGHT(
STUFF(a.email
,CHARINDEX('@', a.email)
,1, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1) + '@')
,LEN(a.email))
FROM (VALUES ('A.B@C.D'),('A@B.C'),(''),(NULL)) AS s(email)
CROSS APPLY (SELECT email COLLATE Latin1_General_BIN2) a(email)
Oops! Yeah, all the provided non-CLR solutions suffer from the same problem: they only work if the first part of the email has a "dot" in it. They do an incorrect transformation for the most common email format, name@domain.com, and they fail on empty strings...
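The guard in the CLR function above is exactly what the T-SQL versions are missing. A Python sketch of the same edge-case handling (a port of the C# logic; the function name is mine):

```python
def reformat_email(s):
    # Mirrors the CLR guard: only swap when a dot exists BEFORE the '@'
    if s is None:
        return None
    fd = s.find('.')
    at = s.find('@')
    if fd < 0 or fd > at:
        return s                      # no dot in the local part: leave unchanged
    return s[fd + 1:at] + '.' + s[:fd] + s[at:]

for e in ['A.B@C.D', 'A@B.C', '', None]:
    print(repr(reformat_email(e)))
# 'B.A@C.D', 'A@B.C', '', None
```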
Let's add a check to the STUFF solution:
PRINT 'Using STUFF (Dwain) with CHECK'
SET @Start = GETUTCDATE()
SELECT @var=
CASE WHEN CHARINDEX('.',a.email,0)>0 AND CHARINDEX('.',a.email,0) < CHARINDEX('@',a.email,0) THEN
RIGHT(
STUFF(a.email
,CHARINDEX('@', a.email)
,1, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1) + '@')
,LEN(a.email))
ELSE a.email END
FROM #TestTable
CROSS APPLY (SELECT email COLLATE Latin1_General_BIN2) a(email)
PRINT DATEDIFF(ms,@start,GETUTCDATE());
New results:
Using SUBSTRING and CHARINDEX
2076
Using PARSENAME
1923
Using PARSENAME with CROSS APPLY
1890
Using STUFF (Dwain)
266
Using STUFF (Dwain) with CHECK
366
Using CLR
343
Hurray! CLR finally wins :-D! I should be honest here: it does not always win, since the margin is very small; about 10% of the time they perform equally, and another 20% of the time the STUFF version is faster (based on 10 executions).
Actually, there is another aspect of using STUFF in Dwain's style: it changes the collation of the original string, which may not always be appropriate.
November 21, 2012 at 6:40 am
dwain.c (11/21/2012)
Prettiness and speed are apparently not conflicting requirements:
PRINT 'Dwain.C COLLATE + CROSS APPLY'
SET STATISTICS TIME ON
SELECT @Email=RIGHT(
STUFF(a.email
,CHARINDEX('@', a.email)
,1, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1) + '@')
,LEN(a.email))
FROM #Emails
CROSS APPLY (
SELECT email COLLATE Latin1_General_BIN2) a(email)
SET STATISTICS TIME OFF
(1000000 row(s) affected)
Eugene Eletin
SQL Server Execution Times:
CPU time = 3875 ms, elapsed time = 3875 ms.
Scott Pletcher
SQL Server Execution Times:
CPU time = 4594 ms, elapsed time = 4582 ms.
Dwain.C
SQL Server Execution Times:
CPU time = 2266 ms, elapsed time = 2269 ms.
Dwain.C COLLATE
SQL Server Execution Times:
CPU time = 1875 ms, elapsed time = 1876 ms.
Dwain.C COLLATE + CROSS APPLY
SQL Server Execution Times:
CPU time = 1875 ms, elapsed time = 1867 ms.
Another tweak helps a bit (maybe... assuming I've got this right)
Change
,1, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1) + '@')
to
,0, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1))
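To see why the tweak is safe: with length 0, STUFF inserts without deleting anything, so the original '@' stays in place and the trailing '@' in the replacement string becomes unnecessary. A quick Python check (the `stuff` helper emulating T-SQL's 1-based STUFF is my own, not from the thread):

```python
def stuff(s, start, length, replacement):
    # Emulates T-SQL STUFF: 1-based start, delete `length` chars, insert `replacement`
    return s[:start - 1] + replacement + s[start - 1 + length:]

email = 'Doe.John@CompanyABC.com'
dot = email.index('.') + 1
at = email.index('@') + 1
local = email[:dot - 1]                              # 'Doe'

replace_at = stuff(email, at, 1, '.' + local + '@')  # original: delete '@', re-add it
insert_only = stuff(email, at, 0, '.' + local)       # tweak: insert, keep the '@'

print(replace_at == insert_only)   # True: both yield the same intermediate string
print(replace_at[-len(email):])    # John.Doe@CompanyABC.com
```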
November 21, 2012 at 7:04 am
Mark-101232 (11/21/2012)
...Another tweak helps a bit (maybe... assuming I've got this right)
Change
,1, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1) + '@')
to
,0, '.' + LEFT(a.email, CHARINDEX('.', a.email)-1))
Nope, it doesn't do much for performance on 1,000,000 rows. Maybe if you have 100,000,000 emails, then one less string concatenation would be visible...
As per my post, all the non-CLR solutions lack the check for simple emails (ones which do not contain a "dot" before the "@"). Adding such a check makes them all slower.
November 21, 2012 at 7:16 am
Eugene,
If you're going to cast aspersions on my code with garbage data, I think you should first look to your own. Try this:
select
SUBSTRING(email, CHARINDEX('.', email)+1,CHARINDEX('@', email)-CHARINDEX('.', email)-1)
+ '.'
+ SUBSTRING(email, 0, CHARINDEX('.', email))
+ SUBSTRING(email, CHARINDEX('@', email),LEN(email))
FROM (VALUES ('A.B@C.D'),('A@B.C'),(''),(NULL)) AS s(email)
select
PARSENAME(left(email, charindex('@', email) - 1), 1) + '.' +
PARSENAME(left(email, charindex('@', email) - 1), 2) +
substring(email, charindex('@', email), len(email))
FROM (VALUES ('A.B@C.D'),('A@B.C'),(''),(NULL)) AS s(email)
Both of the original solutions (yours and Scott's) are subject to the same error! So both must introduce the overhead of the check in the event you want to compare apples to apples.
Also, I read Jeff's article where he commented on using SET STATISTICS TIME ON to check performance statistics, and my recollection is that he only cautioned against its use when timing user-defined functions. So I'll stick with it, thank you very much.
If you want to time mine with a check, try this version. May still not beat the CLR but at least I can feel comfortable that it is my code.
PRINT 'Dwain.C with CHECK'
SET STATISTICS TIME ON
SELECT @Email=CASE WHEN x > y THEN a.email
ELSE RIGHT(
STUFF(a.email
,y
,1, '.' + LEFT(a.email, CASE WHEN x > 0 THEN x-1 ELSE 0 END) + '@')
,LEN(a.email)) END
FROM #Emails a
CROSS APPLY (
SELECT CHARINDEX('.', a.email), CHARINDEX('@', a.email)) b(x, y)
SET STATISTICS TIME OFF
Edit: May have been too hot to notice at the time but the CHARINDEX('@', a.email) can be done only once in the CROSS APPLY, so I modified the code accordingly.
November 21, 2012 at 8:51 am
dwain.c (11/21/2012)
...Both of the original solutions (yours and Scott's) are subject to the same error! So both must introduce the overhead of the check in the event you want to compare apples to apples.
...
That is exactly what I have said in my long post:
all provided non-CLR solutions do suffer from the same problem
You are probably right about using or not using SET STATISTICS TIME; I honestly don't remember all the details about it, so I guess it's fine to use it in this case.
Also you are right about using "own code" - completely agree 😉
Actually, what do you think about using COLLATION? As soon as you remove the one used, everything becomes much slower...
November 21, 2012 at 5:27 pm
Eugene Elutin (11/21/2012)
Here we go!...
Actually, there is another aspect of using STUFF in Dwain's style - it does change collation of original string, which may not be appropriate sometimes.
Eugene Elutin (11/21/2012)
Actually, what do you think about using COLLATION? As soon as you remove the one used, everything becomes much slower...
I was wondering what you meant by the first quoted statement in your "long post." I'm thinking that changing the collation in a STUFF should not have any impact.
As to what I think about using COLLATE, well, I'm usually of the mind that if it works, go with it. Testing would be the key here to make sure there are not any unwanted side effects. So far I have seen none, but that doesn't mean there aren't any.
My biggest objection, I suppose, is how COLLATE clutters the code, and if you're not "in" on what it does (or it isn't commented), it would be easy enough for someone to come along later and remove it without realizing the consequences. That's why I modified the one case to do a single COLLATE in a CROSS APPLY (which lends itself to commenting nicely). In later test harnesses, I put the COLLATE on the email column of the #Emails table so as to spread the wealth and avoid the overhead of the CROSS APPLY, even though it seemed to be small.
I should also say, that while I commend you on your ability to quickly build working CLRs, in my opinion if you've got a reasonably close pure SQL solution it's probably better to simply go with it. Less in your CLR library to maintain. And in this particular case, the problem seems so specific I'd be surprised if the need ever came up again in the same application - it looks like a data migration (or student problem) kind of thing to me. To me "common library routines" should consist of a set of utility-oriented functions that are reusable many times under many different circumstances, rather than for one-off cases.
November 21, 2012 at 7:38 pm
On the subject of collations, the use of a binary based collation can work performance miracles on certain types of code. Yeah... it clutters up the code a bit but it can really be worth it.
As for someone not understanding what the COLLATE is for and possibly removing it for that reason, a simple comment in the code would do.
{Edit} And, no, I don't believe that COLLATE will help a STUFF (but I haven't tested it). Logically speaking, it should only help when string comparisons are being made.
--Jeff Moden
Change is inevitable... Change for the better is not.
November 21, 2012 at 7:42 pm
Jeff Moden (11/21/2012)
I don't believe that COLLATE will help a STUFF (but I haven't tested it). Logically speaking, it should only help when string comparisons are being made.
I didn't think so either. 🙂 That wasn't what I meant by impact.
It does seem to be helping the CHARINDEX in this case though, but sometimes I've found no effect.
Viewing 15 posts - 16 through 30 (of 35 total)