April 22, 2009 at 3:00 pm
Sure you are right, a Tally should have a fillfactor of 100. Forgive me for my tests ;-).
Your test results on your laptop look quiet strange to me (maybe time for a new one...). I tried at work and at home on Sql2k5 and Sql2k8 and your solution seems always to be the fastest.
Greets
Flo
April 22, 2009 at 3:03 pm
Is there a specific post where I can find the SQL to setup a test suite?
April 22, 2009 at 3:04 pm
RBarryYoung (4/22/2009)
!!!Those are some freakin' incredible numbers!
Hi Barry!
Yeah! Looks like the second youth of "Tally" 🙂
April 22, 2009 at 3:09 pm
Lynn Pettis (4/22/2009)
Is there a specific post where I can find the SQL to setup a test suite?
Hi Lynn
Not containing all the different functions. If you want, I can try to setup a SQL Script with all containing SQL functions and one with the complete test environment for the string split functions.
The second one for the new tally solutions is just a combination of:
http://www.sqlservercentral.com/Forums/FindPost701763.aspx
And Bob's:
http://www.sqlservercentral.com/Forums/FindPost702016.aspx
Greets
Flo
April 22, 2009 at 4:02 pm
I've got a minor tweak to the double-barrelled tally solution. I think it's running consistently quicker in elapsed time, but sometimes my laptop has the older version coming out ahead. This version substitutes a MIN() for a TOP 1 in the nested cross apply. Flo, can you test it please?
-- double barrelled tally v2
select rowNum, item
into result3
from jbmTest
cross apply(select substring(csv,t1.O,Z) as item
from tally2 t1
cross apply (select min(t2.N) - t1.O as Z
from tally t2
where t2.N > t1.N
and substring(csv,t2.N,1) = ','
) ca2
where t1.N < len(csv) and substring(csv,t1.N,1) = ','
) ca1
A little explanation if anyone cares. Tally table speed has amazed me ever since reading about them in Jeff's article. During this thread, Paul White called it a brute force way to solve a problem, but brute force or not, magic is magic. Rationally, what was needed was more magic. If one tally table was fast, two would be faster.
In the classic tally solution, the tally table is used to find the beginning of each delimited string, but then reverts to CHARINDEX to find the end of the string. The problem was to be able to find both the beginning AND the ending position at tally table speeds. I first thought about doing a CROSS JOIN of two tally tables to get the begin and end combinations, but then realized that just as a CROSS APPLY is used to get the beginning position, it was possible to nest a second CROSS APPLY to find the ending position. To find the "next" separator, all that was needed was to find the lowest N from the second tally table (t2.N) which was greater than the N being used from the first tally table (t1.N).
I apologize for any confusion about the table names, and column names. "Tally2" is a two-column table (N int, O int). O is not zero, O is N+1, because that is all it contains. This is a trick to eliminate the N+1 calculations necessary to find the start of a string following a separator. It may not be worth the nanos saved, because it sacrifices flexibility in that it depends on the separator being a single character. The names will be changed to TwoColumnTally (N int, X int) in the final edition, or I could go back and edit my posted code tonight.
It still seems that there might be a faster way to isolate the ending position using a tally table. Hopefully somebody will find it. This has been a team effort all along and great fun to follow.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 22, 2009 at 4:10 pm
I'm also looking for the test data as well.
April 22, 2009 at 4:16 pm
I think it's posted earlier in the thread, Lynn. If I find it, how do I link you to an exact post number?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 22, 2009 at 4:19 pm
Click on the post # of the post. You can get a url like: http://www.sqlservercentral.com/Forums/FindPost702820.aspx. That one happens to be to you post asking how to it.
April 22, 2009 at 4:26 pm
DOH! 😛
I think it's buried in all of Flo's posted code here:
http://www.sqlservercentral.com/Forums/FindPost701763.aspx
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 23, 2009 at 4:19 am
Sorry to put a fly in the ointment, but I've tried a variation on your double-barrelled tally solution based on a problem mentioned here (splitting sentences into words at spaces).
I tested on my laptop (SS2K5 SP3) and a dual processor server (SS2k5 SP1) and in both cases the original tally+charindex always beat the double tally, taking about half the time (both cpu & elapsed).
Hopefully, I've made an implemenation error :-D.
Here's the code. The data I used came from a table of issues&actrions logged by users; presumably people willing to test can find something suitable to feed it.
use scratch
drop table #tally
drop table #tally2
create table #tally (n int not null, primary key clustered (n))
create table #tally2 (n int not null, o int not null primary key clustered (n))
insert into #tally
select top 11000
row_number() over(order by column_id) from master.sys.all_columns
insert into #tally2
select n,n+1 from #tally
drop table #word
drop table #words
drop table #words2
-- test data
-- This table contains nearly 5000 records of essentially arbitrary text
select issueid as 'Part_no',issue as 'Words'
into #word
from perf.dbo.tblissuetext
-- results
create table #words (SentenceId int, word varchar(max))
create table #words2 (SentenceId int, word varchar(max))
set statistics io on
set statistics time on
INSERT INTO #words
-- courtesy jeff moden ...
SELECT
w.part_no as 'SentenceID',
SUBSTRING(' '+w.words+' ',N+1,CHARINDEX(' ',' '+w.words+' ',N+1)-N-1) AS 'Word'
FROM
#tally
CROSS JOIN
#word w
WHERE
N t2.N
and
substring(' '+w.words+' ',t1.N,1) = ' '
) ca2
where
t2.N < len(' '+w.words+' ')
and
substring(' '+w.words+' ',t2.N,1) = ' '
) ca1
--select * from #words2 order by sentenceid
set statistics io off
set statistics time off
This was a typical execution (underscores+hex removed from temp table names for ease of reading)
Table '#words'. Scan count 0, logical reads 122766, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tally'. Scan count 4692, logical reads 14120, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#word'. Scan count 1, logical reads 99, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 1219 ms, elapsed time = 1358 ms.
(122441 row(s) affected)
Table '#words2'. Scan count 0, logical reads 122766, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tally'. Scan count 122441, logical reads 244928, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tally2'. Scan count 4692, logical reads 14195, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#word'. Scan count 1, logical reads 99, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 2656 ms, elapsed time = 2765 ms.
(122441 row(s) affected)
BTW, I tried preselecting the test data to prefix & suffix the spaces, but this didn't affect the outcome.
Derek
April 23, 2009 at 4:32 am
Derek and everyone,
I think it's worth pointing out now that we have wandered quite a long way from the original question by Flo. This is not a complaint - the discussion continues to fascinate; however I just wanted to highlight that we are now looking at solutions which are heavily optimized for shorter (and non-LOB) strings, with a high frequency of delimiting characters.
This question of optimization for certain cases, is currently steering me back away from T-SQL solutions and toward the various CLR routines which appear to be fastest in all tests so far, which scale well, and do not require permanent tables, extra indexes, or pre-computed columns for example.
I am very aware of the considerable mind-power that has been directed at trying to find a competing T-SQL solution; and I can't help but wonder what the C# solutions might look like had similar effort been expended there. To be clear - I have the greatest respect for Flo's coding (and SQL) skills: he exceeds my capabilities in C# by several orders of magnitude. That said, he is but one man (albeit with an outstanding apprentice!) - so, again, I wonder what other professional CLR developers might suggest?
Just my thoughts. I am still playing with T-SQL to find something closer to the Holy Grail I referred to recently. Bob and Jeff's solutions are clearly well thought out and cleverly implemented, but the problems I remarked on previously remain, and it bothers me.
Best wishes to all,
Paul
April 23, 2009 at 5:06 am
OK. I ran more tests. It seems that on my systems, tally+charindex performs better on lower numbers but the double tally scales better.
[font="Courier New"]
Tally + CHARINDEX Double tally
Rows in Rows out CPU Elapsed CPU Elapsed
4692 122441 2265 2418 4484 5723
9384 226967 4172 4908 8765 10413
18768 453934 8516 8867 11500 12589
37536 907868 16781 18498 14031 17002
75072 1815736 34860 43363 22187 28885
150144 3631472 67344 84889 45375 52511
[/font]
(All timings in ms)
Derek
April 23, 2009 at 5:44 am
Hi Bob
Bob Hovious (4/22/2009)
I've got a minor tweak to the double-barrelled tally solution. I think it's running consistently quicker in elapsed time, but sometimes my laptop has the older version coming out ahead. This version substitutes a MIN() for a TOP 1 in the nested cross apply. Flo, can you test it please?
Already tested with TOP(1) and ORDER. Duration is equal to the MIN().
"Tally2" is a two-column table (N int, O int). O is not zero, O is N+1, because that is all it contains. This is a trick to eliminate the N+1 calculations necessary to find the start of a string following a separator. It may not be worth the nanos saved, because it sacrifices flexibility in that it depends on the separator being a single character. The names will be changed to TwoColumnTally (N int, X int) in the final edition, or I could go back and edit my posted code tonight.
I already did some tests with a usually tally table instead of the new one. The performance goes down for about 7% what is more than some nanos:
CROSS APPLY (
SELECT SUBSTRING(CSV,t2.N+1,Z) AS item
FROM dbo.Tally t2
CROSS APPLY (
SELECT MIN(t1.N)- t2.N+1 as Z
FROM dbo.Tally t1
WHERE t1.N > t2.N
AND SUBSTRING(csv,t1.N,1) = ','
) ca2
WHERE t2.N t2.N
-- AND SUBSTRING(csv,t1.N,1) = ','
-- ) ca2
-- WHERE t2.N < LEN(CSV)
-- AND SUBSTRING(CSV,t2.N,1) = ','
) ca1
Greets
Flo
April 23, 2009 at 5:57 am
Derek Dongray (4/23/2009)
Sorry to put a fly in the ointment, but I've tried a variation on your double-barrelled tally solution based on a problem mentioned here (splitting sentences into words at spaces).I tested on my laptop (SS2K5 SP3) and a dual processor server (SS2k5 SP1) and in both cases the original tally+charindex always beat the double tally, taking about half the time (both cpu & elapsed).
Hopefully, I've made an implemenation error :-D.
Here's the code. The data I used came from a table of issues&actrions logged by users; presumably people willing to test can find something suitable to feed it.
Hi Derek
Thanks for your feedback!
I just tried your code. I just replaced your source table for the test data (I don't have it 😉 ) with the CSV table of Jeff. I get the following results:
--=====================================================
courtesy jeff moden ...
Table '#tally___000000000015'. Scan count 1000, logical reads 11000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#word___000000000017'. Scan count 1, logical reads 1001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 9406 ms, elapsed time = 10082 ms.
(528000 row(s) affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
--=====================================================
Bob Hovious variant
Table 'Worktable'. Scan count 1, logical reads 4058, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tally___000000000015'. Scan count 528, logical reads 1064, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#tally2___000000000016'. Scan count 1, logical reads 14, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#word___000000000017'. Scan count 1, logical reads 1001, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 4187 ms, elapsed time = 4548 ms.
(528000 row(s) affected)
Greets
Flo
April 23, 2009 at 7:06 am
Thanks, Flo 🙂
Derek, please don't feel you have to apologize. That needed to be tested. As usual, the answer is "It depends." The point is to understand what works best in which situations, and what the limitations are. That's why I warned everyone that the double-barreled solution was inflexible because it counted on a single character being the delimiter. That said, I see a lot of work with long strings consisting of lots of short elements separated by single character delimiters. Very seldom do I have to parse Moby Dick 😉
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 15 posts - 181 through 195 (of 522 total)
You must be logged in to reply to this topic. Login to reply