March 13, 2006 at 8:09 am
Jeff,
I was just answering the question posted by Carl. He was looking into using "another" tally table.
In regards to speed: I can tell you that it is going to be faster than the loop when the number of rows referenced by the cross join is not too high, or when a fixed string length is used instead of a char marker (in order to speed up local string searches). I have also tried several cases in which a loop has beaten a cross join solution on not very powerful machines. The reason I mention the machine power is that a loop solution is intrinsically "serial" and a cross join solution is intrinsically "parallel". If you happen to see a parallel execution in your plan, chances are that an 8- or 16-way SMP box will kill the beast a lot faster with a parallel solution than with a serial one.
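A quick way to test that on your own box is to run the same tally-table split twice and pin the second run to one scheduler with a MAXDOP hint, then compare the STATISTICS TIME output. Just a sketch: #SomeTable and its SomeString column are placeholder names, and dbo.Tally is the usual tally table.
SET STATISTICS TIME ON
SELECT SUBSTRING(' '+s.SomeString+' ', t.N+1,
       CHARINDEX(' ', ' '+s.SomeString+' ', t.N+1)-t.N-1) AS ParsedData
  FROM #SomeTable s,
       dbo.Tally t
 WHERE t.N < LEN(' '+s.SomeString+' ')
   AND SUBSTRING(' '+s.SomeString+' ', t.N, 1) = ' '
OPTION (MAXDOP 1) --force a serial plan; remove this line for the parallel run
SET STATISTICS TIME OFF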
Cheers,
* Noel
March 13, 2006 at 8:17 pm
Yes, and I haven't yet thanked you for the help. I see now that I was clumsily trying to recreate the DATALENGTH function, which I didn't know about. That's what I get for trying to take a shortcut by not looking over all of the commands as I usually would when learning a new language. I should know better.
Anyway, thanks very much. You got me over the hump and now everything is working perfectly in that section of the project. When I'm done, I should have a fairly customizable HL7 parsing engine done in TSQL.
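For anyone else who skipped over the string functions like I did, the difference between LEN and DATALENGTH shows up immediately with trailing spaces (a quick illustrative test):
DECLARE @s VARCHAR(20)
SET @s = 'HL7  '                    --note the two trailing spaces
SELECT LEN(@s)        AS LenChars,  --returns 3; LEN ignores trailing spaces
       DATALENGTH(@s) AS Bytes      --returns 5; DATALENGTH counts stored bytes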
Carl E. Campbell
nyprehabmed.org
March 2, 2008 at 8:35 pm
Hi all,
This topic is almost exactly what I need. I have looked at Jeff's code and I see the potential, but I am a bit confused as to how it actually works and how I can adapt it to what I need.
In the example, Jeff breaks apart a sentence into rows based on a "space" between each word, such that "My Name is Joe" becomes
My
Name
is
Joe
I have tried the code and I don't quite get it. But that is an aside.
My pressing issue is I am trying to get into rows this:
In a single column called cert (keeping it simple for example's sake), there is this information: " State Agency county seat |10/02/2007|10/02/2008, City Agency city seat |10/03/2008|10/07/2008"
I need to separate the data into rows based on the "comma" not a "space" as is in Jeff's code.
As I said, this is a simple example of what I need to do, as there are over 150,000 records in the table I am working with.
I would really appreciate it if someone can help.
Thanks
Gary
March 2, 2008 at 11:45 pm
Carl E. Campbell (3/10/2006)
I'm starting to get the hang of using these tally tables, but I'm a little foggy on the sequence of events (my experience up till now is in purely procedural languages). Could you, for instance, create a column that indicates the word number in addition to the sentence number, so that you would end up with:
I apologize for the delay. Yes... this is how it can be done...
--===== If it exists, drop the test table
IF OBJECT_ID('TempDB..#YourTable') IS NOT NULL
        DROP TABLE #YourTable

--===== Create a test table to contain the original posted data
     -- plus a couple more rows
CREATE TABLE #YourTable
        (
        SentenceID INT NOT NULL,
        Sentence   VARCHAR(8000)
        )
INSERT INTO #YourTable (SentenceID, Sentence)
SELECT 000001,'How are you today?' UNION ALL
SELECT 000002,'What is your name?' UNION ALL
SELECT 000003,'Now is the time for all good men to come to the aid of their country.'

--===== Add a primary key to the test table just because it's the
     -- right thing to do.
ALTER TABLE #YourTable
        ADD PRIMARY KEY CLUSTERED (SentenceID)

--===== Now, split the "Sentence" column at the " " and display 1 word per
     -- row along with its SentenceID number...
SELECT SentenceID,
       --Word position = number of spaces found so far in the padded string
       Posit = t.N - LEN(REPLACE(LEFT(' '+y.Sentence+' ', t.N), ' ', '')),
       SUBSTRING(' '+y.Sentence+' ', t.N+1, CHARINDEX(' ', ' '+y.Sentence+' ', t.N+1)-t.N-1) AS ParsedData
  FROM #YourTable y,
       dbo.Tally t
 WHERE t.N < LEN(' '+y.Sentence+' ')
   AND SUBSTRING(' '+y.Sentence+' ', t.N, 1) = ' '
 ORDER BY SentenceID
--Jeff Moden
Change is inevitable... Change for the better is not.
March 2, 2008 at 11:47 pm
noeld (3/10/2006)
For some things, you don't need the tally table
SELECT SentenceID,
       DATALENGTH(LEFT(' '+y.Sentence+' ', N)) - DATALENGTH(REPLACE(LEFT(' '+y.Sentence+' ', N), ' ', '')) AS InSentenceNo,
SUBSTRING(' '+y.Sentence+' ',N+1,CHARINDEX(' ',' '+y.Sentence+' ',N+1)-N-1) AS ParsedData
FROM #YourTable y,
dbo.Tally t
WHERE t.N < LEN(' '+y.Sentence+' ')
AND SUBSTRING(' '+y.Sentence+' ',N,1) =' '
ORDER BY SentenceID
Cheers,
Heh... you say you don't need a Tally table... but I'm pretty sure that's a tally table I see in your FROM clause...
--Jeff Moden
Change is inevitable... Change for the better is not.
March 2, 2008 at 11:51 pm
GF (3/2/2008)
Hi all, this topic is almost exactly what I need. I have looked at Jeff's code and I see the potential, but I am a bit confused as to how it actually works and how I can adapt it to what I need.
In the example, Jeff breaks apart a sentence into rows based on a "space" between each word, such that "My Name is Joe" becomes
My
Name
is
Joe
I have tried the code and I don't quite get it. But that is an aside.
My pressing issue is I am trying to get into rows this:
In a single column called cert (keeping it simple for example's sake), there is this information: " State Agency county seat |10/02/2007|10/02/2008, City Agency city seat |10/03/2008|10/07/2008"
I need to separate the data into rows based on the "comma" not a "space" as is in Jeff's code.
As I said, this is a simple example of what I need to do, as there are over 150,000 records in the table I am working with.
I would really appreciate it if someone can help.
Thanks
Gary
It's simple... if your delimiter is a comma, just replace everything in the code that is ' ' with ',' and you're done.
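For example, here's a minimal sketch of that substitution applied to your "cert" column (the #Certs table and its single row are made up from your post, just for illustration):
--===== Illustrative table and data built from the posted example
IF OBJECT_ID('TempDB..#Certs') IS NOT NULL
        DROP TABLE #Certs
CREATE TABLE #Certs (CertID INT NOT NULL PRIMARY KEY, Cert VARCHAR(8000))
INSERT INTO #Certs (CertID, Cert)
SELECT 1, 'State Agency county seat |10/02/2007|10/02/2008, City Agency city seat |10/03/2008|10/07/2008'

--===== Same split as before with ',' substituted for ' '
SELECT CertID,
       SUBSTRING(','+c.Cert+',', t.N+1, CHARINDEX(',', ','+c.Cert+',', t.N+1)-t.N-1) AS ParsedData
  FROM #Certs c,
       dbo.Tally t
 WHERE t.N < LEN(','+c.Cert+',')
   AND SUBSTRING(','+c.Cert+',', t.N, 1) = ','
 ORDER BY CertID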
--Jeff Moden
Change is inevitable... Change for the better is not.
March 5, 2009 at 11:41 am
Very helpful. Below is a variant that handles space-delimited data with double quotes around text.
select * from fnWord('1','"304191001" "BANITO ,VITO " " " "03/24/1957 " "M " "THE HILLS " "CA " "123450000 " "LOS ANGELES" "10/01/2008 " "10/12/2008 " 30 470 999 999 0 0 0 0 0 100 "M " "B01 " 0 8915 3902.53 1965.38 1937.15 -1254.80 -7659.80 "397732 " " " " " 999 "UNIDENTIFIED PHYS " 850 "ALL OTHER CASES "')
CREATE FUNCTION dbo.fnWord ( @SentenceID CHAR(8),
                             @Sentence   VARCHAR(1000) )
RETURNS @TableA TABLE ( SentenceID TEXT, Word TEXT, Debug TEXT )
AS
BEGIN
        SET @Sentence = @Sentence + ' '
        DECLARE @WordStart INT
        DECLARE @WordEnd   INT
        DECLARE @OpenQuote INT
        DECLARE @Debug     VARCHAR(80)
        SET @WordStart = 1
        SET @WordEnd   = 1
        SET @OpenQuote = 0
        WHILE @WordStart < LEN(@Sentence)
        BEGIN
                --If the word starts with a double quote, it ends at the
                --closing quote; otherwise it ends at the next space.
                SET @OpenQuote = 0
                IF SUBSTRING(@Sentence, @WordStart, 1) = '"'
                        SET @OpenQuote = 1
                IF @OpenQuote = 1
                        SET @WordStart = CHARINDEX('"', @Sentence, @WordStart+1) + 1
                ELSE
                        SET @WordStart = CHARINDEX(' ', @Sentence, @WordStart+1)
                SET @Debug = CAST(@WordStart AS VARCHAR) + ' ' + CAST(@WordEnd AS VARCHAR) + ' ' + CAST(@OpenQuote AS VARCHAR)
                INSERT @TableA
                SELECT @SentenceID, CAST(SUBSTRING(@Sentence, @WordEnd, @WordStart - @WordEnd) AS VARCHAR), @Debug
                SET @WordEnd   = @WordStart + 1
                SET @WordStart = @WordStart + 1
        END
        RETURN
END
GO
March 6, 2009 at 8:59 pm
Douglas De Ivey (10/17/2004)
No, what you want to do can't be done with SQL 2000. You would need to join your 'input table' to the output from the function.
Apparently the next release of SQL Server will support this; however, we will just have to wait...
Heh... sorry... couldn't let this one go. Any bets on that? 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
March 6, 2009 at 9:58 pm
Larry Trach (3/5/2009)
Very helpful. Below is a variant that handles space-delimited data with double quotes around text.
I have to ask why you would return the columns of the table variable as TEXT datatypes, especially when the largest input is limited to VARCHAR(1000) in your code. You can't join on TEXT datatypes, and there are very few functions you can run against them. Also, you don't need to use a loop, or even a UDF for that matter... take a look at the following articles...
http://www.sqlservercentral.com/articles/T-SQL/63003/
http://www.sqlservercentral.com/articles/TSQL/62867/
To split the text-qualified, space-delimited string, just replace those groups of delimiters with something like a tab or CHAR(2), and you're back to a nice, easy single-delimiter split using a Tally table.
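For instance, here's a rough sketch of that idea on a shortened version of your sample string. This single REPLACE pass only handles the quote-space-quote boundaries between quoted fields; the runs of unquoted numbers in your full string would need a similar extra pass:
--===== Collapse each quote-space-quote boundary into a single CHAR(2),
--      strip the remaining quotes, and split as usual with the Tally table.
DECLARE @Input VARCHAR(1000)
SET @Input = '"304191001" "BANITO ,VITO " "M " "THE HILLS "'
SET @Input = REPLACE(@Input, '" "', CHAR(2)) --field boundaries become CHAR(2)
SET @Input = REPLACE(@Input, '"', '')        --drop the leading/trailing quotes

SELECT SUBSTRING(CHAR(2)+@Input+CHAR(2), t.N+1,
       CHARINDEX(CHAR(2), CHAR(2)+@Input+CHAR(2), t.N+1)-t.N-1) AS ParsedData
  FROM dbo.Tally t
 WHERE t.N < LEN(CHAR(2)+@Input+CHAR(2))
   AND SUBSTRING(CHAR(2)+@Input+CHAR(2), t.N, 1) = CHAR(2)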
--Jeff Moden
Change is inevitable... Change for the better is not.
March 6, 2009 at 11:09 pm
a WHILE loop may actually perform better than a Tally table on functions like these
Jeff, do you know if the size of the tally table makes any difference when doing a cross join? I was curious if there was any overhead involved in having a million-row table applied to a character string that might have a maximum length of only 50 or so characters.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 7, 2009 at 9:06 am
Bob Hovious (3/6/2009)
a WHILE loop may actually perform better than a Tally table on functions like these
Jeff, do you know if the size of the tally table makes any difference when doing a cross join? I was curious if there was any overhead involved in having a million-row table applied to a character string that might have a maximum length of only 50 or so characters.
Damn... I never did formalize the While Loop testing for this. I'll get to that someday...
In the meantime, to answer your question, there's no difference if you've created the Tally table correctly. That is, you create the table and then apply a clustered PK to "N" so there's no fragmentation.
But, never take someone's word for something like that. Test it for yourself so the epidemic of "SQL Cloning" doesn't spread any further. Here's the test table setup code...
--=================================================================================================
-- Environment presets
--=================================================================================================
USE TEMPDB
SET NOCOUNT ON
--=================================================================================================
-- Conditionally drop the test tables so we can do reruns without much
-- effort
--=================================================================================================
IF OBJECT_ID('TempDB.dbo.Tally11K') IS NOT NULL
DROP TABLE dbo.Tally11K
IF OBJECT_ID('TempDB.dbo.Tally1M') IS NOT NULL
DROP TABLE dbo.Tally1M
IF OBJECT_ID('TempDB.dbo.JBMTest') IS NOT NULL
DROP TABLE dbo.JBMTest
--=================================================================================================
-- Build an 11K row Tally table
--=================================================================================================
--===== Create and populate an 11k row Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally11K
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally11K
ADD CONSTRAINT PK_Tally11K_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--=================================================================================================
-- Build a 1M row Tally table
--=================================================================================================
--===== Create and populate a million row Tally table on the fly
SELECT TOP 1000000
IDENTITY(INT,1,1) AS N
INTO dbo.Tally1M
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally1M
ADD CONSTRAINT PK_Tally1M_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--=================================================================================================
-- Build a row source test table (10K rows)
--=================================================================================================
--===== Create and populate a test table.
-- Column "RowNum" is an incrementing unique number
-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers
-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
-- for all rows.
-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
-- Jeff Moden
SELECT TOP 10000
RowNum = IDENTITY(INT,1,1),
SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,
SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06' AS VARCHAR(80)),
SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
SomeHex12 = RIGHT(NEWID(),12)
INTO dbo.JBMTest
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== A table is not properly formed unless a Primary Key has been assigned
ALTER TABLE dbo.JBMTest
ADD CONSTRAINT PK_JBMTest_RowNum PRIMARY KEY CLUSTERED (RowNum)
... and here's the comparison code...
--*************************************************************************************************
-- Test the difference between the 2 Tally tables 10 times
--*************************************************************************************************
--===== Declare and preset local variables
DECLARE @Counter INT
SET @Counter = 10
DECLARE @Bitbucket VARCHAR(8000) --Place to throw results away
--===== Start the loop
WHILE @Counter > 0
BEGIN
PRINT REPLICATE('=',100)
--=================================================================================================
-- Test the split against the 11k Tally table throwing away the results so we don't measure
-- the speed of the display
--=================================================================================================
SET STATISTICS TIME ON
SELECT @Bitbucket = SUBSTRING(',' + src.SomeCSV, t.N+1, CHARINDEX(',', src.SomeCSV+',', t.N)-t.N)
FROM dbo.Tally11K t
CROSS JOIN dbo.JBMTest src
WHERE SUBSTRING(',' + src.SomeCSV, t.N, 1) = ','
AND t.N < LEN(src.SomeCSV)+2
SET STATISTICS TIME OFF
PRINT REPLICATE('=',100)
--=================================================================================================
-- Test the split against the 1M Tally table
--=================================================================================================
SET STATISTICS TIME ON
SELECT @Bitbucket = SUBSTRING(',' + src.SomeCSV, t.N+1, CHARINDEX(',', src.SomeCSV+',', t.N)-t.N)
FROM dbo.Tally1M t
CROSS JOIN dbo.JBMTest src
WHERE SUBSTRING(',' + src.SomeCSV, t.N, 1) = ','
AND t.N < LEN(src.SomeCSV)+2
SET STATISTICS TIME OFF
PRINT REPLICATE('=',100)
--===== Bottom of the loop
SET @Counter = @Counter - 1
END
Last, but not least, here are the results using 2k5 SP2 Developer's Edition on a 1.8 GHz single-CPU desktop box with 1GB of RAM...
====================================================================================================
SQL Server Execution Times:
CPU time = 500 ms, elapsed time = 591 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 500 ms, elapsed time = 493 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 484 ms, elapsed time = 489 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 484 ms, elapsed time = 487 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 485 ms, elapsed time = 486 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 484 ms, elapsed time = 486 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 484 ms, elapsed time = 493 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 500 ms, elapsed time = 498 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 532 ms, elapsed time = 591 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 484 ms, elapsed time = 510 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 500 ms, elapsed time = 498 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 500 ms, elapsed time = 501 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 484 ms, elapsed time = 500 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 500 ms, elapsed time = 509 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 516 ms, elapsed time = 600 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 516 ms, elapsed time = 525 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 515 ms, elapsed time = 529 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 532 ms, elapsed time = 527 ms.
====================================================================================================
====================================================================================================
SQL Server Execution Times:
CPU time = 500 ms, elapsed time = 525 ms.
====================================================================================================
SQL Server Execution Times:
CPU time = 515 ms, elapsed time = 521 ms.
====================================================================================================
--Jeff Moden
Change is inevitable... Change for the better is not.
March 7, 2009 at 9:45 am
By an odd coincidence, I just did that. It doesn't seem to make a significant difference.
Results follow, parsing a sentence table of 1.8 million rows and change. I tested a standard million row tally table against a "tallyK" table of only 1000 rows and then a "tally100" table of 100 rows. The average sentence length was under 70 characters. Timings were based on results being written to disk.
Now I'm wondering how this would perform compared to a CROSS APPLY in SQL 2005 (a rough sketch of that version follows the test code below).
------- million row tally table
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(12845056 row(s) affected)
SQL Server Execution Times:
CPU time = 89094 ms, elapsed time = 410569 ms.
------- thousand row tally table
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(12845056 row(s) affected)
SQL Server Execution Times:
CPU time = 83266 ms, elapsed time = 404904 ms.
------- hundred row tally table
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(12845056 row(s) affected)
SQL Server Execution Times:
CPU time = 84234 ms, elapsed time = 406530 ms.
----------------------------------------------------------------------------------
-- first we build a word list (just to set up the test)
----------------------------------------------------------------------------------
declare @words table (word varchar(20))
insert into @words
select 'Apple' union all
select 'Baker' union all
select 'Cappucino' union all
select 'Delta' union all
select 'Echolalia' union all
select 'Forger' union all
select 'Wumpus' union all
select 'X-Ray' /*union all
select 'Capitol' union all
select 'Zoom' union all
select 'Martyr' union all
select 'Hapsburg' union all
select 'Sicilian' union all
select 'Tupelo'*/
----------------------------------------------------------------------------------
-- make a home for various combinations of words to simulate true sentences
----------------------------------------------------------------------------------
create table #sentence (sentenceID int identity(1,1) primary key, sentence varchar(1000))
;with
cte1 as (select w1.word + ' ' + w2.word as sentence from @words w1 cross join @words w2 where w1.word <> w2.word)
,cte2 as (select w1.word + ' ' + sentence as sentence from @words w1 cross join cte1 )
,cte3 as (select w1.word + ' ' + sentence as sentence from @words w1 cross join cte2 )
,cte4 as (select w1.word + ' ' + sentence as sentence from @words w1 cross join cte3 )
,cte5 as (select w1.word + ' ' + sentence as sentence from @words w1 cross join cte4 )
,cte6 as (select w1.word + ' ' + sentence as sentence from @words w1 cross join cte5 )
insert into #sentence
select sentence from cte6
select count(*) from #sentence
select max(len(sentence)) from #sentence
----------------------------------------------------------------------------------
-- LET THE GAMES BEGIN
----------------------------------------------------------------------------------
set statistics time on;
print '------- million row tally table'
SELECT SentenceID,
Posit = t.N-LEN(REPLACE(LEFT(' '+s.Sentence+' ',t.N), ' ', '')),
SUBSTRING(' '+s.Sentence+' ',t.N+1,CHARINDEX(' ',' '+s.Sentence+' ',t.N+1)-t.N-1) AS ParsedData
FROM #sentence s
CROSS JOIN dbo.Tally t
WHERE t.N < LEN(' '+s.Sentence+' ')
AND SUBSTRING(' '+s.Sentence+' ',N,1)= ' '
ORDER BY SentenceID
set statistics time off;
set statistics time on;
print '------- thousand row tally table'
SELECT SentenceID,
Posit = t.N-LEN(REPLACE(LEFT(' '+s.Sentence+' ',t.N), ' ', '')),
SUBSTRING(' '+s.Sentence+' ',t.N+1,CHARINDEX(' ',' '+s.Sentence+' ',t.N+1)-t.N-1) AS ParsedData
FROM #sentence s
CROSS JOIN dbo.TallyK t
WHERE t.N < LEN(' '+s.Sentence+' ')
AND SUBSTRING(' '+s.Sentence+' ',N,1)= ' '
ORDER BY SentenceID
set statistics time off;
set statistics time on;
print '------- hundred row tally table'
SELECT SentenceID,
Posit = t.N-LEN(REPLACE(LEFT(' '+s.Sentence+' ',t.N), ' ', '')),
SUBSTRING(' '+s.Sentence+' ',t.N+1,CHARINDEX(' ',' '+s.Sentence+' ',t.N+1)-t.N-1) AS ParsedData
FROM #sentence s
CROSS JOIN dbo.Tally100 t
WHERE t.N < LEN(' '+s.Sentence+' ')
AND SUBSTRING(' '+s.Sentence+' ',N,1)= ' '
ORDER BY SentenceID
set statistics time off;
-- drop table #sentence
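As for the CROSS APPLY question above, here's an untested sketch of how the same split might be phrased in SQL 2005, against the same #sentence and dbo.Tally tables:
SELECT s.SentenceID,
       ca.ParsedData
  FROM #sentence s
 CROSS APPLY (
        SELECT SUBSTRING(' '+s.Sentence+' ', t.N+1,
               CHARINDEX(' ', ' '+s.Sentence+' ', t.N+1)-t.N-1) AS ParsedData
          FROM dbo.Tally t
         WHERE t.N < LEN(' '+s.Sentence+' ')
           AND SUBSTRING(' '+s.Sentence+' ', t.N, 1) = ' '
       ) ca
 ORDER BY s.SentenceID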
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 7, 2009 at 10:41 am
Bob Hovious (3/7/2009)
By an odd coincidence, I just did that. It doesn't seem to make a significant difference.
I just knew you would... that's one of the things I can absolutely count on with you. Nicely done and thanks for posting your version... I'm gonna have to dog-ear this thread. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.
March 7, 2009 at 10:58 am
I do it strictly in self-defense.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
March 7, 2009 at 11:41 am
Bob Hovious (3/7/2009)
I do it strictly in self-defense.
BWAA-HAA!!! Me too! Let's get together and burn some ISO's sometime! 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.