May 11, 2009 at 5:27 pm
Florian Reischl (5/11/2009)
Have a look at the attached statistic results of my tests.
Flo,
Which cursor-based split solution were you using in those statistics?
May 11, 2009 at 5:56 pm
UMG Developer (5/11/2009)
1 Line1|Line2|Line3
Heh... "Tab or whatever to separate fields"...
Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2009 at 6:14 pm
Jeff Moden (5/11/2009)
UMG Developer (5/11/2009)
1 Line1|Line2|Line3
Heh... "Tab or whatever to separate fields"...
Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?
Like I said, it doesn't currently exist as a text file, but I can export it to one fairly easily. (That's why I included the "whatever": whatever format would be workable.)
Currently it is in a table as previously described with three fields: Comment_ID (int), Comment_Part_Num (int), Comment_Part (varchar(4000))
There is a | at the end of each line in the comment, but a given line could be broken across many parts, and a single part can contain many lines.
I assume I wouldn't need to export the Comment_Part_Num field, just use it to order the export correctly.
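To make that concrete (values invented purely for illustration), a single three-line comment might be stored like this:
[font="Courier New"]--Comment_ID  Comment_Part_Num  Comment_Part
--         1                 1  'Line1|Li'
--         1                 2  'ne2|Line3|'
--Concatenated in Comment_Part_Num order: 'Line1|Line2|Line3|' (three lines)
[/font]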
May 11, 2009 at 8:12 pm
Florian Reischl (5/11/2009)
Hi Paul
I know this sounds strange, but all my tests showed that the set-based solutions using a Tally table run into problems with huge text and large result items.
Remember Phil's test to split the book Moby Dick? I included it in my tests, splitting the text of the book by commas, with the following results:
CLR: 00:00.867
Cursor: 00:01.310
Set-Based: 00:06.260
The book has about 1.2 million characters and the text specified by UMG has more than 4.3 million characters.
Greets
Flo
Heh... Phil didn't tell you about the original "Whale Tests" we did where the Tally table won on my machine but lost on his. I'm going to have to make the time to do my own testing of all these different methods.... Heh... except for the CLR's. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2009 at 8:17 pm
UMG Developer (5/11/2009)
Jeff Moden (5/11/2009)
UMG Developer (5/11/2009)
1 Line1|Line2|Line3
Heh... "Tab or whatever to separate fields"...
Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?
Like I said, it doesn't currently exist as a text file, but I can export it to one fairly easily. (That's why I included the "whatever": whatever format would be workable.)
Currently it is in a table as previously described with three fields: Comment_ID (int), Comment_Part_Num (int), Comment_Part (varchar(4000))
There is a | at the end of each line in the comment, but a given line could be broken across many parts, and a single part can contain many lines.
I assume I wouldn't need to export the Comment_Part_Num field, just use it to order the export correctly.
Ah... got it. I was confused. Thanks.
Take a look at the link in my signature. Any chance of you cranking out the first, say, 10 rows of data and attaching it as a file? It would provide a whole lot of answers for folks. Of course, don't do it if the file contains any sensitive or private information.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2009 at 8:48 pm
Jeff Moden (5/11/2009)
Take a look at the link in my signature. Any chance of you cranking out the first, say, 10 rows of data and attaching it as a file? It would provide a whole lot of answers for folks. Of course, don't do it if the file contains any sensitive or private information.
I was trying to avoid this, but here is a script to create a table with some sample data, which is essentially what I am starting with:
[font="Courier New"]--Drop table comments
--Create a sample comments table
CREATE TABLE Comments (
CommentID INT,
CommentPartNum INT,
CommentCont CHAR(1),
CommentDate DATETIME,
CommentPart VARCHAR(4000));
--Add some sample data
INSERT INTO Comments
SELECT
1, 1, 'N', '03/23/2009', 'Comment Line 1'
UNION ALL
SELECT
1, 2, 'N', '03/26/2009', 'Comment Line 2|Comment Line 3|Comment Line 4'
UNION ALL
SELECT
1, 3, 'Y', '03/26/2009', 'Comment Line 4 continued'
UNION ALL
SELECT
2, 1, 'N', '03/23/2009', 'Comment '
UNION ALL
SELECT
2, 2, 'Y', '03/24/2009', 'Line '
UNION ALL
SELECT
2, 3, 'Y', '03/24/2009', '1|Comment Line 2'
UNION ALL
SELECT
2, 4, 'N', '03/25/2009', 'Comment Line 3|Comment Line 4';
--Display the sample data
SELECT *
FROM Comments;
[/font]
(Creating fake data is easier than trying to sanitize it.)
It wasn't even as painful as I thought it would be...
Edited because I left out one small piece: a field to signify whether this part is a continuation of the previous part...
May 11, 2009 at 9:03 pm
Florian Reischl (5/11/2009)
Well, this brings up some more "fun"!
I made some changes to my previous version. Give it a try:
Flo,
Thanks for that, I re-did it a little bit to handle multiple comments:
[font="Courier New"]
---==================================
-- Some source data
DECLARE @Source TABLE (Id INT NOT NULL IDENTITY, CommentID INT, CommentPart INT, Txt VARCHAR(100))
INSERT INTO @Source
SELECT 1, 1, 'aaa,bb'
UNION ALL SELECT 1, 2, 'b,ccc,ddd'
UNION ALL SELECT 1, 3, 'ddd'
UNION ALL SELECT 1, 4, 'ddd,e'
UNION ALL SELECT 1, 5, 'eee'
UNION ALL SELECT 1, 6, ',fff'
UNION ALL SELECT 2, 1, '1,22,33'
UNION ALL SELECT 2, 2, '3,4444,55'
UNION ALL SELECT 2, 3, '5'
UNION ALL SELECT 2, 4, '55,66'
UNION ALL SELECT 2, 5, '6666'
UNION ALL SELECT 2, 6, ',7777777'
---==================================
-- Destination table
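-- (Sequence counts the fragments merged into a row, itself included;
--  Id + Sequence points at the next unmerged row)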
DECLARE @Result TABLE (Id INT NOT NULL IDENTITY, SourceId INT, CommentID INT, Txt VARCHAR(100), Sequence INT)
---==================================
-- Delimiter
DECLARE @Delimiter VARCHAR(20)
DECLARE @DelimiterLen INT
SELECT @Delimiter = ','
SELECT @DelimiterLen = LEN(@Delimiter)
---==================================
-- Split items without a required leading/trailing delimiter
--
-- IMPORTANT:
-- If the source row was terminated correctly (with a trailing delimiter),
-- this will return an empty row
INSERT INTO @Result (
SourceId,
CommentId,
Txt,
Sequence
)
SELECT
s.Id,
s.CommentID,
l.Item,
1
FROM @Source s
CROSS APPLY
(
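-- First part: everything before the first delimiter
-- (the whole text if no delimiter is found)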
SELECT
SUBSTRING(s.Txt, 1, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, 1) - 1, -1), LEN(s.Txt))) Item,
1 Sorting
UNION ALL
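-- Remaining parts: the text following each delimiter position
-- found via the tally table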
SELECT TOP 100 PERCENT
SUBSTRING(s.Txt, t.N + @DelimiterLen, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, t.N + 1) - t.N - 1, -t.N - 1), LEN(s.Txt) - t.N)) Item,
2 Sorting
FROM Counter t
WHERE t.N <= LEN(s.Txt)
AND SUBSTRING(s.Txt, t.N, @DelimiterLen) = @Delimiter
ORDER BY Sorting, t.N
) l
---==================================
-- Move single fragments up
WHILE (1 = 1)
BEGIN
; WITH
single_fragment (CommentID, Id, Txt) AS
(
SELECT CommentID, MIN(Id), MIN(Txt) FROM @Result GROUP BY CommentID, SourceId HAVING COUNT(*) = 1
)
UPDATE r SET
r.Txt = r.Txt + sf.Txt,
r.Sequence = r.Sequence + 1
FROM @Result r
JOIN single_fragment sf ON r.CommentID = sf.CommentID AND r.Id + r.Sequence = sf.Id
IF (@@ROWCOUNT = 0)
BREAK
END
---==================================
-- Move last fragments up
; WITH
last_item (CommentID, Id) AS
(
SELECT CommentID, MAX(Id) FROM @Result GROUP BY CommentID, SourceId
)
UPDATE r1 SET
Txt = r1.Txt + r2.Txt,
Sequence = r1.Sequence + 1
FROM last_item li
JOIN @Result r1 ON li.CommentID = r1.CommentID AND li.Id = r1.Id
JOIN @Result r2 ON r1.CommentID = r2.CommentID AND r1.Id + r1.Sequence = r2.Id
WHERE
r1.Txt != ''
---==================================
-- Delete the rows which have been joined up to the previous
DELETE r1
FROM @Result r1
JOIN @Result r2 ON r1.Id > r2.Id AND r1.Id < r2.Id + r2.Sequence
---==================================
-- Delete the empty last rows which resulted from the initial
-- tally split
; WITH
last_item AS ( SELECT MAX(Id) Id FROM @Result GROUP BY SourceId )
DELETE r
FROM last_item li
JOIN @Result r ON li.Id = r.Id
WHERE r.Txt = ''
---==================================
-- Result
SELECT * FROM @Source
SELECT * FROM @Result
[/font]
I probably missed what was necessary on the last two parts of the process to deal with the multiple comments.
It took about a minute to copy the data into a table with the structure of @Source, but that wouldn't really count, as I doubt it would add anything to the time of bringing the data over the WAN from Oracle during a real refresh. The actual splitting didn't use parallelism, but that may have been on purpose. It's been running for over an hour so far, but I think it is close to done. 😉 (Edit: After 2 and a half hours I killed it; it was on the "Move last fragments up" step.)
I'm still wrapping my head around everything it is doing, but it is an interesting approach that I wouldn't have come up with.
May 11, 2009 at 11:14 pm
Once you recognize the "pattern", you don't need all the ancillary conditional data moves and deletes. The "pattern" is that if it's NOT a continuation line, you add a leading delimiter to the concatenation, grouping by CommentID and ordering by CommentPartNum.
[font="Courier New"]--Drop table comments
--Create a sample comments table
CREATE TABLE Comments
(
CommentID INT NOT NULL,
CommentPartNum INT NOT NULL,
CommentCont CHAR(1) NOT NULL,
CommentDate DATETIME,
CommentPart VARCHAR(4000)
)
--Add some sample data
INSERT INTO Comments
SELECT 1, 1, 'N', '03/23/2009', 'Comment Line 1' UNION ALL
SELECT 1, 2, 'N', '03/26/2009', 'Comment Line 2|Comment Line 3|Comment Line 4' UNION ALL
SELECT 1, 3, 'Y', '03/26/2009', 'Comment Line 4 continued' UNION ALL
SELECT 2, 1, 'N', '03/23/2009', 'Comment ' UNION ALL
SELECT 2, 2, 'Y', '03/24/2009', 'Line ' UNION ALL
SELECT 2, 3, 'Y', '03/24/2009', '1|Comment Line 2' UNION ALL
SELECT 2, 4, 'N', '03/25/2009', 'Comment Line 3|Comment Line 4'
CREATE CLUSTERED INDEX IXC_Comments
ON Comments (CommentID,CommentPartNum,CommentCont)
--Display the sample data
SELECT *
FROM Comments
DECLARE @Delimiter CHAR(1)
SELECT @Delimiter = '|'
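--Rebuild each comment: prepend the delimiter to every non-continuation
--part, concatenate the parts in CommentPartNum order via FOR XML PATH,
--and append a trailing delimiter so the tally split below finds every line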
;WITH cteComments AS
(
SELECT t1.CommentID,(SELECT CASE CommentCont
WHEN 'N' THEN @Delimiter
ELSE '' END + CAST(t2.CommentPart AS VARCHAR(MAX))
FROM Comments t2
WHERE t2.CommentID = t1.CommentID --- must match GROUP BY below
ORDER BY t2.CommentPartNum
FOR XML PATH('')) + @Delimiter AS Comment
FROM Comments t1
GROUP BY t1.CommentID
)
SELECT c.CommentID,
ca.CommentNum,
ca.Item
FROM cteComments c
CROSS APPLY (SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS CommentNum,
SUBSTRING(c.Comment, t.N +1, CHARINDEX(@Delimiter, c.Comment, t.N +1) -t.N -1) AS Item
FROM dbo.Tally t
WHERE t.N < LEN(c.Comment)
AND SUBSTRING(c.Comment, t.N, 1) = @Delimiter) ca
[/font]
==================================================================================================================================
--Jeff Moden
Change is inevitable... Change for the better is not.
May 12, 2009 at 2:13 am
Nice one Jeff 😎
Far away is close at hand in the images of elsewhere.
Anon.
May 12, 2009 at 2:17 am
UMG Developer (5/11/2009)
Florian Reischl (5/11/2009)
Have a look at the attached statistic results of my tests.
Flo,
Which cursor-based split solution were you using in those statistics?
Hi UMG
Here is the cursor I actually used in my tests. I compared it with all the others, and it seems to be the fastest.
[font="Courier New"]ALTER FUNCTION dbo.ufn_SplitString_Cursor_VM8
(
@Text VARCHAR(MAX),
@Delimiter VARCHAR(255)
)
RETURNS @ReturnData TABLE (Item VARCHAR(8000))
AS
BEGIN
DECLARE @Pos INT, @Next INT, @DelimiterLen INT
-- Get the length of the delimiter once to avoid any recalculation
SELECT @DelimiterLen = LEN(@Delimiter)
-- Find the first occurrence to get the item before the first delimiter
SELECT @Pos = CHARINDEX(@Delimiter, @Text, 1)
IF (@Pos = 0)
BEGIN
-- We didn't find any delimiter so return the complete text as one item
INSERT INTO @ReturnData
SELECT @Text
RETURN
END
ELSE
BEGIN
-- Insert the first item
INSERT INTO @ReturnData
SELECT SUBSTRING(@Text, 1, @Pos - 1)
END
-- Step over the first delimiter
SELECT @Pos = @Pos + @DelimiterLen
-- Start an infinite loop to avoid a second CHARINDEX call for each item
WHILE (1 = 1)
BEGIN
-- Get the next delimiter position from our previous position
SELECT @Next = CHARINDEX(@Delimiter, @Text, @Pos)
-- CHARINDEX returned "0" so we have to break the loop
IF (@Next = 0) BREAK
-- Insert the next found item
INSERT INTO @ReturnData
SELECT SUBSTRING(@Text, @Pos, @Next - @Pos)
-- Step over the delimiter
SELECT @Pos = @Next + @DelimiterLen
END
-- Because the loop only jumps from delimiter to delimiter,
-- the last item has to be added separately
INSERT INTO @ReturnData
SELECT SUBSTRING(@Text, @Pos, LEN(@Text) - @Pos + 1)
RETURN
END
[/font]
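A quick usage sketch (assuming the function has been created in dbo; sample values invented):
[font="Courier New"]--Returns one row per item: 'Line1', 'Line2', 'Line3'
SELECT Item
FROM dbo.ufn_SplitString_Cursor_VM8('Line1|Line2|Line3', '|')
[/font]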
Thanks to Steve for the bug fix for the case where the text doesn't contain any delimiter.
Greets
Flo
May 12, 2009 at 2:21 am
UMG Developer (5/11/2009)
... (Edit: After 2 and a half hours I killed it; it was on the "Move last fragments up" step.)
Well, errmm... You need bigger hardware! 😀
Seems to depend on the number of rows. I will investigate later today. Do you know how long the previous steps took?
Greets
Flo
May 12, 2009 at 8:57 am
Jeff Moden (5/11/2009)
Once you recognize the "pattern", you don't need all the ancillary conditional data moves and deletes. The "pattern" is that if it's NOT a continuation line, add a leading delimiter to the concatenation by CommentID in order by the CommentPartNum.
Jeff,
Now you've essentially got my roll-it-up-and-then-split-it method, which takes ~2 minutes for the roll-up and ~16 minutes for the split. I do have to use the .value() trick with the FOR XML to avoid the entitization. (I know, I should have posted my code.)
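For anyone following along, a minimal sketch of that trick against Jeff's sample table (delimiter hardcoded as | for brevity): FOR XML PATH('') alone entitizes characters such as &, < and >, but adding the ,TYPE directive and reading the result back with .value() returns the raw text:
[font="Courier New"]SELECT t1.CommentID,
       (SELECT CASE t2.CommentCont WHEN 'N' THEN '|' ELSE '' END
               + CAST(t2.CommentPart AS VARCHAR(MAX))
        FROM Comments t2
        WHERE t2.CommentID = t1.CommentID
        ORDER BY t2.CommentPartNum
        FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)') + '|' AS Comment
FROM Comments t1
GROUP BY t1.CommentID
[/font]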
May 12, 2009 at 9:02 am
Florian Reischl (5/12/2009)
Well, errmm... You need bigger hardware! 😀
Seems to depend on the number of rows. I will investigate later today. Do you know how long the previous steps took?
Bigger hardware would be nice, but it isn't exactly small as it is. I didn't time the pieces individually, but I think it took about 1 hour and 15 minutes to get to the "Move last fragments up" step. When I built the source table I left one little thing out; I doubt that caused the issue, but it would have messed up the results. I'll try again tonight.
I'll also try your splitter UDF and see how it works with our data. Thanks!
May 12, 2009 at 9:09 am
Do you get parallelism with Jeff's solution? I would think that would help a lot.
T-SQL UDFs always generate a serial plan.
(CLR UDFs don't - unless defined with a MAX datatype parameter)
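A minimal repro sketch (the function, table, and column names here are made up): define any T-SQL scalar UDF, reference it in a query, and compare the actual execution plans with and without the call.
[font="Courier New"]--Any reference to a T-SQL scalar UDF forces a serial plan
CREATE FUNCTION dbo.ufn_Identity (@Value INT)
RETURNS INT
AS
BEGIN
    RETURN @Value;
END;
GO
--On a large enough table, the first query can go parallel;
--the second cannot because of the UDF call
SELECT SomeColumn FROM dbo.BigTable;
SELECT dbo.ufn_Identity(SomeColumn) FROM dbo.BigTable;
[/font]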
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
May 12, 2009 at 9:21 am
Paul White (5/12/2009)
Do you get parallelism with Jeff's solution? I would think that would help a lot.
T-SQL UDFs always generate a serial plan.
Hmm, I did not know that. Is that true for any scalar UDF in any query?
Does it also apply to TV-UDFs, including inline TVFs?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]