May 11, 2009 at 5:27 pm
Florian Reischl (5/11/2009)
Have a look at the attached statistic results of my tests.
Flo,
Which cursor based split solution were you using in those statistics?
May 11, 2009 at 5:56 pm
UMG Developer (5/11/2009)
1 Line1|Line2|Line3
Heh... "Tab or whatever to separate fields"...
Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2009 at 6:14 pm
Jeff Moden (5/11/2009)
UMG Developer (5/11/2009)
1 Line1|Line2|Line3Heh... "Tab or whatever to separate fields"...
Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?
Like I said it doesn't currently exist as a text file, but I can export it to one fairly easily. (Why I included the "whatever", as whatever would be workable.)
Currently it is in a table as previously described with three fields: Comment_ID (int), Comment_Part_Num (int), Comment_Part (varchar(4000))
There is a | at the end of each line in the comment, but a given line could be broken across many parts, and a single part can contain many lines.
I assume I wouldn't need to export the Comment_Part_Num field, just use it to order the export correctly.
May 11, 2009 at 8:12 pm
Florian Reischl (5/11/2009)
Hi PaulI know this sound's strange but all my tests showed, that the set-based solutions using a Tally table run into problems with huge text and large result items.
Remember Phil's test to split the book Moby Dick. I included this into my tests to split the text of the book by commas with following results:
CLR: 00:00.867
Cursor: 00:01.310
Set-Based: 00:06.260
The book has about 1.2 million characters and the text specified by UMG has more than 4.3 million characters.
Greets
Flo
Heh... Phil didn't tell you about the original "Whale Tests" we did where the Tally table won on my machine but lost on his. I'm going to have to make the time to do my own testing of all these different methods.... Heh... except for the CLR's. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2009 at 8:17 pm
UMG Developer (5/11/2009)
Jeff Moden (5/11/2009)
UMG Developer (5/11/2009)
1 Line1|Line2|Line3Heh... "Tab or whatever to separate fields"...
Please be very specific with no "whatevers"... What is the actual format of the lines/comments/other fields and what determines the end of a "full record"?
Like I said it doesn't currently exist as a text file, but I can export it to one fairly easily. (Why I included the "whatever", as whatever would be workable.)
Currently it is in a table as previously described with three fields: Comment_ID (int), Comment_Part_Num (int), Comment_Part (varchar(4000))
There is a | at the end of each line in the comment, but a given line could be broken across many parts, and a single part can contain many lines.
I assume I wouldn't need to export the Comment_Part_Num field, just use it to order the export correctly.
Ah... got it. I was confused. Thanks.
Take a look at the link in my signature. Any chance of you cranking out the first, say, 10 rows of data and attaching it as a file? It would provide a whole lot of answers for folks. Of course, don't do it if the file contains any sensitive or private information.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 11, 2009 at 8:48 pm
Jeff Moden (5/11/2009)
Take a look at the link in my signature. Any chance of you cranking out the first, say, 10 rows of data and attaching it as a file? It would provide a whole lot of answers for folks. Of course, don't do it if the file contains any sensitive or private information.
I was trying to avoid this, but here is a script to create a table with some sample data, which is essentially what I am starting with:
[font="Courier New"]--Drop table comments
--Create a sample comments table
CREATE TABLE Comments (
CommentID INT,
CommentPartNum INT,
CommentCont CHAR(1),
CommentDate DATETIME,
CommentPart VARCHAR(4000));
--Add some sample data
INSERT INTO Comments
SELECT
1, 1, 'N', '03/23/2009', 'Comment Line 1'
UNION ALL
SELECT
1, 2, 'N', '03/26/2009', 'Comment Line 2|Comment Line 3|Comment Line 4'
UNION ALL
SELECT
1, 3, 'Y', '03/26/2009', 'Comment Line 4 continued'
UNION ALL
SELECT
2, 1, 'N', '03/23/2009', 'Comment '
UNION ALL
SELECT
2, 2, 'Y', '03/24/2009', 'Line '
UNION ALL
SELECT
2, 3, 'Y', '03/24/2009', '1|Comment Line 2'
UNION ALL
SELECT
2, 4, 'N', '03/25/2009', 'Comment Line 3|Comment Line 4';
--Display the sample data
SELECT *
FROM Comments;
[/font]
(Creating fake data is easier than trying to sanitize it.)
It wasn't even as painful as I thought it would be...
Edited because I left out one small piece, and that is a field to signify if this part is a continuation of the previous part...
May 11, 2009 at 9:03 pm
Florian Reischl (5/11/2009)
Well, this brings up some more "fun"!I did some changes in my previous version. Give it a try:
Flo,
Thanks for that, I re-did it a little bit to handle multiple comments:
[font="Courier New"]
---==================================
-- Some source data
DECLARE @Source TABLE (Id INT NOT NULL IDENTITY, CommentID INT, CommentPart INT, Txt VARCHAR(100))
INSERT INTO @Source
SELECT 1, 1, 'aaa,bb'
UNION ALL SELECT 1, 2, 'b,ccc,ddd'
UNION ALL SELECT 1, 3, 'ddd'
UNION ALL SELECT 1, 4, 'ddd,e'
UNION ALL SELECT 1, 5, 'eee'
UNION ALL SELECT 1, 6, ',fff'
UNION ALL SELECT 2, 1, '1,22,33'
UNION ALL SELECT 2, 2, '3,4444,55'
UNION ALL SELECT 2, 3, '5'
UNION ALL SELECT 2, 4, '55,66'
UNION ALL SELECT 2, 5, '6666'
UNION ALL SELECT 2, 6, ',7777777'
---==================================
-- Destination table
DECLARE @Result TABLE (Id INT NOT NULL IDENTITY, SourceId INT, CommentID INT, Txt VARCHAR(100), Sequence INT)
---==================================
-- Delimiter
DECLARE @Delimiter VARCHAR(20)
DECLARE @DelimiterLen INT
SELECT @Delimiter = ','
SELECT @DelimiterLen = LEN(@Delimiter)
---==================================
-- Split items without a required leading/trailing delimiter
--
-- IMPORATNT:
-- If the source row was determined correct (with the delimiter) this will return an empty row
INSERT INTO @Result (
SourceId,
CommentId,
Txt,
Sequence
)
SELECT
s.Id,
s.CommentID,
l.Item,
1
FROM @Source s
CROSS APPLY
(
SELECT
SUBSTRING(s.Txt, 1, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, 1) - 1, -1), LEN(s.Txt))) Item,
1 Sorting
UNION ALL
SELECT TOP 100 PERCENT
SUBSTRING(s.Txt, t.N + @DelimiterLen, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, t.N + 1) - t.N - 1, -t.N - 1), LEN(s.Txt) - t.N)) Item,
2 Sorting
FROM Counter t
WHERE t.N <= LEN(s.Txt)
AND SUBSTRING(s.Txt, t.N, @DelimiterLen) = @Delimiter
ORDER BY Sorting, t.N
) l
---==================================
-- Move single fragments up
WHILE (1 = 1)
BEGIN
; WITH
single_fragment (CommentID, Id, Txt) AS
(
SELECT CommentID, MIN(Id), MIN(Txt) FROM @Result GROUP BY CommentID, SourceId HAVING COUNT(*) = 1
)
UPDATE r SET
r.Txt = r.Txt + sf.Txt,
r.Sequence = r.Sequence + 1
FROM @Result r
JOIN single_fragment sf ON r.CommentID = sf.CommentID AND r.Id + r.Sequence = sf.Id
IF (@@ROWCOUNT = 0)
BREAK
END
---==================================
-- Move last fragments up
; WITH
last_item (CommentID, Id) AS
(
SELECT CommentID, MAX(Id) FROM @Result GROUP BY CommentID, SourceId
)
UPDATE r1 SET
Txt = r1.Txt + r2.Txt,
Sequence = r1.Sequence + 1
FROM last_item li
JOIN @Result r1 ON li.CommentID = r1.CommentID AND li.Id = r1.Id
JOIN @Result r2 ON r1.CommentID = r2.CommentID AND r1.Id + r1.Sequence = r2.Id
WHERE
r1.Txt != ''
---==================================
-- Delete the rows which have been joined up to the previous
DELETE r1
FROM @Result r1
JOIN @Result r2 ON r1.Id > r2.Id AND r1.Id < r2.Id + r2.Sequence
---==================================
-- Delete the empty last rows which resulted by the initial
-- tally split
; WITH
last_item AS ( SELECT MAX(Id) Id FROM @Result GROUP BY SourceId )
DELETE r
FROM last_item li
JOIN @Result r ON li.Id = r.Id
WHERE r.Txt = ''
---==================================
-- Result
SELECT * FROM @Source
SELECT * FROM @Result
[/font]
I probably missed what was necessary on the last two parts of the process to deal with the multiple comments.
It took about a minute to copy the data into a table with the structure of @Source, but that wouldn't really count as I doubt it would add anything to the time of bringing the data over the WAN from Oracle during a real refresh. The actual splitting didn't use parallelism, but that may have been on purpose. It's been running over an hour so far, but I think it is close to done. 😉 (Edit: After 2 and half hours I killed it, it was on the "Move last fragments up" step.)
I'm still wrapping my head around everything it is doing, but it is an interesting approach that I wouldn't have come up with.
May 11, 2009 at 11:14 pm
Once you recognize the "pattern", you don't need all the ancillary conditional data moves and deletes. The "pattern" is that if it's NOT a continuation line, add a leading delimiter to the concatenation by CommentID in order by the CommentPartNum.
[font="Courier New"]--Drop table comments
--Create a sample comments table
CREATE TABLE Comments
(
CommentID INT NOT NULL,
CommentPartNum INT NOT NULL,
CommentCont CHAR(1) NOT NULL,
CommentDate DATETIME,
CommentPart VARCHAR(4000)
)
--Add some sample data
INSERT INTO Comments
SELECT 1, 1, 'N', '03/23/2009', 'Comment Line 1' UNION ALL
SELECT 1, 2, 'N', '03/26/2009', 'Comment Line 2|Comment Line 3|Comment Line 4' UNION ALL
SELECT 1, 3, 'Y', '03/26/2009', 'Comment Line 4 continued' UNION ALL
SELECT 2, 1, 'N', '03/23/2009', 'Comment ' UNION ALL
SELECT 2, 2, 'Y', '03/24/2009', 'Line ' UNION ALL
SELECT 2, 3, 'Y', '03/24/2009', '1|Comment Line 2' UNION ALL
SELECT 2, 4, 'N', '03/25/2009', 'Comment Line 3|Comment Line 4'
CREATE CLUSTERED INDEX IXC_Comments
ON Comments (CommentID,CommentPartNum,CommentCont)
--Display the sample data
SELECT *
FROM Comments
DECLARE @Delimiter CHAR(1)
SELECT @Delimiter = '|'
;WITH cteComments AS
(
SELECT t1.CommentID,(SELECT CASE CommentCont
WHEN 'N' THEN @Delimiter
ELSE '' END + CAST(t2.CommentPart AS VARCHAR(MAX))
FROM Comments t2
WHERE t2.CommentID = t1.CommentID --- must match GROUP BY below
ORDER BY t2.CommentPartNum
FOR XML PATH('')) + @Delimiter AS Comment
FROM Comments t1
GROUP BY t1.CommentID
)
SELECT c.CommentID,
ca.CommentNum,
ca.Item
FROM cteComments c
CROSS APPLY (SELECT ROW_NUMBER() OVER (ORDER BY t.N) AS CommentNum,
SUBSTRING(c.Comment, t.N +1, CHARINDEX(@Delimiter, c.Comment, t.N +1) -t.N -1) AS Item
FROM dbo.Tally t
WHERE t.N < LEN(c.Comment)
AND SUBSTRING(c.Comment, t.N, 1) = @Delimiter) ca
[/font]
==================================================================================================================================
--Jeff Moden
Change is inevitable... Change for the better is not.
May 12, 2009 at 2:13 am
Nice one Jeff 😎
Far away is close at hand in the images of elsewhere.
Anon.
May 12, 2009 at 2:17 am
UMG Developer (5/11/2009)
Florian Reischl (5/11/2009)
Have a look at the attached statistic results of my tests.Flo,
Which cursor based split solution were you using in those statistics?
Hi UMG
Here is the cursor I actually used in my tests. I compared it with all others but it seems to be the fastest.
[font="Courier New"]ALTER FUNCTION dbo.ufn_SplitString_Cursor_VM8
(
@Text VARCHAR(MAX),
@Delimiter VARCHAR(255)
)
RETURNS @ReturnData TABLE (Item VARCHAR(8000))
AS
BEGIN
DECLARE @Pos INT, @Next INT, @DelimiterLen INT
-- Get only once the len of the delimiter to avoid any recalculation
SELECT @DelimiterLen = LEN(@Delimiter)
-- Get the first occurence to get the item before the first delimiter
SELECT @Pos = CHARINDEX(@Delimiter, @Text, 1)
IF (@Pos = 0)
BEGIN
-- We didn't find any delimiter so return the complete text as one item
INSERT INTO @ReturnData
SELECT @Text
RETURN
END
ELSE
BEGIN
-- Insert the first item
INSERT INTO @ReturnData
SELECT SUBSTRING(@Text, 1, @Pos - 1)
END
-- Step over the first delimiter
SELECT @Pos = @Pos + @DelimiterLen
-- Start an infinite loop to avoid a second CHARINDEX call for each item
WHILE (1 = 1)
BEGIN
-- Get the next delimiter position from our previous position
SELECT @Next = CHARINDEX(@delimiter, @Text, @Pos)
-- CHARINDEX returned "0" so we have to break the loop
IF (@Next = 0) BREAK
-- Insert the next found item
INSERT INTO @ReturnData
SELECT SUBSTRING(@Text, @Pos, @Next - @Pos)
-- Step of the delimiter
SELECT @Pos = @Next + @DelimiterLen
END
-- Due to the fact that our cursor only jumps from delimter to delimiter
-- the last item would be lost
INSERT INTO @ReturnData
SELECT SUBSTRING(@Text, @Pos, LEN(@Text) - 1)
RETURN
END
[/font]
Thanks to Steve for the bug fix if the text doesn't contain any delimiter
Greets
Flo
May 12, 2009 at 2:21 am
UMG Developer (5/11/2009)
... (Edit: After 2 and half hours I killed it, it was on the "Move last fragments up" step.)
Well, errmm... You need bigger hardware! 😀
Seems to depend on the count of rows. I will inspect later today. Do you know how long the previous steps took?
Greets
Flo
May 12, 2009 at 8:57 am
Jeff Moden (5/11/2009)
Once you recognize the "pattern", you don't need all the ancillary conditional data moves and deletes. The "pattern" is that if it's NOT a continuation line, add a leading delimiter to the concatenation by CommentID in order by the CommentPartNum.
Jeff,
Now you've essentially got my roll it up and then split it method that takes ~2 minutes for the roll up, and ~16 minutes for the split. I do have to use the .VALUE trick with the FOR XML to avoid the entitization. (I know I should have posted my code.)
May 12, 2009 at 9:02 am
Florian Reischl (5/12/2009)
Well, errmm... You need bigger hardware! 😀Seems to depend on the count of rows. I will inspect later today. Do you know how long the previous steps took?
Bigger hardware would be nice, but it isn't exactly small as it is. I didn't time the pieces individually but I think it took about 1 hour and 15 minutes to get to the "Move last fragments up" step. When I built the source table I left one little thing out, but I doubt that caused the issue, it just would have messed up the results. I'll try again tonight.
I'll, also, try your splitter UDF and see how it works for our data. Thanks!
May 12, 2009 at 9:09 am
Do you get parallelism with Jeff's solution? I would think that would help a lot.
T-SQL UDFs always generate a serial plan.
(CLR UDFs don't - unless defined with a MAX datatype parameter)
May 12, 2009 at 9:21 am
Paul White (5/12/2009)
Do you get parallelism with Jeff's solution? I would think that would help a lot.T-SQL UDFs always generate a serial plan.
Hmm, I did not know that. Is that for any scalar-udf in any query?
Does it also apply to TV-UDFs, including inline-TVFs?
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
Viewing 15 posts - 436 through 450 (of 522 total)
You must be logged in to reply to this topic. Login to reply