fast file split on CRLF?

Question

fast file split on CRLF?

Eric Mamet

SSChampion

Points: 11728
More actions
July 4, 2008 at 6:10 am

#189204

Dear all,
I am trying to write a stored procedure that splits a file content (sent as one VARCHAR(MAX) parameter) into as many rows as there are lines.
The thing works but, despite using tally table, seems still too slow.
The IO is reasonable ("small" tally table) but it's burning too much CPU and I can't get it to perform faster...
On my machine, splitting a 1Mb file (~4000 lines) takes around 15 seconds and I would like it faster.
Here is an extract from my code that reproduces the speed problem (the actual data I use in @CSVContent is about 1Mb but I thought it might not look good in this post ;))
This code simply extracts the position of the various line separators and this is where my CPU "disappears".
DECLARE @CSVContent NVARCHAR(MAX)
SELECT @CSVContent = '"8ebbbbbbWhatever","and another column","boum"
"thingy","machin","truc"
"previous line was empty","one column missing"
last line...'
DECLARE @Separator VARCHAR(10), @SeparatorLen INT
SET @Separator = CHAR(13) + CHAR(10); -- CR LF
SET @SeparatorLen = LEN(@Separator);
DECLARE @CSVContentLen INT ; SELECT @CSVContentLen = LEN(@CSVContent)
PRINT @CSVContentLen;
;WITH Separators(Pos)
AS (
SELECT n_Id
FROM NetFinance.Number
WHERE SUBSTRING(@CSVContent, n_Id, @SeparatorLen) = @Separator
--WHERE CHARINDEX(@Separator, @CSVContent, n_Id) = n_Id -- worse...
AND n_Id <= @CSVContentLen
)
SELECT * FROM Separators
I got a 3 folds improvment by changing the CHARINDEX into SUBSTRING but I was hoping for a lot faster.
Thanks for your help
Eric

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004312 More actions · Answer 1

If you are trying to split the lines into a table, why are you just returning the separators? Do you actually want separate rows in a table or???

If the data was originally in a file, why did you import the whole file into a single variable instead of into a table to begin with?

If you don't mind... I need to see the code you used to create your Number table... please include any and all keys and indexes you may have put on the Number table.

And, Yes... Substring was a good move on your part.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RBarryYoung SSC Guru Points: 143329 More actions · Answer 2

I am parsing 1MB strings in about 6 seconds on my laptop with the following function:

ALTER function [dbo].[fnSplit3](

@parameter varchar(Max) -- the string to split

, @Seperator Varchar(64) -- the string to use as a seperator

)

RETURNS @Items TABLE(

ID INT -- the element number

, item VARCHAR(8000) -- the split-out string element

, OffSet int -- the original offest

--( not entirley accurate if LEN(@Seperator) > 1 because of the Replace() )

)

AS

BEGIN

/*

"Monster" Split in SQL Server 2005; From Jeff Moden, 2008/05/22

BYoung, 2008/06/18: Modified to be a Table-Valued Function

And to handle CL/LF or LF-only line breaks

(Note: making it inline made it slower, not faster)

Test: (scripts all triggers in your database)

Select Lines.Item

From sys.sql_modules M

Join sys.objects O on O.object_id = M.object_id

cross apply dbo.fnSplit1(M.definition, char(13)+char(10)) Lines

Where O.Type = 'TR'

Order by O.create_date, Lines.ID

*/

Declare @Sep char(1)

Set @Sep = char(10) --our seperator character (convenient, doesn't affect performance)

--NOTE: we make the @Sep character LF so that we will automatically

-- parse out rogue LF-only line breaks.

--===== Add start and end seprators to the Parameter so we can handle

-- all the elements the same way

-- Also change the seperator expressions to our seperator

-- character to keep all offsets = 1

SET @Parameter = @Sep+ Replace(@Parameter,@Seperator,@Sep) +@Sep

-- This reduces run-time about 10%

;WITH cteTally AS

(--==== Create a Tally CTE from 1 to whatever the length

-- of the parameter is

SELECT TOP (LEN(@Parameter))

ROW_NUMBER() OVER (ORDER BY t1.ID) AS N

FROM Master.sys.sysColumns t1

CROSS JOIN Master.sys.sysColumns t2

)

INSERT into @Items

SELECT ROW_NUMBER() OVER (ORDER BY N) AS Number,

SUBSTRING(@Parameter, N+1, CHARINDEX(@Sep, @Parameter, N+1)-N-1) AS Value

, N+1

FROM cteTally

WHERE N < LEN(@Parameter)

AND SUBSTRING(@Parameter, N, 1) = @Sep --Notice how we find the seperator

Return

END

Here is a test script that demonstrates this:

declare @CrLf char(2), @BigScript varchar(MAX)

Select @CrLf = char(13) + char(10), @BigScript = ''

--create the test String: 1MB

Select @BigScript = @BigScript+ @CrLf+ Definition

From Master.sys.All_SQL_Modules

Where Left(object_name(object_id), 4) = 'sp_M'

And Len(Definition) > 10000

Select @BigScript = Left(@BigScript, 1000000)

--since the prep time shouldn't count against the split-time

-- , display current time, now:

Select Len(@BigScript), Getdate()

--Split the string

Select Item

From dbo.fnSplit3(@BigScript, @CrLf)

Order by ID

--redisplay the time, for comparison

Select Getdate()

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Jeff Moden SSC Guru Points: 1004312 More actions · Answer 3

Here's the thing, though.... Eric said...

emamet (7/4/2008)
Dear all,
I am trying to write a stored procedure that splits a [font="Arial Black"]file content [/font](sent as one VARCHAR(MAX) parameter) into as many rows as there are lines.

I want to know if it actually came from a file because there's a hell of a lot better ways to split on ^C_R^L_F if it does...

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RBarryYoung SSC Guru Points: 143329 More actions · Answer 4

Yeah, I was wondering about that, but he also said "sent as one Varchar(max) parameter...", so it's a little contradictory. Anyway, I'm pretty sure that I'm going to fall asleep before you get your answer and I'm not sure if I'll be on tomorrow, so I posted.

If it applies, fine. If not, I'm sure it'll be ovbious soon enough.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Eric Mamet SSChampion Points: 11728 More actions · Answer 5

Sorry for the delay and thanks for answering...

I was still enjoying the weekend!

I'll try to answer the various questions:

1) Why returning just the separators? (Jeff)

I was simply extracting the bit that appeared slow.

After that, I use my separators in a set based way to extract the lines and do various processing but the rest performs properly.

2) dbo.fnSplit3 (RBarryYoung)

Using your functions takes 5 seconds on my machine while my stored proc takes 19 seconds when using your @BigScript...

Clearly, it works much better. I'll try to understand why!

3) Was the data in a file? (Jeff)

Yes, the initial data comes as a file and is manipulated by a C# program. It was my idea to do the first bit in one SQL chunk because

- We were told initially that the file could be about 200 lines long (not 1Mb!) and I did not think I would have any speed problem. For some reason, I am now told that it could be 6000 lines instead of 200...

- There is a bit more processing to do than what I extracted and doing everything in one stored procedure allowed me to keep it all in one transaction without going up and down the network (I like to keep processing during transactions on the server). This removed a fair amount of potential consistency/recovery problems

- The C# program is running on an Application Server machine (out of a pool) that does not have SSIS and we are not a very "SQL Server" friendly environment (the culture here is "Object Oriented") so I don't think SSIS would be welcome

- I don't know whether I could make the File available to SQL Server (for bulk copy for instance)

My bottom line is that this is currently processed via ADO.Net, hence I have a (default) timeout of 30 seconds.

I think my current performance could be just about "enough" but I would still increase my ADO.Net Command Timeout to 1 minute just in case.

However, if I understand why dbo.fnSplit3 performs so much better, I should have an overall performance well within my "acceptable" limits.

Eric 🙂

Eric Mamet SSChampion Points: 11728 More actions · Answer 6

Jeff,

Here is how I defined and populated my Number table.

CREATE TABLE NetFinance.Number(n_Id INT NOT NULL PRIMARY KEY);

GO

DECLARE @max-2 AS INT, @RowCount AS INT;

SET @max-2 = 1000000; -- 1,000,000 rows should be enough for the time being

SET @RowCount = 1;

INSERT INTO NetFinance.Number VALUES(1);

WHILE @RowCount * 2 <= @max-2

BEGIN

INSERT INTO NetFinance.Number

SELECT n_Id + @RowCount FROM NetFinance.Number;

SET @RowCount = @RowCount * 2;

END

INSERT INTO NetFinance.Number

SELECT n_Id + @RowCount

FROM NetFinance.Number

WHERE n_Id + @RowCount <= @max-2;

Jeff Moden SSC Guru Points: 1004312 More actions · Answer 7

rbarryyoung (7/7/2008)
Yeah, I was wondering about that, but he also said "sent as one Varchar(max) parameter...", so it's a little contradictory. Anyway, I'm pretty sure that I'm going to fall asleep before you get your answer and I'm not sure if I'll be on tomorrow, so I posted.
If it applies, fine. If not, I'm sure it'll be ovbious soon enough.

Not a problem, Barry... wasn't directed at whether your post was good or not (it always is). 🙂 I just wanted Eric to explain what the actual source of the information was because it seems odd to split lines this way...

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004312 More actions · Answer 8

emamet (7/7/2008)
Yes, the initial data comes as a file and is manipulated by a C# program. It was my idea to do the first bit in one SQL chunk because
- We were told initially that the file could be about 200 lines long (not 1Mb!) and I did not think I would have any speed problem. For some reason, I am now told that it could be 6000 lines instead of 200...
- There is a bit more processing to do than what I extracted and doing everything in one stored procedure allowed me to keep it all in one transaction without going up and down the network (I like to keep processing during transactions on the server). This removed a fair amount of potential consistency/recovery problems
- The C# program is running on an Application Server machine (out of a pool) that does not have SSIS and we are not a very "SQL Server" friendly environment (the culture here is "Object Oriented") so I don't think SSIS would be welcome
- I don't know whether I could make the File available to SQL Server (for bulk copy for instance)

You hit the nail on the head... BULK INSERT from, say, a regular ol' stored procedure would blow the socks off any of the methods you've tried so far... 6000 rows would import in subsecond times.

Barry's method for doing the split is ok... if, for some reason, that's still not good enough, post back... I'd need to know that expected maximum length for any row, as well. Thanks.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Eric Mamet SSChampion Points: 11728 More actions · Answer 9

Hi Jeff,

I can get better performance by using a function copied from Barry's but I just don't have a clue why it behaves better... which is pissing me off!

If I comment out the line

SET @Parameter = @Sep+ Replace(@Parameter,@Seperator,@Sep) +@Sep

then the performance drops from about 3 seconds (for the interesting bit) to 20 seconds! Which is the performance I get with my original code.

I know I don't get exactly the same result because I'll have CHAR(13) hanging around but why is it so much slower???

Looking at Statistics IO and Query Plans, I can't even spot any difference.

The only obvious thing is the "Statistics Time" which tells me that the difference is simply burning more CPU...

Please... God... Help me! 😀

PS: this is using Barry's data @BigScript

Jeff Moden SSC Guru Points: 1004312 More actions · Answer 10

Part of the problem is that you're not measuring time correctly... you're including the time it takes for Barry to build his example... I added a bit to Barry's existing code to more accurately measure the amount of time it actually takes for the function to do it's work...

SET STATISTICS TIME ON

--Split the string

Select Item

From dbo.fnSplit3(@BigScript, @CrLf)

Order by ID

SET STATISTICS TIME OFF

... and, I get the following results...

[font="Courier New"](1 row(s) affected)

(23433 row(s) affected)

SQL Server Execution Times:

CPU time = 5546 ms, elapsed time = 6471 ms.[/font]

Couple of other things... I've found that using a "Tally CTE" for bigger counts actually works better/faster than a large Tally table and I've published that fact. Also, your Numbers table has no Clustered Primary Key... it sometimes helps.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Eric Mamet SSChampion Points: 11728 More actions · Answer 11

Jeff,

The timing I took is, as Barry, a simple

SELECT GETDATE() AS [Before]

followed by a

SELECT GETDATE() AS [After]

precisely to avoid the setup bit and it gives me 3 seconds with the dbo.fnSplit3 untouched while this jumps to 20+ seconds if I just comment out the REPLACE line.

My tally table is defined with a Primary Key, hence clustered index on n_Id but I get the same result using Barry's tally CTE.

Bizarre bizarre... :crazy:

RBarryYoung SSC Guru Points: 143329 More actions · Answer 12

Heh. I don't mean to laugh, but I just spent the last three days trying to figure out the same thing.

First, credit where credit is due: Jeff Moden is being too modest in his replies because most of the fast parts of this are his.

As for fnSplit3's performance: I have tried everything that I can think of that should make it faster and instead it just makes it slower.

I tried making it an inline Table-Valued Function: Nope, that's about 10x slower than the procedureal vesion (this one).

I actually pass SQL_Modules.Definition column values directly as my inputs, which are NVarchar(MAX)'s, so I wrote a version with all NVarchar parameters and NChar variables to make it faster. Nope, the Varchar version is faster, even though I am passing it NVarchar values!

As you just found out, taking out the REPLACE, also makes it slower! Now that part is mine, but it's an accident: I needed it for functional reasons, I had no idea it would speed it up!

And it's only in the last month or so that I think that I understand why Jeff's Syscolumns trick is faster than using custom Tally/Numbers tables, but there are still some holes in that theory.

If Ken Henderson, God rest his soul, were still with us, I would ask him. He helped me to figure somethings out a couple times, years ago. But today, I don't know anyone with both the internals knowledge and the willingness to figure it out. Maybe some shiny new MVP could ask someone on the MS SQL Team? (hint, hint)

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Eric Mamet SSChampion Points: 11728 More actions · Answer 13

Cool, it's now official: I did not get insane yet... 😛

This was really driving me up the wall!

Although I don't understand why your function works the way it is, I'll use it.

By the way, (for potential futur readers) I do realise how much faster Bulk Copy would be but it is not an option in this particular project, due to Runtime considerations (Network access from the Db Server, etc).

Many thanks for your help.

I hope I did not spoil your week-end... There was a tremendous Tennis Final in wimbledon

Jeff Moden SSC Guru Points: 1004312 More actions · Answer 14

Ok... now, to understand how the function works, you have to read the following...

http://www.sqlservercentral.com/articles/TSQL/62867/

If you want a super deep knowledge of how to do splits well beyond what's in that article, then try this one...

http://www.sqlservercentral.com/articles/T-SQL/63003/

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)