March 28, 2023 at 10:09 am
I am tasked with writing a package to import a series of files. One of the files is 7.4 GB and has 229 columns (look, I didn't design this). From previous experience I suspect this file will take ages to import. My thought is to split the file into smaller chunks or partition it. I would appreciate some ideas on how to improve the performance.
March 28, 2023 at 11:53 am
Splitting the file should be the last resort, after you've tried the other options.
A 7.4 GB file should only take a few minutes to load, even on a small server, unless the destination table has lots of indexes or triggers, or the server is under constant stress.
Using bulk insert (fast load) with TABLOCK and appropriate buffer sizes and rows per batch is the trick.
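As a rough sketch of what that looks like (the table name, file path, delimiters, and row count below are assumptions for illustration, not from the original post):

BULK INSERT dbo.StagingWideTable
FROM 'D:\imports\bigfile.txt'
WITH (
    TABLOCK,                     -- table lock; needed for minimal logging into a heap
    ROWS_PER_BATCH = 16000000,   -- rough row-count hint for the optimizer
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2                 -- skip a header row, if the file has one
);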
March 28, 2023 at 11:56 am
You're on the right track. Break it up into parts. If possible, switch the database to simple recovery for the duration of the operation and keep the log clear between chunks of processing. Beyond that, it's all going to be about memory and disk management, so reduce contention there as much as you can. Smaller chunks are generally the way to go for this kind of import.
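A rough sketch of that recovery-model switch (the database name is a placeholder, and this assumes you can tolerate breaking the log backup chain for the load window):

ALTER DATABASE ImportDB SET RECOVERY SIMPLE;
-- ... run the chunked import here ...
CHECKPOINT;   -- under SIMPLE recovery, a checkpoint lets the log space be reused between chunks
ALTER DATABASE ImportDB SET RECOVERY FULL;
-- Take a full or differential backup afterwards to restart the log backup chain.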
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 28, 2023 at 5:09 pm
I am tasked with writing a package to import a series of files. One of the files is 7.4 GB and has 229 columns (look, I didn't design this). From previous experience I suspect this file will take ages to import. My thought is to split the file into smaller chunks or partition it. I would appreciate some ideas on how to improve the performance.
Do you know when and how to make "Minimally Logged" imports happen, and why they'll usually make your imports more than twice as fast without all the "chunking" strokes you're getting ready to go through? And don't say "Yes" just because you may have heard the term... do you actually know the requirements for "Minimal Logging", and have you ever had code that met those requirements and successfully executed?
One strong hint here is that just being in the SIMPLE Recovery Model does NOT guarantee "Minimal Logging", and if the direct target is a single table, importing "chunks" to that same table means that only the first "chunk" CAN be minimally logged (and there are other requirements that must be met even then). Partitioning and using "SWITCH" can certainly help there (especially if you "go parallel" with multiple "switch" tables), but you still haven't determined whether you're actually correct about how long the import will take.
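A rough sketch of the "SWITCH" idea (all the names, the partition number, and the file layout are assumptions for illustration):

-- Load a chunk into a staging table that exactly matches the target's structure,
-- sits on the target partition's filegroup, and carries a CHECK constraint
-- matching that partition's boundary.
BULK INSERT dbo.BigTable_Stage
FROM 'D:\imports\chunk_001.txt'
WITH (TABLOCK, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

-- Metadata-only operation: the loaded rows become partition 3 of the target.
ALTER TABLE dbo.BigTable_Stage
    SWITCH TO dbo.BigTable PARTITION 3;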
You've also not mentioned the automatic sequestration of bad rows for repair/reimport, etc, etc.
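For what it's worth, BULK INSERT can sequester bad rows to a file for later repair; a hedged example (the paths and error threshold are placeholders):

BULK INSERT dbo.StagingWideTable
FROM 'D:\imports\bigfile.txt'
WITH (
    TABLOCK,
    ERRORFILE = 'D:\imports\bigfile_bad_rows.txt',  -- rejected rows land here, with a companion .Error.Txt describing why
    MAXERRORS = 1000                                -- keep loading until this many rows have failed
);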
If I were doing this, Step 1 would be to make sure I could do it in a "Minimally Logged" fashion and then see how much time and how many resources it actually takes. Like they say, one good test is worth a thousand expert opinions... even if the expert is yourself. Like I tell folks, the best thing history will tell you is that you need to test again. 😀
Also, do you already have a target table set up? If so, using a BCP Format file will also improve performance even if you're using SSIS. If the target table already exists, it's pretty easy to generate the format file from it instead of trying to create one by hand (which is also not as difficult as many would have you believe).
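As a rough example of generating one from an existing table (the database, table, server names, and delimiter are placeholders), run bcp from a command prompt:

bcp MyDatabase.dbo.StagingWideTable format nul -c -x -f StagingWideTable.xml -t"|" -S MyServer -T

That writes an XML format file describing every column; drop the -x switch if you want the older non-XML style. The resulting file can then be referenced by BULK INSERT's FORMATFILE option (or by the SSIS Bulk Insert Task).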
--Jeff Moden
Change is inevitable... Change for the better is not.
March 28, 2023 at 6:49 pm
If you are familiar with SSIS, then use that. According to MS and others, SSIS is actually faster than BCP, although I've never fully tested that myself. That being said, SSIS defaults to a fairly normal batch size (10K maybe). Like others have indicated though, your server specs will determine performance more than anything.
March 28, 2023 at 10:48 pm
If you are familiar with SSIS, then use that. According to MS and others, SSIS is actually faster than BCP, although I've never fully tested that myself. That being said, SSIS defaults to a fairly normal batch size (10K maybe). Like others have indicated though, your server specs will determine performance more than anything.
That 10K batch size? Yeah... that's part of the reason why you don't get "Minimal Logging" except on the first batch.
And I agree with your assessment about BCP. I didn't suggest BCP... just a BCP format file, which will help both SSIS and BULK INSERT, which both use the same code to do the imports.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 29, 2023 at 6:10 am
Thanks, everyone. I passed the suggestions up the food chain, but they decided to just let it run. Seven hours later...
March 29, 2023 at 5:07 pm
I've just looked at one of our processes - a 16 million row file, 5.5 GB in size, 50 columns (mix of char, date, ints) - loading from a NAS share to the server (code running on the SQL Server itself) - 5 minutes to load it, without any special tuning other than the use of TABLOCK.
March 29, 2023 at 5:50 pm
Thanks, everyone. I passed the suggestions up the food chain, but they decided to just let it run. Seven hours later...
I agree with the others... they're doing something seriously wrong. We don't have enough information to suggest what that "wrong" might be, though. For example, are they loading it into an empty table or one that already has data in it? Does the table have any FKs that point to parent tables? Are there any indexes on the table? Any triggers on the table? Is the data source on the same subnet as the server? Etc., etc., etc.
--Jeff Moden
Change is inevitable... Change for the better is not.