March 28, 2023 at 10:09 am
I am tasked with writing a package to import a series of files. One of the files is 7.4 GB and has 229 columns (look, I didn't design this). From previous experience I suspect this file will take ages to import. My thought is to split the file into smaller chunks or partition it. I would appreciate some ideas on how to improve the performance.
March 28, 2023 at 11:53 am
Splitting the file should be the last resort, after you've tried the other options.
A 7.4 GB file should only take a few minutes to load, even on a small server, unless the destination table has lots of indexes or triggers, or the server is under constant stress.
Using bulk insert (fast load) with TABLOCK and appropriate buffer sizes and rows per batch is the trick.
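As a rough sketch of what that looks like (the table name, file path, delimiters, and row count below are assumptions for illustration, not from the original post):

BULK INSERT dbo.StagingWideTable
FROM 'D:\imports\bigfile.txt'
WITH (
    TABLOCK,                     -- table lock; needed for minimal logging into a heap
    ROWS_PER_BATCH = 16000000,   -- rough row-count hint for the optimizer
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2                 -- skip a header row, if the file has one
);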
March 28, 2023 at 11:56 am
You're on the right track. Break it up into parts. If possible, switch the database to simple recovery for the duration of the operation and keep the log clear between chunks of processing. Beyond that, it's all going to be about memory and disk management, so reduce contention there as much as you can. Smaller chunks are generally the way to go for this kind of import.
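A rough sketch of that recovery-model switch (the database name is a placeholder, and this assumes you can tolerate breaking the log backup chain for the load window):

ALTER DATABASE ImportDB SET RECOVERY SIMPLE;
-- ... run the chunked import here ...
CHECKPOINT;   -- under SIMPLE recovery, a checkpoint lets the log space be reused between chunks
ALTER DATABASE ImportDB SET RECOVERY FULL;
-- Take a full or differential backup afterwards to restart the log backup chain.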
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 28, 2023 at 5:09 pm
I am tasked with writing a package to import a series of files. One of the files is 7.4 GB and has 229 columns (look, I didn't design this). From previous experience I suspect this file will take ages to import. My thought is to split the file into smaller chunks or partition it. I would appreciate some ideas on how to improve the performance.
Do you know when and how to make "Minimally Logged" imports happen, and why they'll usually make your imports more than twice as fast without all the "chunking" strokes you're getting ready to go through? And don't say "Yes" just because you may have heard the term... do you actually know the requirements for "Minimal Logging", and have you ever had code that met those requirements and successfully executed?
One strong hint here is that just being in the SIMPLE Recovery Model does NOT guarantee "Minimal Logging", and if the direct target is a single table, importing "chunks" to that same table means that only the first "chunk" CAN be minimally logged (and there are other requirements that must be met even then). Partitioning and using "SWITCH" can certainly help there (especially if you "go parallel" with multiple "switch" tables), but you still haven't determined whether you're actually correct about how long the import will take.
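A rough sketch of the "SWITCH" idea (all the names, the partition number, and the file layout are assumptions for illustration):

-- Load a chunk into a staging table that exactly matches the target's structure,
-- sits on the target partition's filegroup, and carries a CHECK constraint
-- matching that partition's boundary.
BULK INSERT dbo.BigTable_Stage
FROM 'D:\imports\chunk_001.txt'
WITH (TABLOCK, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

-- Metadata-only operation: the loaded rows become partition 3 of the target.
ALTER TABLE dbo.BigTable_Stage
    SWITCH TO dbo.BigTable PARTITION 3;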
You've also not mentioned the automatic sequestration of bad rows for repair/reimport, etc, etc.
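For what it's worth, BULK INSERT can sequester bad rows to a file for later repair; a hedged example (the paths and error threshold are placeholders):

BULK INSERT dbo.StagingWideTable
FROM 'D:\imports\bigfile.txt'
WITH (
    TABLOCK,
    ERRORFILE = 'D:\imports\bigfile_bad_rows.txt',  -- rejected rows land here, with a companion .Error.Txt describing why
    MAXERRORS = 1000                                -- keep loading until this many rows have failed
);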
If I were doing this, Step 1 would be to make sure I could do it in a "Minimally Logged" fashion and then see how much time and how many resources it actually takes. Like they say, one good test is worth a thousand expert opinions... even if the expert is yourself. Like I tell folks, the best thing history will tell you is that you need to test again. 😀
Also, do you already have a target table set up? If so, using a BCP Format file will also improve performance even if you're using SSIS. If the target table already exists, it's pretty easy to generate the format file from it instead of trying to create one by hand (which is also not as difficult as many would have you believe).
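As a rough example of generating one from an existing table (the database, table, server names, and delimiter are placeholders), run bcp from a command prompt:

bcp MyDatabase.dbo.StagingWideTable format nul -c -x -f StagingWideTable.xml -t"|" -S MyServer -T

That writes an XML format file describing every column; drop the -x switch if you want the older non-XML style. The resulting file can then be referenced by BULK INSERT's FORMATFILE option (or by the SSIS Bulk Insert Task).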
--Jeff Moden
Change is inevitable... Change for the better is not.
March 28, 2023 at 6:49 pm
If you are familiar with SSIS, then use that. According to MS and others, SSIS is actually faster than BCP, although I've never fully tested that myself. That being said, SSIS defaults to a fairly normal batch size (10K maybe). Like others have indicated though, your server specs will determine performance more than anything.
March 28, 2023 at 10:48 pm
If you are familiar with SSIS, then use that. According to MS and others, SSIS is actually faster than BCP, although I've never fully tested that myself. That being said, SSIS defaults to a fairly normal batch size (10K maybe). Like others have indicated though, your server specs will determine performance more than anything.
That 10K batch size? Yeah... that's part of the reason why you don't get "Minimal Logging" except on the first batch.
And I agree with your assessment about BCP. I didn't suggest BCP... just a BCP format file, which will help both SSIS and BULK INSERT, which both use the same code to do the imports.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 29, 2023 at 6:10 am
Thanks, everyone. I passed the suggestions up the food chain, but they decided to just let it run. Seven hours later...
March 29, 2023 at 5:07 pm
I've just looked at one of our processes - a 16 million row file, 5.5 GB in size, 50 columns (mix of char, date, ints) - loading from a NAS share to the server (code running on the SQL Server itself) - 5 minutes to load it, without any special tuning other than the use of TABLOCK.
March 29, 2023 at 5:50 pm
Thanks, everyone. I passed the suggestions up the food chain, but they decided to just let it run. Seven hours later...
I agree with the others... they're doing something seriously wrong. We don't have enough information to suggest what that "wrong" might be, though. For example, are they loading it into an empty table or one that already has data in it? Does the table have any FKs that point to parent tables? Are there any indexes on the table? Any triggers on the table? Is the data source on the same subnet as the server? Etc., etc., etc.
--Jeff Moden
Change is inevitable... Change for the better is not.