February 15, 2014 at 6:56 am
Hi,
I have about 300mb of text files. Each one is named like XXXXXX_XXX.txt, where the 6 character expression is a location code, and the three character expression is the report code.
The hitch is that I have to append all of the like reports together for loading, but tack on the location code as a new column. I have written PowerShell to do it, but on the larger files it runs like a dead pig. A long dead pig. In fact, I think even its grand-pigs are dead.
Has anyone ever had to do something similar? Any advice? The PS code is posted below using the vb mark up because all the others ones made it look weirder than it is, so if perhaps there's a better way to get where I'm going with that, I'm all ears.
Thanks
Get-ChildItem C:\Users\$env:username\Desktop\$mydir |
Where-Object { $_.Name -match "^[^8]*_101_.*\.txt$" -AND $_.length -gt 0 } |
Foreach-Object { $dbn = $_.Name.Substring(0,6)
Get-Content -Path C:\Users\$env:username\Desktop\$mydir\$_ |
select -skip 1 | Foreach-Object { "$_`t${dbn}" |
Out-File C:\Users\$env:username\Desktop\$mydir\101Master.txt -Append} }
$f = Get-ChildItem C:\Users\$env:username\Desktop\Test\ |
Where-Object {$_.Name -match "^[^8]*_101_.*\.txt$" -AND $_.length -gt 0} | Select-Object -First 1
$b = Get-Content C:\Users\$env:username\Desktop\$mydir\$f -totalcount 1
$b = $b -replace '&','_amp'
$b = $b -replace '\$',"SS"
$b = $b -replace "[^a-zA-Z0-9`t]",""
$b = $b -replace "^\d",""
$b = $b -replace "`t\d","`t"
$b = $b -replace "\W\d+","`t"
$b = $b + "`tdbn"
Set-Content C:\Users\$env:username\Desktop\$mydir\Headers\Master101headers_cleaned.txt -value $b
$a = (Get-Content C:\Users\$env:username\Desktop\$mydir\101Master.txt) | ? {$_.trim() -ne "" }
Set-Content C:\Users\$env:username\Desktop\$mydir\Cleaned\Master101_cleaned.txt -value $b, $a
February 15, 2014 at 7:57 am
Something similar that I once worked on, was on a migration process.
I had to change the recovery model to BULK_LOGGED and expanded the TLog file enough so that there was no addition of VLFs.
You can also disable triggers and constraints for faster load.
Regards,
Igor
Igor Micev,My blog: www.igormicev.com
February 15, 2014 at 8:02 am
Hi Igor,
Loading the files isn't the slow part; processing them for loading is. Mainly, I think, adding the location code to each file for tracking.
Thanks
February 15, 2014 at 8:19 am
It's easy to give advises, but the hard part goes to you. You've already put some effort for a PS script to do it.
SSIS is designed for that, and I think you'll surpass it hurtless. Can you switch your approach then? because adding a column to (all in all) 300 MBs txt file is really a sick operation.
Regards,
Igor
Igor Micev,My blog: www.igormicev.com
February 15, 2014 at 8:43 am
I really hate SSIS and was hoping to avoid it. I guess I could try doing this all in another language that handles large file operations better. Of course, since these are all going to garbage staging tables, I could always generate dynamic import statements with extra variables. Hm.
Who cares about filling up transaction logs, anyway?! :w00t:
February 15, 2014 at 8:49 am
Try to avoid the addition of the column into the files. Whatever approach you apply for that operation, it will be slow.
Doing the same in the db engine is pretty smoother.
Idea: Can you do something like after insertion of data from a text file into a table, you do another operation which will add/update that column you're adding to the text files previously.
Regards,
Igor
Igor Micev,My blog: www.igormicev.com
February 15, 2014 at 8:54 am
sqldriver (2/15/2014)
I really hate SSIS and was hoping to avoid it. I guess I could try doing this all in another language that handles large file operations better. Of course, since these are all going to garbage staging tables, I could always generate dynamic import statements with extra variables. Hm.Who cares about filling up transaction logs, anyway?! :w00t:
The db engine does. I don't how it's set up there, but having high number of small increments may slow down the insertions.
Igor Micev,My blog: www.igormicev.com
February 15, 2014 at 8:55 am
Yeah, I think my plan is going to be using PS to generate a bulk insert for every file, naming the tables after the files before the .txt, and immediately afterwards an alter table add column with a default of a substring of the file name. That sounds cleaner to me.
Thanks
February 15, 2014 at 8:57 am
sqldriver (2/15/2014)
Yeah, I think my plan is going to be using PS to generate a bulk insert for every file, naming the tables after the files before the .txt, and immediately afterwards an alter table add column with a default of a substring of the file name. That sounds cleaner to me.Thanks
I'll be gladly waiting for the improving results 🙂
Regards,
Igor
Igor Micev,My blog: www.igormicev.com
February 16, 2014 at 6:03 pm
Hi
I've never compared my processes against a power shell equivalent (or anything else for that matter:-)), but I have some large text files (around 1.5 GB) that I pre-process before I load them. I use PERL for this. It handles regular expressions and large files without any real issues. Essentially my process does something similar to yours. Read in a line (or more), run some replaces then write to a new file.
I would give you some timings for my conversions, but I don't time them individually and they are also working on zipped files adding a bit more overhead.
I can't say PERL would be quicker, but it could be worth looking at.
February 16, 2014 at 6:44 pm
sqldriver (2/15/2014)
I really hate SSIS and was hoping to avoid it. I guess I could try doing this all in another language that handles large file operations better. Of course, since these are all going to garbage staging tables, I could always generate dynamic import statements with extra variables. Hm.Who cares about filling up transaction logs, anyway?! :w00t:
Heh... I'm right there with you on the subject of SSIS especially if all of these files have the same basic format. Even if they're different by report type, it's still not difficult to do with the likes of BULK INSERT written as a bit of dynamic SQL. You can use xp_Dirtree 'drive-or-unc-path',1,1 to get a list of files and store them in a table.
There's just no need for PoSH for this type of stuff and BULK INSERT is nasty fast.
How many different report codes do you have and how many files do you have for each report code? Is is safe to assume that each file that has the same report code will all be in the same format?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 17, 2014 at 7:37 am
sqldriver (2/15/2014)
I really hate SSIS and was hoping to avoid it. I guess I could try doing this all in another language that handles large file operations better. Of course, since these are all going to garbage staging tables, I could always generate dynamic import statements with extra variables. Hm.Who cares about filling up transaction logs, anyway?! :w00t:
SSIS is perfect for this using a derived column that looks at the file name. I could have set it up in about the time it took you to post.
February 17, 2014 at 8:22 am
Jeff Moden (2/16/2014)
sqldriver (2/15/2014)
I really hate SSIS and was hoping to avoid it. I guess I could try doing this all in another language that handles large file operations better. Of course, since these are all going to garbage staging tables, I could always generate dynamic import statements with extra variables. Hm.Who cares about filling up transaction logs, anyway?! :w00t:
Heh... I'm right there with you on the subject of SSIS especially if all of these files have the same basic format. Even if they're different by report type, it's still not difficult to do with the likes of BULK INSERT written as a bit of dynamic SQL. You can use xp_Dirtree 'drive-or-unc-path',1,1 to get a list of files and store them in a table.
There's just no need for PoSH for this type of stuff and BULK INSERT is nasty fast.
How many different report codes do you have and how many files do you have for each report code? Is is safe to assume that each file that has the same report code will all be in the same format?
There are six report types for ~75 locations. All of the report types have the same layout, which makes this easier. The only issue is that the headers have wonky characters in them that I'm going to have to deal with at some point before loading. I think doing that bit of pre-processing with PS should be fine; it's just one line, even if it is ~75 times.
The thing that makes this situation a little weird is that the instance I'm loading them to lives on AWS :doze: which is why I started down the PS road in the first place: it has a cmdlet to upload files right to S3. It seemed like an awesome idea to append all the files together with some bells and whistles to skip lines and select only one line and etc. and so forth rather than load each file to a staging table and then run a larger insert to a combined staging table for each report type.
I wrote a poem about my experience:
Here I sit, broken hearted
Ran my PowerShell script
Hours later, had to abort it
February 17, 2014 at 8:32 am
mickyT (2/16/2014)
HiI've never compared my processes against a power shell equivalent (or anything else for that matter:-)), but I have some large text files (around 1.5 GB) that I pre-process before I load them. I use PERL for this. It handles regular expressions and large files without any real issues. Essentially my process does something similar to yours. Read in a line (or more), run some replaces then write to a new file.
I would give you some timings for my conversions, but I don't time them individually and they are also working on zipped files adding a bit more overhead.
I can't say PERL would be quicker, but it could be worth looking at.
I prefer to look at PERL through a telescope. With my eyes closed and my fingers jammed in my ears 😀
February 17, 2014 at 10:48 am
sqldriver (2/17/2014)
mickyT (2/16/2014)
HiI've never compared my processes against a power shell equivalent (or anything else for that matter:-)), but I have some large text files (around 1.5 GB) that I pre-process before I load them. I use PERL for this. It handles regular expressions and large files without any real issues. Essentially my process does something similar to yours. Read in a line (or more), run some replaces then write to a new file.
I would give you some timings for my conversions, but I don't time them individually and they are also working on zipped files adding a bit more overhead.
I can't say PERL would be quicker, but it could be worth looking at.
I prefer to look at PERL through a telescope. With my eyes closed and my fingers jammed in my ears 😀
Fair enough 🙂
Viewing 15 posts - 1 through 15 (of 23 total)
You must be logged in to reply to this topic. Login to reply