June 26, 2023 at 7:04 pm
Thanks Phil, that suggestion is just what I needed. For some reason, when I was using VBScript I had no trouble streaming the file a line at a time, but when I switched over to .NET, an environment I'm much less familiar with, I discovered the ReadAllLines method, got starry-eyed, and forgot there was a line-by-line alternative.
Yes, I'll try that. Chances are pretty strong it will still be way faster than the old FileSystemObject, even though it's a similar line-by-line approach. Thank you!!!
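For my own reference, here's a minimal sketch of the two approaches as I understand them (C#; the path is invented, and I'm assuming File.ReadLines is the streaming call Phil means):

using System;
using System.IO;

string path = @"C:\data\source.txt";   // invented path, for illustration only

// What I was doing: pull the whole file into memory in one go.
string[] allLines = File.ReadAllLines(path);
Console.WriteLine(allLines.Length);

// What I'll try instead: a lazy enumerable that reads one line
// from disk at a time, so memory use stays flat.
foreach (string line in File.ReadLines(path))
{
    // per-line work goes here
}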
June 27, 2023 at 6:48 am
Glad to help. It would be interesting to hear how your streaming solution performs when compared to other methods, so please post back if you get the chance.
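If you do run a comparison, a Stopwatch around each version gives easy numbers to quote. A quick sketch (the path is invented):

using System;
using System.Diagnostics;
using System.IO;

var sw = Stopwatch.StartNew();

long count = 0;
foreach (string line in File.ReadLines(@"C:\data\source.txt"))   // invented path
{
    count++;   // stand-in for the real per-line work
}

sw.Stop();
Console.WriteLine($"{count:N0} lines in {sw.Elapsed}");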
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
June 27, 2023 at 6:04 pm
So things are progressing in a very encouraging way! I used a line-by-line streaming method, which appears to return something like a lazy collection of strings that can be For Each'ed through. Very happy so far.
The speed is absolutely phenomenal. Usually in less than 2 minutes (!), it streams through all the lines and creates all 7 text files: one with 7,500,000 rows, five with 7,500,001 rows each, and one with the rest (out of 50,311,210 rows in the source file).
I only have two problems to work on. The first I'm sure I'll eventually spot myself, but you're welcome to spot it first 🙂
Output files in the 'middle' (not the first, not the last) each have 1 extra record beyond the desired 'chunk' size. I'm sure it's something simple in my loop, probably the placement of a line, and I'll have it sorted soon. Not too worried about my ability to sort this one.
The second one bewilders the cr(a)p out of me. No matter how many times I stare at the code and the files, or open them in Notepad++, I can't figure it out. Not counting headers: file 1 has 7,500,000 rows, files 2-6 have 7,500,001 each, and the last file has 5,311,160. The total of those is 50,311,165, which is 45 fewer than required.
I've counted the rows in Notepad++, which visibly appears to have no trouble at all rendering the file on screen, and everything looks good. It's a simple pipe-delimited, 3-column file with normal line breaks. Notepad++'s Show Symbol > Show End of Line displays the CRLF on every line, including the ones missing from the chunked output. I've also counted them using a VBScript FileSystemObject TextStream with a Long incrementer, and it agrees with Notepad++.
Thus, I'm pretty certain the counts I've just mentioned are, indeed, the counts in the files.
Either there's a logical problem in my code that I'm looking right at and not seeing, or there's something about the line breaks and the overall StreamReader process that is causing this unexpected result. Any ideas you have for next troubleshooting steps, or a solution, are highly welcome.
I took one of the last couple of lines in the large source file and searched for it in the output files, and it wasn't there. So I'm fairly sure it's literally the last 45 lines of the source file that are not being output where I'd expect them (in the last output file, the one smaller than the max chunk size), nor anywhere else.
I'm sorry to paste something other than true, well-formed code, but this picture will have to do for now, as I'm going back and forth between my virtual work desktop (no clipboard access there) and my home machine's monitor. Here is my .NET code; the Dts variables are only the source file path and the desired chunk size.
No errors; everything visibly appears to be working jolly well, just those dam(n) 45 records missing from the end of the source file.
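In case the picture doesn't come through, here is a rough from-memory approximation of the shape of it (C#; this is NOT the literal code, the names and path are made up, and the real values come in via Dts variables):

using System.IO;

// Rough reconstruction from memory, not the code in the picture.
string sourcePath = @"C:\data\source.txt";   // really a Dts variable
int chunkSize = 7500000;                     // really a Dts variable

int fileNumber = 1;
int linesInChunk = 0;
var writer = new StreamWriter($"chunk_{fileNumber}.txt");

foreach (string line in File.ReadLines(sourcePath))
{
    writer.WriteLine(line);
    linesInChunk++;

    if (linesInChunk >= chunkSize)
    {
        writer.Close();                       // finish the full chunk
        fileNumber++;
        writer = new StreamWriter($"chunk_{fileNumber}.txt");
        linesInChunk = 0;
    }
}
// ...plus whatever end-of-loop cleanup the real code does;
// I'd have to check the picture to be sure.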
July 1, 2023 at 11:29 am
I'd try to understand which rows have been lost. Given the volume of data, this will be a pain, I know. Is it always the same rows which get dropped?
Once you've found them, look at them closely in NP++ with View/Show Symbol/Show all characters turned on. Is there anything special about these rows?
Second thing is to look at the row numbers for the missing rows. Do they occur at the start, middle or end of a batch, or are they random?
If neither of these gives you any clues, I have no idea what you can try next! Perhaps try with lower overall row counts to make debugging faster.
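One more hedged guess, worth checking before anything else: StreamWriter buffers its output, and if the writer for the final file is never closed or disposed before the script ends, the last buffered lines never reach disk. Losing a few dozen rows off the tail of the last file is the classic symptom. In your from-memory sketch above, nothing disposes the writer after the loop; if the real code is the same, that could be your 45 rows. The safe pattern is a using block (or an explicit Dispose in a finally), which guarantees the flush:

using System.IO;

// 'using' guarantees Flush + Close even if an exception is thrown.
using (var writer = new StreamWriter("chunk_7.txt"))   // invented name
{
    writer.WriteLine("final rows land on disk");
}
// buffer flushed and file closed here

As for the extra record in the middle files: your sketch (write the line, then test the counter, then reset it to 0) would actually produce exactly the chunk size in every full file, so the real code must differ from it somewhere. I'd look for a counter reset to 1 instead of 0, or a test that happens before the write.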
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel