Using regular expressions and -split to parse a log file

  • Hello,

    I am having the hardest time with the script below. I have a log file that contains text like this:

    2/16/2012 12:27:32 AM -- Found file 1776ELIG_20120215.txt for Processing -- Group 1776
    2/16/2012 12:27:32 AM -- File Copied -- Group 1776
    2/16/2012 12:27:38 AM -- 27376 Records IN: 0 A, 0 E, 0 N, 0 F -- Group 1776
    2/16/2012 12:37:42 AM -- 0 Errors found in Eligiblity Load -- Group 1776

    I want to get the results from my script to come back in the format:

    DateTime               FileName               GroupNumber  Errors
    ---------------------  ---------------------  -----------  ------
    2/16/2012 12:27:32 AM  1776ELIG_20120215.txt  1776         0

    Below is the script I currently have.

    $path = pushd C:\Scripts -PassThru
    $path
    $eligibility = @()
    $data = Get-Content -Path "resultsfile.txt"
    for ($i = 0; $i -lt $data.Length; $i += 3)
    {
        $name = ($data[$i] -split 'Found file ')[1]
        $group = ($data[$i] -split 'Processing -- Group ')[1]
        $DateTime = ($data[$i] -split '[ -- ]')[1]
        $eligibility += New-Object PSObject -Property @{
            FileName    = $name
            GroupNumber = $group
            DateTime    = $DateTime
        }
    }
    $eligibility | Format-Table

    BTW, I have looked at using -match but don't know how to implement it with my current script (-split).

    This works for date pattern recognition:

    "2/16/2012 12:42:55 AM" -match "\b\d\D\d\d\D\d\d\d\d\s\d\d\S\d\d\S\d\d\s\D\D\b"

    Any assistance is greatly appreciated!

    Sean

    A clever person solves a problem. A wise person avoids it. ~ Einstein
    select cast (0x5365616E204465596F756E67 as varchar(128))

  • This may not be the optimal solution, but it may be a start ...

    for ($i = 0; $i -lt $data.Length; $i += 4)
    {
        $name = ($data[$i] -split 'Found file ')[1]
        $group = ($data[$i] -split 'Processing -- Group ')[1]
        $DateTime = ($data[$i+3] -split ' -- ')[0]
        $obj = New-Object PSObject
        $obj | Add-Member -Name Name -Type NoteProperty -Value $name
        $obj | Add-Member -Name Group -Type NoteProperty -Value $group
        $obj | Add-Member -Name DateTime -Type NoteProperty -Value $DateTime
        $eligibility += , $obj
    }
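
    (This assumes $data and $eligibility were already initialized as in your original script, i.e.:)

        $eligibility = @()
        $data = Get-Content -Path "resultsfile.txt"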

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I worked on this for a minute too. Since you're looking for subsequent lines to add value to your object's row, it's back to that nasty loop logic... see if this helps:

    $path = pushd "C:\@\SQL_Scripts\DBA\SQL Server\PowerShell"

    [string]$delimiter = "--"

    $path

    $eligibility = @()

    $data = Get-Content -Path "file.log"

    foreach($line in $data)

    {

    $splitLine = $line.Split($delimiter, [System.StringSplitOptions]::RemoveEmptyEntries)

    if($splitLine[1].StartsWith(' Found file'))

    {

    $dateTime = $splitLine[0]

    $fileName = $splitLine[1].Substring(12).Trim()

    $fileName = $fileName.Split(" ")[0]

    $group = $splitLine[2].Substring(7)

    }

    if($splitLine[1].Contains('Errors found'))

    {

    $errors = $splitLine[1].Trim().Split(" ")[0]

    $eligibility += New-Object PSObject -Property @{

    FileName = $fileName

    GroupNumber = $group

    DateTime= $dateTime

    Errors = $errors

    }

    }

    }

    $eligibility | Format-Table
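
    If you'd rather lean on -match, the same pickup can be done with named capture groups. Here is a sketch along the same lines, reusing the $data and $eligibility setup above (the capture-group names are just illustrative):

        foreach ($line in $data)
        {
            # The "Found file" line carries the timestamp, file name, and group number
            if ($line -match 'Found file (?<file>\S+) for Processing -- Group (?<grp>\d+)')
            {
                $dateTime = ($line -split ' -- ')[0]
                $fileName = $Matches['file']
                $group    = $Matches['grp']
            }
            # The "Errors found" line closes out that group's run
            elseif ($line -match '-- (?<err>\d+) Errors found')
            {
                $eligibility += New-Object PSObject -Property @{
                    FileName    = $fileName
                    GroupNumber = $group
                    DateTime    = $dateTime
                    Errors      = $Matches['err']
                }
            }
        }
        $eligibility | Format-Table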

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Thank you both so much for your help!

    SSCrazy, that script is genius. I will learn a lot from it. Life saver!

    A clever person solves a problem. A wise person avoids it. ~ Einstein
    select cast (0x5365616E204465596F756E67 as varchar(128))

  • You guys appear to have a great grasp on working with regular expressions and PowerShell. Any resources (books, articles, etc.) that you can recommend?

    BTW, SSCrazy, that was cool how you noticed I was pulling redundant lines from my source and only used the two that contained the data I needed.

    Thanks!

    A clever person solves a problem. A wise person avoids it. ~ Einstein
    select cast (0x5365616E204465596F756E67 as varchar(128))

  • Have a look at this little article for the refs: http://www.sqlservercentral.com/articles/powershell/76405/

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I've not checked any of the others' code (I'm not real good at reading PowerShell code), but make sure they can handle multiple days.

    Just because it's my nature, I thought I'd throw in a T-SQL-only solution. The following code will handle multiple days and groups, but only if each group's run occurs only once a day and the complete "set" of rows for each group occurs within the same day. Let us know if you need otherwise.

    --===== Conditionally drop the test table to make reruns easier
         IF OBJECT_ID('tempdb..#TempLog','U') IS NOT NULL
            DROP TABLE #TempLog
    ;
    --===== Simulate loading the log file into a table
     CREATE TABLE #TempLog
            (LogEntry VARCHAR(8000))
    ;
     INSERT INTO #TempLog
            (LogEntry)
     SELECT '2/16/2012 12:27:32 AM -- Found file 1776ELIG_20120215.txt for Processing -- Group 1776' UNION ALL
     SELECT '2/16/2012 12:27:32 AM -- File Copied -- Group 1776' UNION ALL
     SELECT '2/16/2012 12:27:38 AM -- 27376 Records IN: 0 A, 0 E, 0 N, 0 F -- Group 1776' UNION ALL
     SELECT '2/16/2012 12:37:42 AM -- 0 Errors found in Eligiblity Load -- Group 1776' UNION ALL
     SELECT '2/16/2012 01:27:32 AM -- Found file 99DODAH_20120215.txt for Processing -- Group 99' UNION ALL
     SELECT '2/16/2012 01:27:32 AM -- File Copied -- Group 99' UNION ALL
     SELECT '2/16/2012 01:27:38 AM -- 12 Records IN: 0 A, 0 E, 0 N, 0 F -- Group 99' UNION ALL
     SELECT '2/16/2012 01:37:42 AM -- 2 Errors found in Eligiblity Load -- Group 99' UNION ALL
     SELECT '2/17/2012 12:28:30 AM -- Found file 1776ELIG_20120216.txt for Processing -- Group 1776' UNION ALL
     SELECT '2/17/2012 12:28:30 AM -- File Copied -- Group 1776' UNION ALL
     SELECT '2/17/2012 12:28:30 AM -- 27384 Records IN: 0 A, 0 E, 0 N, 0 F -- Group 1776' UNION ALL
     SELECT '2/17/2012 12:38:40 AM -- 1 Errors found in Eligiblity Load -- Group 1776' UNION ALL
     SELECT '2/17/2012 01:27:32 AM -- Found file 99DODAH_20120216.txt for Processing -- Group 99' UNION ALL
     SELECT '2/17/2012 01:27:32 AM -- File Copied -- Group 99' UNION ALL
     SELECT '2/17/2012 01:27:38 AM -- 12 Records IN: 0 A, 0 E, 0 N, 0 F -- Group 99' UNION ALL
     SELECT '2/17/2012 01:37:42 AM -- 0 Errors found in Eligiblity Load -- Group 99'
    ;
    --===== Solve the problem for multiple days and groups
       WITH cteFindAndParse AS
    (
     SELECT [DateTime]    = SUBSTRING(LogEntry,1,21),
            [FileName]    = CASE
                               WHEN SUBSTRING(LogEntry,26,10) = 'Found file'
                               THEN SUBSTRING(LogEntry,37,CHARINDEX(' for',LogEntry)-37)
                            END,
            [GroupNumber] = SUBSTRING(LogEntry,CHARINDEX('-- Group',LogEntry,37)+9,8000),
            [Errors]      = CASE
                               WHEN SUBSTRING(LogEntry,26,10) <> 'Found file'
                               THEN SUBSTRING(LogEntry,CHARINDEX('--',LogEntry)+3,8000)
                            END
       FROM #TempLog
      WHERE SUBSTRING(LogEntry,26,10) = 'Found file'
         OR LogEntry LIKE '%Errors found%'
    )
     SELECT [DateTime]    = MIN([DateTime]),
            [FileName]    = MAX([FileName]),
            [GroupNumber],
            [Errors]      = MAX(SUBSTRING([Errors],1,CHARINDEX(' Errors',[Errors])-1))
       FROM cteFindAndParse
      GROUP BY DATEDIFF(dd,0,[DateTime]),[GroupNumber]
      ORDER BY [DateTime],[GroupNumber]
    ;

    If I were doing this "for real", I might be tempted to create the TempLog table with a couple of calculated columns so the calculations would be done at "load time" and, depending on the number of log entries, might even be made to take an index to help resolve the sorts being done (although the ORDER BY sort is actually on a comparatively minuscule number of rows).
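
    Just to illustrate that thought, something like this sketch (the computed column definitions are illustrative only, and the date conversion assumes US-style dates as in the log):

     CREATE TABLE #TempLog
            (
            LogEntry    VARCHAR(8000),
            LogDateTime AS CONVERT(DATETIME, SUBSTRING(LogEntry,1,21)),                 --Parsed once, at "load time"
            GroupNumber AS SUBSTRING(LogEntry,CHARINDEX('-- Group',LogEntry,37)+9,8000) --Ditto
            )
    ;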

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TeraByteMe (2/17/2012)


    You guys appear to have a great grasp on working with Regular Expressions and PowerShell. Any resources (books, articles, etc...) that you can recommend?

    I have a fair amount of .NET experience behind me, and the regular expressions allowed in PowerShell are a natural extension of those supported in .NET. If you look hard enough at PowerShell, it's possible to see it as a thin veneer on top of .NET... and a very useful and powerful one at that, given the lack of a compilation requirement.
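
    For instance, the .NET Regex class can be called straight from PowerShell (just to illustrate the point with one of the log lines above):

        # .NET's Regex type, used directly from PowerShell
        [System.Text.RegularExpressions.Regex]::Match(
            '2/16/2012 12:27:32 AM -- Found file 1776ELIG_20120215.txt for Processing -- Group 1776',
            'Found file (\S+)'
        ).Groups[1].Value    # returns 1776ELIG_20120215.txt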

    Sorry, I do not have any learning references to share for regular expressions, but I will say this: the best way to learn PowerShell is to use it, experiment with it, read about it online, and work problems on this (or another) online forum. That has worked well for me over the last year and a half.

    BTW, SSCrazy that was cool how you even noticed I was pulling redundant lines in my source and only used the two that contained the data I needed.

    Happy to help. Not that it does not apply at times, but SSCrazy is just my designation on the site based on how many points I have earned from posting and answering questions.

    Your problem is a classic one that IMO is not elegantly solved in current versions of T-SQL. The window frames coming out in SQL Server 2012 may make a scalable T-SQL solution with an elegant and concise syntax possible. For now, and possibly even after window frames are released and become mainstream, I would stick with PowerShell and avoid the transaction log and other I/O overhead associated with bringing log files into a T-SQL context.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Sorry about the name confusion :-S. I must say that both the solutions (PowerShell & T-SQL) are very elegantly done.

    I do like PowerShell because it is so integrated with .NET and its libraries. However, T-SQL appears to be more integrated with SSIS, but I suppose you could use both in combination in a package. The thing is, there seem to be so many ways to go and combinations to use. I could use stored procedures to update data in other tables based off the captured log data, executing them through PS, or I could just do the logic and processing through SqlCommand statements via PS. I have to be thinking about performance concerns also, because it won't be just one log file; I will potentially be capturing upwards of hundreds to the table.
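
    Something like this rough sketch is what I am picturing for the SqlCommand route (I made up the server, database, and table names):

        # Sketch only: push the parsed rows into a hypothetical dbo.EligibilityLog table
        $connStr = 'Server=MYSERVER;Database=MyDb;Integrated Security=SSPI'   # placeholder
        $conn = New-Object System.Data.SqlClient.SqlConnection($connStr)
        $conn.Open()
        foreach ($row in $eligibility)
        {
            $cmd = $conn.CreateCommand()
            $cmd.CommandText = 'INSERT INTO dbo.EligibilityLog
                                (LogDateTime, FileName, GroupNumber, Errors)
                                VALUES (@dt, @fn, @grp, @err)'
            [void]$cmd.Parameters.AddWithValue('@dt',  $row.DateTime)
            [void]$cmd.Parameters.AddWithValue('@fn',  $row.FileName)
            [void]$cmd.Parameters.AddWithValue('@grp', $row.GroupNumber)
            [void]$cmd.Parameters.AddWithValue('@err', $row.Errors)
            [void]$cmd.ExecuteNonQuery()
        }
        $conn.Close()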

    Lots to consider. Can't say I am really strong in programming with either T-SQL or PowerShell, but you guys' examples are getting me a lot further than I was. Thanks so much!

    Sean

    A clever person solves a problem. A wise person avoids it. ~ Einstein
    select cast (0x5365616E204465596F756E67 as varchar(128))

  • opc.three (2/18/2012)


    Your problem is a classic one that IMO is not elegantly solved in current versions of T-SQL. The window frames coming out in SQL Server 2012 may make a scalable T-SQL solution with an elegant and concise syntax possible. For now, and possibly even after window frames are released and become mainstream, I would stick with PowerShell and avoid the transaction log and other I/O overhead associated with bringing log files into a T-SQL context.

    Heh... of course I'll remind you of a thing called "minimal logging" which could certainly be used here. So far as "elegant" goes, I had one in my living room once. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TeraByteMe (2/18/2012)


    Sorry about the name confusion :-S. I must say that both the solutions (PowerShell & T-SQL) are very elegantly done.

    You are correct about each code solution's elegance, for what is shown. To elaborate, I was speaking in terms of a scalable T-SQL solution. Jeff only shared a T-SQL solution that will hold up when processing a limited number of rows. A T-SQL solution that will scale up to process a log file with millions of rows will not appear as elegant, whereas the PowerShell solution will scale as-is.

    I do like PowerShell because it is so integrated with .NET and its libraries. However, T-SQL appears to be more integrated with SSIS, but I suppose you could use both in combination in a package. The thing is, there seem to be so many ways to go and combinations to use. I could use stored procedures to update data in other tables based off the captured log data, executing them through PS, or I could just do the logic and processing through SqlCommand statements via PS. I have to be thinking about performance concerns also, because it won't be just one log file; I will potentially be capturing upwards of hundreds to the table.

    SSIS is much like PowerShell in that it too can be seen as a thin veneer on top of .NET. SSIS has many functions that make it easy to get data into and out of SQL Server, all of which are implemented in .NET. There is also a very slick add-on for SSIS that allows you to code script components in PowerShell, a very handy tool that I have used: http://www.youdidwhatwithtsql.com/powershell-script-task-for-ssis/488

    Lots to consider. Can't say I am really strong in programming with either T-SQL or PowerShell, but you guys' examples are getting me a lot further than I was. Thanks so much!

    Indeed, lots to consider. T-SQL and PowerShell both have their rightful place in this world; just make sure you define the project's requirements and carefully consider your goals in terms of long-term usability and maintainability. Only then can you choose the right tool for the job at hand.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Jeff Moden (2/18/2012)


    opc.three (2/18/2012)


    Your problem is a classic one that IMO is not elegantly solved in current versions of T-SQL. The window frames coming out in SQL Server 2012 may make a scalable T-SQL solution with an elegant and concise syntax possible. For now, and possibly even after window frames are released and become mainstream, I would stick with PowerShell and avoid the transaction log and other I/O overhead associated with bringing log files into a T-SQL context.

    Heh... of course I'll remind you of a thing called "minimal logging" which could certainly be used here. So far as "elegant" goes, I had one in my living room once. 😉

    Sure, for logging, but minimal != none, and that is not to mention the additional data committed to the data file that is unnecessary in the presented problem case.

    I am not sure I get your "living room" comment :ermm:

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Great advice! Again, thanks so much!

    opc.three (2/18/2012)


    TeraByteMe (2/18/2012)


    Sorry about the name confusion :-S. I must say that both the solutions (PowerShell & T-SQL) are very elegantly done.

    You are correct about each code solution's elegance, for what is shown. To elaborate, I was speaking in terms of a scalable T-SQL solution. Jeff only shared a T-SQL solution that will hold up when processing a limited number of rows. A T-SQL solution that will scale up to process a log file with millions of rows will not appear as elegant, whereas the PowerShell solution will scale as-is.

    I do like PowerShell because it is so integrated with .NET and its libraries. However, T-SQL appears to be more integrated with SSIS, but I suppose you could use both in combination in a package. The thing is, there seem to be so many ways to go and combinations to use. I could use stored procedures to update data in other tables based off the captured log data, executing them through PS, or I could just do the logic and processing through SqlCommand statements via PS. I have to be thinking about performance concerns also, because it won't be just one log file; I will potentially be capturing upwards of hundreds to the table.

    SSIS is much like PowerShell in that it too can be seen as a thin veneer on top of .NET. SSIS has many functions that make it easy to get data into and out of SQL Server, all of which are implemented in .NET. There is also a very slick add-on for SSIS that allows you to code script components in PowerShell, a very handy tool that I have used: http://www.youdidwhatwithtsql.com/powershell-script-task-for-ssis/488

    Lots to consider. Can't say I am really strong in programming with either T-SQL or PowerShell, but you guys' examples are getting me a lot further than I was. Thanks so much!

    Indeed, lots to consider. T-SQL and PowerShell both have their rightful place in this world; just make sure you define the project's requirements and carefully consider your goals in terms of long-term usability and maintainability. Only then can you choose the right tool for the job at hand.

    A clever person solves a problem. A wise person avoids it. ~ Einstein
    select cast (0x5365616E204465596F756E67 as varchar(128))

  • opc.three (2/18/2012)


    Jeff Moden (2/18/2012)


    opc.three (2/18/2012)


    Your problem is a classic one that IMO is not elegantly solved in current versions of T-SQL. The window frames coming out in SQL Server 2012 may make a scalable T-SQL solution with an elegant and concise syntax possible. For now, and possibly even after window frames are released and become mainstream, I would stick with PowerShell and avoid the transaction log and other I/O overhead associated with bringing log files into a T-SQL context.

    Heh... of course I'll remind you of a thing called "minimal logging" which could certainly be used here. So far as "elegant" goes, I had one in my living room once. 😉

    Sure, for logging, but minimal != none, and that is not to mention the additional data committed to the data file that is unnecessary in the presented problem case.

    I am not sure I get your "living room" comment :ermm:

    I agree that minimal logging <> no logging, but it is very much reduced (minuscule, actually) compared to normal logging. And, although BULK INSERT won't do it, OPENROWSET can certainly prevent unwanted rows from being loaded into the table if that's really needed.

    Shifting gears, my question remains... I'm not very good at reading PS code. Does the PS code presented on this thread so far group by dates as the T-SQL solution does?

    For the living room comment, my apologies. It was my rebuttal to your implying that a T-SQL solution wasn't "elegant". Substitute the name of an animal that sounds like "elegant". 😀 It was meant to imply that one man's elegant solution is something more ominous to another. As you'll recall, I'm one of those folks who would rather stay within the confines of T-SQL if at all possible ("possible" means with accuracy and good performance), especially when the data could be saved in a table for future reporting. Yes, I agree, the output of the PS script could be sent to a SQL Server table, but I typically try to cut out such "middle men" for multiple reasons. I also know that you don't agree with many of those reasons, and that's OK, but I thought I'd post a T-SQL-only solution in case someone else might agree with those reasons.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/19/2012)


    I agree that minimal logging <> no logging, but it is very much reduced (minuscule, actually) compared to normal logging. And, although BULK INSERT won't do it, OPENROWSET can certainly prevent unwanted rows from being loaded into the table if that's really needed.

    I disable OPENROWSET on all my systems where possible to reduce the attackable surface area of my instances, but that's a topic for another thread, I think.

    Shifting gears, my question remains... I'm not very good at reading PS code. Does the PS code presented on this thread so far group by dates as the T-SQL solution does?

    I believe it does. It does not parse and then group as the T-SQL does; it collects results as it goes, so all lines in the file are visited once (i.e., no expensive GROUP BY sort or need for supporting indexes). Here are the results of the PowerShell script using your sample log file:

    Here are the T-SQL results (column order changed to match PS for an easier eye-compare):
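
    If an explicit grouping check were ever wanted on the PowerShell side, say to confirm one row per group per day, Group-Object could do it without any supporting index. A quick sketch, assuming $eligibility is populated as above:

        # Group the collected rows by group number and calendar day
        $eligibility |
            Group-Object GroupNumber, { ([datetime]$_.DateTime).Date } |
            Select-Object Count, Name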

    For the living room comment, my apologies. It was my rebuttal to your implying that a T-SQL solution wasn't "elegant". Substitute the name of an animal that sounds like "elegant". 😀

    Ahh, I get it now; forgive me for missing that one. I have seen tons of inelegant T-SQL solutions, and I am not just talking about the ones I wrote years ago that I now look at and say "who on Earth wrote that?" 😛 I see tons of elegant T-SQL solutions as well. All I was saying was that IMO a T-SQL solution that will scale to solve this particular problem is going to end up being verbose and difficult to maintain, i.e. inelegant, and more importantly will be more expensive in terms of I/O.

    It was meant to imply that one man's elegant solution is something more ominous to another.

    No doubt about it. At the end of the day, "technical" solutions tend to be developed based on notions that those not responsible for setting the direction may view as bias... the plight of an architect, I suppose. As long as technical decisions can be defended with reason and hard numbers, they are to be respected.

    As you'll recall, I'm one of those folks who would rather stay within the confines of T-SQL if at all possible ("possible" means with accuracy and good performance), especially when the data could be saved in a table for future reporting. Yes, I agree, the output of the PS script could be sent to a SQL Server table, but I typically try to cut out such "middle men" for multiple reasons. I also know that you don't agree with many of those reasons, and that's OK, but I thought I'd post a T-SQL-only solution in case someone else might agree with those reasons.

    I do know this, and that's fine; there are more ways of killing a dog than choking him with pudding. In this case I think a pure T-SQL solution is misplaced in dealing with this use case [as stated]; again, just my opinion.

    Edit: fixing images, session issues on the site

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato
