Select a string when data in the string varies

  • WayneS (4/14/2010)


    bill.brazell (4/14/2010)


    This is looking very promising. Thank you.

    As for the output, I'd prefer that it not include the name of the delimiter, only the value, i.e. "Ping" instead of "Cmd=Ping". Is this doable? Thanks again!

    Yes, it is. And this is a perfect example of why you should have included the expected results in your original request... it would have been done right the first time instead of having to come back to it. Please read the first link in my signature for how to post data for faster, tested results that do what you want, and be sure to include how you want the results to look.

    Anyway, change the last select to this:

    SELECT t1.date, t1.time,
           [Cmd]        = MAX(CASE WHEN LEFT(Item,4)  = 'Cmd='        THEN SUBSTRING(Item, 5, 50)  ELSE NULL END),
           [User]       = MAX(CASE WHEN LEFT(Item,5)  = 'User='       THEN SUBSTRING(Item, 6, 50)  ELSE NULL END),
           [DeviceID]   = MAX(CASE WHEN LEFT(Item,9)  = 'DeviceID='   THEN SUBSTRING(Item, 10, 50) ELSE NULL END),
           [DeviceType] = MAX(CASE WHEN LEFT(Item,11) = 'DeviceType=' THEN SUBSTRING(Item, 12, 50) ELSE NULL END),
           [Log]        = MAX(CASE WHEN LEFT(Item,4)  = 'Log='        THEN SUBSTRING(Item, 5, 50)  ELSE NULL END)
      FROM #mytable t1
     CROSS APPLY DelimitedSplit(csuriquery, @Delimiter) t2
     GROUP BY t1.date, t1.time
     ORDER BY t1.date, t1.time

    Wayne... just split on the "=" sign using SUBSTRING and CHARINDEX. It doesn't need to be Pivoted using a CrossTab to be used very effectively.

    As for the data... it looks like a stream from some form of automation and probably can't be changed.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/14/2010)


    Wayne... just split on the "=" sign using SUBSTRING and CHARINDEX. It doesn't need to be Pivoted using a CrossTab to be used very effectively.

    I utilized the method I did for these reasons:

    1. The delimited data being stripped out and being put into columns is in random order (and we don't know that all 5 "columns" will be in every string, though it is in the test data).

    2. Since I'm already having to test to see which column the data belongs to, at that point it would be faster to use the hard-coded literal in the SubString call instead of having to calculate it in the CharIndex one... since I know it, just use it and save a clock cycle or two.

    That's why I did it that way. Now, I'm interested in seeing what you had in mind (especially the part about not needing to be pivoted) - I'm just not seeing how to do this any other way. So... if you don't mind teaching, I'm ready to learn!

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (4/14/2010)


    Jeff Moden (4/14/2010)


    Wayne... just split on the "=" sign using SUBSTRING and CHARINDEX. It doesn't need to be Pivoted using a CrossTab to be used very effectively.

    I utilized the method I did for these reasons:

    1. The delimited data being stripped out and being put into columns is in random order (and we don't know that all 5 "columns" will be in every string, though it is in the test data).

    2. Since I'm already having to test to see which column the data belongs to, at that point it would be faster to use the hard-coded literal in the SubString call instead of having to calculate it in the CharIndex one... since I know it, just use it and save a clock cycle or two.

    That's why I did it that way. Now, I'm interested in seeing what you had in mind (especially the part about not needing to be pivoted) - I'm just not seeing how to do this any other way. So... if you don't mind teaching, I'm ready to learn!

    Sorry... got home from work at 1 this morning... wee bit of a crunch needless to say. Just checking for "quick" responses with morning coffee before work... I'll have to get to this tonight after work. The code is easy... explaining it will not be.

    --Jeff Moden



  • Here's what I was talking about to simplify the split at the "=" sign...

    --=================================================================================================
    -- Test table setup. This is NOT a part of the solution.
    --=================================================================================================
    --===== Conditionally drop the test table to make reruns easier
    IF OBJECT_ID('TempDB..#YourTable','U') IS NOT NULL
       DROP TABLE #YourTable
    ;
    GO
    --===== Create the test table
    CREATE TABLE #YourTable
            (
            Date       NCHAR(10)      NOT NULL,
            Time       NCHAR(8)       NOT NULL,
            CsuriQuery NVARCHAR(2000) NOT NULL
            )
    ;
    --===== Populate the test table with test data
    INSERT INTO #YourTable
            (Date, Time, CsuriQuery)
    SELECT '2010-04-08','00:00:02','Cmd=Ping&User=User1_ID&DeviceId=PALM24854ace5ac4e080c918b7213ce4&DeviceType=Palm&Log=V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_' UNION ALL
    SELECT '2010-04-08','00:00:03','User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_' UNION ALL
    SELECT '2010-04-08','00:00:04','NewParameter=Whatever&User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_'
    ;
    --=================================================================================================
    -- Split the parameter names and values into an "NVP" (Name/Value Pair) table.
    -- Yes... it's denormalized but it will "auto-magically" capture new/additional parameters
    -- without any changes to the code as it does for line "3" in the test data above.
    --=================================================================================================
    --===== Finish the split at the "=" sign to put the ParameterName and ParameterValue in
         -- separate columns.
     SELECT yt1.Date,
            yt1.Time,
            SUBSTRING(ca.Item, 1, CHARINDEX(N'=',ca.Item)-1)    AS ParameterName,
            SUBSTRING(ca.Item, CHARINDEX(N'=',ca.Item)+1, 4000) AS ParameterValue
       FROM #YourTable yt1
      CROSS APPLY
            (--==== This does the main split returning parameter names and values separated by "="
             SELECT yt2.Item
               FROM dbo.DelimitedSplit8K(yt1.CsuriQuery, N'&') yt2
            ) ca
      ORDER BY yt1.Date, yt1.Time, ParameterName

    Of course, the details are in the comments in the code. 😉

    If you've just gotta have it all horizontal and all normalized and all, then to keep the ability to "auto-magically" handling new/different parameters you'd have to use a little dynamic SQL to make a dynamic Cross-Tab. If you don't mind losing that ability, then the additional split in the code above still makes doing a static Cross-Tab much easier...

    --=================================================================================================
    -- Test table setup. This is NOT a part of the solution. (Same as previous example)
    --=================================================================================================
    --===== Conditionally drop the test table to make reruns easier
    IF OBJECT_ID('TempDB..#YourTable','U') IS NOT NULL
       DROP TABLE #YourTable
    ;
    GO
    --===== Create the test table
    CREATE TABLE #YourTable
            (
            Date       NCHAR(10)      NOT NULL,
            Time       NCHAR(8)       NOT NULL,
            CsuriQuery NVARCHAR(2000) NOT NULL
            )
    ;
    --===== Populate the test table with test data
    INSERT INTO #YourTable
            (Date, Time, CsuriQuery)
    SELECT '2010-04-08','00:00:02','Cmd=Ping&User=User1_ID&DeviceId=PALM24854ace5ac4e080c918b7213ce4&DeviceType=Palm&Log=V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_' UNION ALL
    SELECT '2010-04-08','00:00:03','User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_' UNION ALL
    SELECT '2010-04-08','00:00:04','NewParameter=Whatever&User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_'
    ;
    --=================================================================================================
    -- Split the parameter names and values into an "NVP" (Name/Value Pair) table.
    -- Yes... it's denormalized but it will "auto-magically" capture new/additional parameters
    -- without any changes to the code as it does for line "3" in the test data above.
    -- Changed the SELECT to a CTE so we can use it to drive the Cross-Tab.
    --=================================================================================================
    WITH
    cteSplit AS
    (--==== Finish the split at the "=" sign to put the ParameterName and ParameterValue in
         -- separate columns.
     SELECT yt1.Date,
            yt1.Time,
            SUBSTRING(ca.Item, 1, CHARINDEX(N'=',ca.Item)-1)    AS ParameterName,
            SUBSTRING(ca.Item, CHARINDEX(N'=',ca.Item)+1, 4000) AS ParameterValue
       FROM #YourTable yt1
      CROSS APPLY
            (--==== This does the main split returning parameter names and values separated by "="
             SELECT yt2.Item
               FROM dbo.DelimitedSplit8K(yt1.CsuriQuery, N'&') yt2
            ) ca
    )
     SELECT Date,
            Time,
            MAX(CASE WHEN ParameterName = 'Cmd'          THEN ParameterValue END) AS Cmd,
            MAX(CASE WHEN ParameterName = 'DeviceId'     THEN ParameterValue END) AS DeviceId,
            MAX(CASE WHEN ParameterName = 'DeviceType'   THEN ParameterValue END) AS DeviceType,
            MAX(CASE WHEN ParameterName = 'Log'          THEN ParameterValue END) AS [Log],
            MAX(CASE WHEN ParameterName = 'User'         THEN ParameterValue END) AS [User],
            MAX(CASE WHEN ParameterName = 'NewParameter' THEN ParameterValue END) AS NewParameter
       FROM cteSplit
      GROUP BY Date, Time
      ORDER BY Date, Time
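
    For completeness, the dynamic Cross-Tab alluded to above could be sketched roughly like the following. This is an untested outline, not part of the original post: it assumes the Name/Value pairs from the split have been materialized into a temp table I'm calling #NVP (Date, Time, ParameterName, ParameterValue), and it builds one MAX(CASE...) column per distinct parameter name actually present in the data.

    ```sql
    --===== Hypothetical sketch: #NVP is assumed to hold the output of the split above.
    DECLARE @SQL NVARCHAR(MAX)
    ;
    --===== Build one MAX(CASE...) column per distinct parameter name found in the data
     SELECT @SQL = ISNULL(@SQL + ',', '')
                 + NCHAR(10)
                 + '       MAX(CASE WHEN ParameterName = ' + QUOTENAME(d.ParameterName, '''')
                 + ' THEN ParameterValue END) AS ' + QUOTENAME(d.ParameterName)
       FROM (SELECT DISTINCT ParameterName FROM #NVP) d
    ;
    --===== Wrap the column list in the rest of the Cross-Tab and execute it
     SELECT @SQL = 'SELECT Date, Time,' + @SQL + NCHAR(10)
                 + '  FROM #NVP GROUP BY Date, Time ORDER BY Date, Time'
    ;
    EXEC (@SQL)
    ;
    ```

    Because the column list is generated from the data itself, new parameters like "NewParameter" show up without any code changes, which is the "auto-magic" trade-off being described.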

    --Jeff Moden



  • Here's the output from the first bit of code above...

    Date       Time     ParameterName ParameterValue
    ---------- -------- ------------- -----------------------------------------------------------------------
    2010-04-08 00:00:02 Cmd           Ping
    2010-04-08 00:00:02 User          User1_ID
    2010-04-08 00:00:02 DeviceId      PALM24854ace5ac4e080c918b7213ce4
    2010-04-08 00:00:02 DeviceType    Palm
    2010-04-08 00:00:02 Log           V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_
    2010-04-08 00:00:03 User          User2_ID
    2010-04-08 00:00:03 DeviceId      Appl84930CUZ3NP
    2010-04-08 00:00:03 DeviceType    iPhone
    2010-04-08 00:00:03 Cmd           Ping
    2010-04-08 00:00:03 Log           V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_
    2010-04-08 00:00:04 NewParameter  Whatever
    2010-04-08 00:00:04 User          User2_ID
    2010-04-08 00:00:04 DeviceId      Appl84930CUZ3NP
    2010-04-08 00:00:04 DeviceType    iPhone
    2010-04-08 00:00:04 Cmd           Ping
    2010-04-08 00:00:04 Log           V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_

    Here's the output from the second bit of code above...

    Date       Time     Cmd  DeviceId                         DeviceType Log                                                                     User     NewParameter
    2010-04-08 00:00:02 Ping PALM24854ace5ac4e080c918b7213ce4 Palm       V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_                        User1_ID NULL
    2010-04-08 00:00:03 Ping Appl84930CUZ3NP                  iPhone     V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_ User2_ID NULL
    2010-04-08 00:00:04 Ping Appl84930CUZ3NP                  iPhone     V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_ User2_ID Whatever

    --Jeff Moden



  • Hey folks (especially Wayne). I ran into a bit of a buzz saw with the new split function that I previously posted (and have removed from this thread). It turns out that it's quite machine dependent. I ran the new code on a million row split at work and it came in at 4:18. To see how much faster it was than the old code, I ran the old code on the identical scenario and data... it came in at 3:48!!! In other words, the old code similar to what Wayne used came in a full 30 seconds faster on a lousy million rows.

    Go back to the old code, Wayne. It's not machine dependent like the new code seems to be.

    I apologize for the confusion.

    --Jeff Moden



  • T-SQL is not the optimal choice for parsing text files - and IIS logs in particular.

    Microsoft provides a specific tool for this purpose - Log Parser.

    I once had to analyse many tens of billions of lines of IIS log files - using Log Parser on another machine and then bulk loading the results into SQL Server made life considerably easier, the process many times faster, and the whole thing more scalable.

    Log Parser is an extremely powerful tool, and supports a SQL-like language.

    Download: http://www.microsoft.com/downloads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en

    Tutorial: http://www.msexchange.org/tutorials/Using-Logparser-Utility-Analyze-ExchangeIIS-Logs.html
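
    For reference, a typical Log Parser 2.2 invocation against IIS W3C logs looks something like the following. Treat this as illustrative rather than definitive: the field names come from the W3C header of your particular logs, and the file mask here is just an example.

    ```
    logparser.exe "SELECT date, time, cs-uri-query INTO results.csv FROM ex*.log" -i:IISW3C -o:CSV
    ```

    The query language is SQL-like, so filtering (WHERE), aggregation (GROUP BY), and direct output to CSV or SQL Server are all available from the command line.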

  • Or, you can just do it in T-SQL and not have to worry about it much. 😉

    --Jeff Moden



  • Jeff Moden (4/19/2010)


    Or, you can just do it in T-SQL and not have to worry about it much. 😉

    It depends. If you only have a few files to process (or if performance is not important, and the SQL Server couldn't be better used for non-file-manipulation tasks) sure, hack it in T-SQL!

    That doesn't make it a good solution - just one that works. 😉

    Log Parser is built specifically for this task.

    I guess I'll have to test the performance of having a separate app handle the file first... it may be the missing link that BCP and BULK INSERT don't have. Thanks for the link to Log Parser... I'll take a look.

    BTW... if Log Parser breaks (for some reason) or does something weird, does Microsoft actually support it or is it an "unsupported hack" itself?

    --Jeff Moden



  • @bill.brazell,

    Bill, I have a question about your data, please. Obviously, you have this data in a table already. My question would be, what process steps were used to get it there? Thanks for the info...

    --Jeff Moden



  • The data is imported from a flat file. Nothing tricky here - just mapped the fields. I still haven't finalized the process of getting the logs copied from each of the CAS in the company. I may look into ODBC logging, but it may be a process hog, or so I have read.

    We have been using LogParser for a while now. What I found recently is that it failed to produce output from certain logs. I first suspected the size since one file was over 300MB. I split it in half. Still no output. Split it in thirds. Only 2 of the 3 produced output. When LogParser fails, it doesn't give an indication why (not that I could find). That's when I started taking a closer look at using SQL.

    Ideally, the data could be used for real-time analysis and troubleshooting. I have a feeling that would be next to impossible given the number of devices and that the data is spread across the country.

    Thanks again for the code snips. I'll run it against a larger data set and see how it goes. Cheers!

    Thanks for the excellent information, especially about LogParser and file size. I don't know if LogParser suffers from the same problem as many other such products, but the ones that have file size limits typically have them because the authors of the software make the mistake of trying to read all of the data into memory at once instead of using the tried and true method of...

    1. Open the source file for read.
    2. Open the destination file for write.
    3. Read a row.
    4. Process the row.
    5. Write the row.
    6. Loop back to step 3 until done.
    7. Close both files and quit.

    The code snippet that I wrote should be able to handle several hundred thousand rows without much of a problem and with some pretty good speed. If you run into a performance problem, please feel free to post back on this thread.

    Either way, I'd love to hear what you finally end up with and what the perceived performance is because I find myself doing this type of stuff for large ETL projects a lot.

    Thanks again for the feedback.

    --Jeff Moden



  • I still don't know if it's practical to move around that much data, but at least we have some fat WAN pipes. One environment generates over 300MB/day in CAS logs. I'm guessing somewhere near 0.5GB for the whole company...daily.

    OK. I feel that I'm getting closer on this, but need some more help please. I'm trying to run the script against a "real" table (called Raw_Logs). I've changed the references to #mytable to Raw_Logs and ran it again. The date and time come across fine, but the columns that DelimitedSplit returns are NULL. I've tried some variations but can't determine where it's trying to look for the data. Let me know if you want a script to create/populate a table like the "real" one I'm testing against.

    Thanks!

  • Bill,

    Yeah, at this point, we need some more information. Please post the code for the split function you're actually using so we can check it for errors, the table definition of the "real" table, AND, if there's nothing private, proprietary, or hackable in the table (once you've loaded the table), any ten rows of the actual data from the table (we only need the pertinent columns). Please see the article at the first link in my signature below for how to do that so the data is "readily consumable".

    --Jeff Moden



Viewing 15 posts - 16 through 30 (of 33 total)
