Select a string when data in the string varies

  • WayneS (4/14/2010)


    bill.brazell (4/14/2010)


    This is looking very promising. Thank you.

    As for the output, I'd prefer that it not include the name of the delimiter, only the value, i.e. "Ping" instead of "Cmd=Ping". Is this doable? Thanks again!

    Yes, it is. And this is a perfect example of why you should have included the expected results in your original request... it would have been done right the first time instead of having to come back to it. Please read the first link in my signature for how to post data for faster, tested results that do what you want, and be sure to include how you want the results to look.

    Anyway, change the last select to this:

    SELECT t1.date, t1.time,
           [Cmd]        = MAX(CASE WHEN LEFT(Item,4)  = 'Cmd='        THEN SUBSTRING(Item, 5, 50)  ELSE NULL END),
           [User]       = MAX(CASE WHEN LEFT(Item,5)  = 'User='       THEN SUBSTRING(Item, 6, 50)  ELSE NULL END),
           [DeviceID]   = MAX(CASE WHEN LEFT(Item,9)  = 'DeviceID='   THEN SUBSTRING(Item, 10, 50) ELSE NULL END),
           [DeviceType] = MAX(CASE WHEN LEFT(Item,11) = 'DeviceType=' THEN SUBSTRING(Item, 12, 50) ELSE NULL END),
           [Log]        = MAX(CASE WHEN LEFT(Item,4)  = 'Log='        THEN SUBSTRING(Item, 5, 50)  ELSE NULL END)
      FROM #mytable t1
     CROSS APPLY DelimitedSplit(csuriquery, @Delimiter) t2
     GROUP BY t1.date, t1.time
     ORDER BY t1.date, t1.time

    Wayne... just split on the "=" sign using SUBSTRING and CHARINDEX. It doesn't need to be Pivoted using a CrossTab to be used very effectively.

    As for the data... it looks like a stream from some form of automation and probably can't be changed.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/14/2010)


    Wayne... just split on the "=" sign using SUBSTRING and CHARINDEX. It doesn't need to be Pivoted using a CrossTab to be used very effectively.

    I utilized the method I did for these reasons:

    1. The delimited data being stripped out and being put into columns is in random order (and we don't know that all 5 "columns" will be in every string, though it is in the test data).

    2. Since I'm already having to test to see which column the data belongs to, at that point it would be faster to use the hard-coded literal in the SubString call instead of having to calculate it in the CharIndex one... since I know it, just use it and save a clock cycle or two.

    That's why I did it that way. Now, I'm interested in seeing what you had in mind (especially the part about not needing to be pivoted) - I'm just not seeing how to do this any other way. So... if you don't mind teaching, I'm ready to learn!

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (4/14/2010)


    Jeff Moden (4/14/2010)


    Wayne... just split on the "=" sign using SUBSTRING and CHARINDEX. It doesn't need to be Pivoted using a CrossTab to be used very effectively.

    I utilized the method I did for these reasons:

    1. The delimited data being stripped out and being put into columns is in random order (and we don't know that all 5 "columns" will be in every string, though it is in the test data).

    2. Since I'm already having to test to see which column the data belongs to, at that point it would be faster to use the hard-coded literal in the SubString call instead of having to calculate it in the CharIndex one... since I know it, just use it and save a clock cycle or two.

    That's why I did it that way. Now, I'm interested in seeing what you had in mind (especially the part about not needing to be pivoted) - I'm just not seeing how to do this any other way. So... if you don't mind teaching, I'm ready to learn!

    Sorry... got home from work at 1 this morning... wee bit of a crunch needless to say. Just checking for "quick" responses with morning coffee before work... I'll have to get to this tonight after work. The code is easy... explaining it will not be.

    --Jeff Moden



  • Here's what I was talking about to simplify the split at the "=" sign...

    --=================================================================================================
    -- Test table setup. This is NOT a part of the solution.
    --=================================================================================================
    --===== Conditionally drop the test table to make reruns easier
    IF OBJECT_ID('TempDB..#YourTable','U') IS NOT NULL
       DROP TABLE #YourTable
    ;
    GO
    --===== Create the test table
    CREATE TABLE #YourTable
            (
            Date       NCHAR(10)      NOT NULL,
            Time       NCHAR(8)       NOT NULL,
            CsuriQuery NVARCHAR(2000) NOT NULL
            )
    ;
    --===== Populate the test table with test data
    INSERT INTO #YourTable
            (Date, Time, CsuriQuery)
    SELECT '2010-04-08','00:00:02','Cmd=Ping&User=User1_ID&DeviceId=PALM24854ace5ac4e080c918b7213ce4&DeviceType=Palm&Log=V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_' UNION ALL
    SELECT '2010-04-08','00:00:03','User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_' UNION ALL
    SELECT '2010-04-08','00:00:04','NewParameter=Whatever&User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_'
    ;
    --=================================================================================================
    -- Split the parameter names and values into an "NVP" (Name/Value Pair) table.
    -- Yes... it's denormalized but it will "auto-magically" capture new/additional parameters
    -- without any changes to the code as it does for line "3" in the test data above.
    --=================================================================================================
    --===== Finish the split at the "=" sign to put the ParameterName and ParameterValue in
         -- separate columns.
     SELECT yt1.Date,
            yt1.Time,
            SUBSTRING(ca.Item, 1, CHARINDEX(N'=',ca.Item)-1)    AS ParameterName,
            SUBSTRING(ca.Item, CHARINDEX(N'=',ca.Item)+1, 4000) AS ParameterValue
       FROM #YourTable yt1
      CROSS APPLY
            (--==== This does the main split returning parameter names and values separated by "="
             SELECT yt2.Item
               FROM dbo.DelimitedSplit8K(yt1.CsuriQuery, N'&') yt2
            ) ca
      ORDER BY yt1.Date, yt1.Time, ParameterName

    Of course, the details are in the comments in the code. 😉

    If you've just gotta have it all horizontal and all normalized and all, then to keep the ability to "auto-magically" handling new/different parameters you'd have to use a little dynamic SQL to make a dynamic Cross-Tab. If you don't mind losing that ability, then the additional split in the code above still makes doing a static Cross-Tab much easier...

    --=================================================================================================
    -- Test table setup. This is NOT a part of the solution. (Same as previous example)
    --=================================================================================================
    --===== Conditionally drop the test table to make reruns easier
    IF OBJECT_ID('TempDB..#YourTable','U') IS NOT NULL
       DROP TABLE #YourTable
    ;
    GO
    --===== Create the test table
    CREATE TABLE #YourTable
            (
            Date       NCHAR(10)      NOT NULL,
            Time       NCHAR(8)       NOT NULL,
            CsuriQuery NVARCHAR(2000) NOT NULL
            )
    ;
    --===== Populate the test table with test data
    INSERT INTO #YourTable
            (Date, Time, CsuriQuery)
    SELECT '2010-04-08','00:00:02','Cmd=Ping&User=User1_ID&DeviceId=PALM24854ace5ac4e080c918b7213ce4&DeviceType=Palm&Log=V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_' UNION ALL
    SELECT '2010-04-08','00:00:03','User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_' UNION ALL
    SELECT '2010-04-08','00:00:04','NewParameter=Whatever&User=User2_ID&DeviceId=Appl84930CUZ3NP&DeviceType=iPhone&Cmd=Ping&Log=V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_'
    ;
    --=================================================================================================
    -- Split the parameter names and values into an "NVP" (Name/Value Pair) table.
    -- Yes... it's denormalized but it will "auto-magically" capture new/additional parameters
    -- without any changes to the code as it does for line "3" in the test data above.
    -- Changed the SELECT to a CTE so we can use it to drive the Cross-Tab.
    --=================================================================================================
    WITH
    cteSplit AS
    (--==== Finish the split at the "=" sign to put the ParameterName and ParameterValue in
         -- separate columns.
     SELECT yt1.Date,
            yt1.Time,
            SUBSTRING(ca.Item, 1, CHARINDEX(N'=',ca.Item)-1)    AS ParameterName,
            SUBSTRING(ca.Item, CHARINDEX(N'=',ca.Item)+1, 4000) AS ParameterValue
       FROM #YourTable yt1
      CROSS APPLY
            (--==== This does the main split returning parameter names and values separated by "="
             SELECT yt2.Item
               FROM dbo.DelimitedSplit8K(yt1.CsuriQuery, N'&') yt2
            ) ca
    )
     SELECT Date,
            Time,
            MAX(CASE WHEN ParameterName = 'Cmd'          THEN ParameterValue END) AS Cmd,
            MAX(CASE WHEN ParameterName = 'DeviceId'     THEN ParameterValue END) AS DeviceId,
            MAX(CASE WHEN ParameterName = 'DeviceType'   THEN ParameterValue END) AS DeviceType,
            MAX(CASE WHEN ParameterName = 'Log'          THEN ParameterValue END) AS [Log],
            MAX(CASE WHEN ParameterName = 'User'         THEN ParameterValue END) AS [User],
            MAX(CASE WHEN ParameterName = 'NewParameter' THEN ParameterValue END) AS NewParameter
       FROM cteSplit
      GROUP BY Date, Time
      ORDER BY Date, Time
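
    For completeness, the dynamic Cross-Tab alluded to above could be sketched roughly like the following. This is an untested outline, not part of the original post: it assumes the Name/Value pairs from the split have been materialized into a temp table I'm calling #NVP (Date, Time, ParameterName, ParameterValue), and it builds one MAX(CASE...) column per distinct parameter name actually present in the data.

    ```sql
    --===== Hypothetical sketch: #NVP is assumed to hold the output of the split above.
    DECLARE @SQL NVARCHAR(MAX)
    ;
    --===== Build one MAX(CASE...) column per distinct parameter name found in the data
     SELECT @SQL = ISNULL(@SQL + ',', '')
                 + NCHAR(10)
                 + '       MAX(CASE WHEN ParameterName = ' + QUOTENAME(d.ParameterName, '''')
                 + ' THEN ParameterValue END) AS ' + QUOTENAME(d.ParameterName)
       FROM (SELECT DISTINCT ParameterName FROM #NVP) d
    ;
    --===== Wrap the column list in the rest of the Cross-Tab and execute it
     SELECT @SQL = 'SELECT Date, Time,' + @SQL + NCHAR(10)
                 + '  FROM #NVP GROUP BY Date, Time ORDER BY Date, Time'
    ;
    EXEC (@SQL)
    ;
    ```

    Because the column list is generated from the data itself, new parameters like "NewParameter" show up without any code changes, which is the "auto-magic" trade-off being described.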

    --Jeff Moden



  • Here's the output from the first bit of code above...

    Date       Time     ParameterName ParameterValue
    ---------- -------- ------------- -----------------------------------------------------------------------
    2010-04-08 00:00:02 Cmd           Ping
    2010-04-08 00:00:02 User          User1_ID
    2010-04-08 00:00:02 DeviceId      PALM24854ace5ac4e080c918b7213ce4
    2010-04-08 00:00:02 DeviceType    Palm
    2010-04-08 00:00:02 Log           V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_
    2010-04-08 00:00:03 User          User2_ID
    2010-04-08 00:00:03 DeviceId      Appl84930CUZ3NP
    2010-04-08 00:00:03 DeviceType    iPhone
    2010-04-08 00:00:03 Cmd           Ping
    2010-04-08 00:00:03 Log           V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_
    2010-04-08 00:00:04 NewParameter  Whatever
    2010-04-08 00:00:04 User          User2_ID
    2010-04-08 00:00:04 DeviceId      Appl84930CUZ3NP
    2010-04-08 00:00:04 DeviceType    iPhone
    2010-04-08 00:00:04 Cmd           Ping
    2010-04-08 00:00:04 Log           V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_

    Here's the output from the second bit of code above...

    Date       Time     Cmd  DeviceId                         DeviceType Log                                                                     User     NewParameter
    2010-04-08 00:00:02 Ping PALM24854ace5ac4e080c918b7213ce4 Palm       V121_LdapC1_LdapL0_RpcC29_RpcL36_Hb1740_Rto1_S1_                        User1_ID NULL
    2010-04-08 00:00:03 Ping Appl84930CUZ3NP                  iPhone     V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_ User2_ID NULL
    2010-04-08 00:00:04 Ping Appl84930CUZ3NP                  iPhone     V121_Sst20_LdapC0_LdapL0_RpcC57_RpcL63_Hb975_Rto1_Erq1_Pk4026417530_S1_ User2_ID Whatever

    --Jeff Moden



  • Hey folks (especially Wayne). I ran into a bit of a buzz saw with the new split function that I previously posted (and have removed from this thread). It turns out that it's quite machine dependent. I ran the new code on a million row split at work and it came in at 4:18. To see how much faster it was than the old code, I ran the old code on the identical scenario and data... it came in at 3:48!!! In other words, the old code similar to what Wayne used came in a full 30 seconds faster on a lousy million rows.

    Go back to the old code, Wayne. It's not machine dependent like the new code seems to be.

    I apologize for the confusion.

    --Jeff Moden



  • T-SQL is not the optimal choice for parsing text files - and IIS logs in particular.

    Microsoft provides a specific tool for this purpose - Log Parser.

    I once had to analyse many tens of billions of lines of IIS log files - using Log Parser on another machine and then bulk loading the results into SQL Server made life considerably easier, the process many times faster, and the whole thing more scalable.

    Log Parser is an extremely powerful tool, and supports a SQL-like language.

    Download: http://www.microsoft.com/downloads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en

    Tutorial: http://www.msexchange.org/tutorials/Using-Logparser-Utility-Analyze-ExchangeIIS-Logs.html
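
    For reference, a typical Log Parser 2.2 invocation against IIS W3C logs looks something like the following. Treat this as illustrative rather than definitive: the field names come from the W3C header of your particular logs, and the file mask here is just an example.

    ```
    logparser.exe "SELECT date, time, cs-uri-query INTO results.csv FROM ex*.log" -i:IISW3C -o:CSV
    ```

    The query language is SQL-like, so filtering (WHERE), aggregation (GROUP BY), and direct output to CSV or SQL Server are all available from the command line.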

  • Or, you can just do it in T-SQL and not have to worry about it much. 😉

    --Jeff Moden



  • Jeff Moden (4/19/2010)


    Or, you can just do it in T-SQL and not have to worry about it much. 😉

    It depends. If you only have a few files to process (or if performance is not important, and the SQL Server couldn't be better used for non-file-manipulation tasks) sure, hack it in T-SQL!

    That doesn't make it a good solution - just one that works. 😉

    Log Parser is built specifically for this task.

    I guess I'll have to test the performance of having a separate app handle the file first... it may be the missing link that BCP and BULK INSERT don't have. Thanks for the link to Log Parser... I'll take a look.

    BTW... if Log Parser breaks (for some reason) or does something weird, does Microsoft actually support it or is it an "unsupported hack" itself?

    --Jeff Moden



  • @bill.brazell,

    Bill, I have a question about your data, please. Obviously, you have this data in a table already. My question would be, what process steps were used to get it there? Thanks for the info...

    --Jeff Moden



  • The data is imported from a flat file. Nothing tricky here - just mapped the fields. I still haven't finalized the process of getting the logs copied from each of the CAS in the company. I may look into ODBC logging, but it may be a process hog, or so I have read.

    We have been using LogParser for a while now. What I found recently is that it failed to produce output from certain logs. I first suspected the size since one file was over 300MB. I split it in half. Still no output. Split it in thirds. Only 2 of the 3 produced output. When LogParser fails, it doesn't give an indication why (not that I could find). That's when I started taking a closer look at using SQL.

    Ideally, the data could be used for real-time analysis and troubleshooting. I have a feeling that would be next to impossible given the number of devices and that the data is spread across the country.

    Thanks again for the code snips. I'll run it against a larger data set and see how it goes. Cheers!

    Thanks for the excellent information, especially about LogParser and file size. I don't know if LogParser suffers from the same problem as many other such products, but the ones that have file size limits typically have them because the authors of the software make the mistake of trying to read all of the data into memory at once instead of using the tried and true method of...

    1. Open the source file for read.
    2. Open the destination file for write.
    3. Read a row.
    4. Process the row.
    5. Write the row.
    6. Loop back to step 3 until done.
    7. Close both files and quit.

    The code snippet that I wrote should be able to handle several hundred thousand rows without much of a problem and with some pretty good speed. If you run into a performance problem, please feel free to post back on this thread.

    Either way, I'd love to hear what you finally end up with and what the perceived performance is because I find myself doing this type of stuff for large ETL projects a lot.

    Thanks again for the feedback.

    --Jeff Moden



  • I still don't know if it's practical to move around that much data, but at least we have some fat WAN pipes. One environment generates over 300MB/day in CAS logs. I'm guessing somewhere near 0.5GB for the whole company...daily.

    OK. I feel that I'm getting closer on this, but need some more help please. I'm trying to run the script against a "real" table (called Raw_Logs). I've changed the references to #mytable to Raw_Logs and ran it again. The date and time come across fine, but the columns that DelimitedSplit returns are NULL. I've tried some variations but can't determine where it's trying to look for the data. Let me know if you want a script to create/populate a table like the "real" one I'm testing against.

    Thanks!

  • Bill,

    Yeah, at this point, we need some more information. Please post the code for the split function you're actually using so we can check it for errors, the table definition of the "real" table, AND, if there's nothing private, proprietary, or hackable in the table (once you've loaded the table), any ten rows of the actual data from the table (we only need the pertinent columns). Please see the article at the first link in my signature below for how to do that so the data is "readily consumable".

    --Jeff Moden



Viewing 15 posts - 16 through 30 (of 33 total)
