June 9, 2015 at 11:38 am
Hello all,
I have a table with raw scientific test results in a single field, some of which are over 25Mb field. I need to parse into the field to find and aggregate selected values from the field.
Table structure is
CREATE TABLE [dbo].[Gxxx_Data](
[id] [uniqueidentifier] NOT NULL,
[Status] [nvarchar](50) NULL,
[GxxxItem_ID] [int] NULL,
[Stats_Data] [varbinary](max) NULL,
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
The data, after conversion should look something like this:
fmod.ThreadContext.Inst_Cnt.12120 2040 # 1TS[highlight=#ffff11]mov[/highlight] (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1TS[highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12152 2040 # 1TS[highlight=#ffff11]mov[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }
From which I need to parse and summarize the (Assembler) opcodes (MOV,CMPi, SHR etc...)
I need to parse the large field [Stats_Data] to locate the target data.
The internal result strings are delimited with Char(10), conservative counts are from 64k to over 100k lines in each record.
Is there a way to parse the individual lines into another table (temp) that would be queried/regexed ?
Thank you in advance
Tom in Sacramento - For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
June 9, 2015 at 1:08 pm
The easiest way would be to do things "on the way in". How do these large Stats_Data values get into the table you posted to begin with?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 9, 2015 at 1:26 pm
Hi Jeff,
I agree that would be easiest however that still leaves the existing records, is there a smarter way of dealing with them?
Maybe I should build the search in binary format and patindex to the first 'line' which is usually around 55M bytes into the file. I know that cursors are spoken of badly, but does a cte or set based solution come to mind?
I appreciate your thoughts on this
Tom in Sacramento - For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
June 9, 2015 at 1:45 pm
The internal result strings are delimited with Char(10), conservative counts are from 64k to over 100k lines in each record. ...
Is there a way to parse the individual lines into another table (temp) that would be queried/regexed ?
Since the strings are delimited with a CHAR(10) you could split the strings using Jeff's splitter (referenced in my signature). You would have to modify the splitter to deal with varchar(max). I have a modified (and unofficial) version of it that I use for strings longer that 8K. Using the splitter you could also filter out rows that don't contain the data that you need to aggregate. Then parse the ones that you do accordingly.
Tom_Sacramento (6/9/2015)
Hi Jeff,... is there a smarter way of dealing with them?
Maybe I should build the search in binary format and patindex to the first 'line' which is usually around 55M bytes into the file. I know that cursors are spoken of badly, but does a cte or set based solution come to mind?
I appreciate your thoughts on this
Depending on exactly what you are parsing I am confident that you could parse out the data that you need without Regex using a purely set-based solution. Is it possible for you to post an example of what you need to parse? Also are you saying that that data that you need is in the first "line"?
-- Itzik Ben-Gan 2001
June 9, 2015 at 2:54 pm
Hi Alan,
The converted string (from varbinary) looks like these 3 records (targets around position 40 but may be up to position 73, ergo my thought on REGEX...):
1 2 3 4 5 6
01234567890123456789012345678901234567890123456789012345678901234567890
fmod.ThreadContext.Inst_Cnt.99920 2040 # 1TSmov (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.99936 2040 # 1TSshl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.99952 2040 # 1TSmov (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }
From these 'lines' I need the OpCodes (mnemonics @pos 40).
The row markers are consistent, starting with 'fmod.ThreadContext' of which there are usually 1,000 to 1,500 records of interest
Not sure how split this much data with the 8k splitter
Thank you
Tom in Sacramento - For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
June 9, 2015 at 4:16 pm
Hi Tom,
Understood on the existing rows.
Peeling 1 potato at a time, can you use SQL CLR? The reason I ask is because there's a decent SQL CLR splitter that will do the trick on the first pass on these rows. Lemme know on that.
On the rest of it, I need to know two things, please.
1) What is the primary key column of the table and
2) Given the following row, what exactly do you want returned as part of the Assembly Mnemonic (or otherwise)?
fmod.ThreadContext.Inst_Cnt.99936 2040 # 1 TS shl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
--Jeff Moden
Change is inevitable... Change for the better is not.
June 10, 2015 at 6:07 am
He Jeff,
The primary key is a GUID called ID - not my idea, but this is an inherited database...
What I need is a count of each of the 3 or 4 character OpCodes - they are usually found between position 40 and 73 on a given line. (I put together a list of opcodes in a table in case that is useful since there are only @12 different OpCodes - Add, And, ASL, ASR, Cmp, JmpI, Mov, SHL, SHR...)
The lines translate from varbinary to varchar like this:
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS shl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
Note the first half of the line - all targets start with 'fmod.ThreadContext.Inst_Cnt', followed by more variable text, followed by 5 or more spaces then the OpCode
Thank you,
Tom
Tom in Sacramento - For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
June 10, 2015 at 6:40 am
ok, using pure string manipulation, can you ASSUME, the value you want is between a tab and the first left parenthesis?
someone will pretty this up with a cross apply, which looks prettier than the inline stuff, but the concept seems to hold true.
this would work dynamically, i think.
/*--Results
MySubString | MyReversedAndTrimmed SubStringToTab SubStringToTab
fmod.ThreadContext.Inst_Cnt.99920 2040 # 1 TS mov | vom ST 1 # 0402 02999.tnC_tsnI.txetnoCdaerhT.domf vom mov
fmod.ThreadContext.Inst_Cnt.99936 2040 # 1 TS shl | lhs ST 1 # 0402 63999.tnC_tsnI.txetnoCdaerhT.domf lhs shl
fmod.ThreadContext.Inst_Cnt.99952 2040 # 1 TS mov | vom ST 1 # 0402 25999.tnC_tsnI.txetnoCdaerhT.domf vom mov
*/
--create demo data
IF OBJECT_ID('tempdb.[dbo].[#Gxxx_Data]') IS NOT NULL
DROP TABLE [dbo].[#Gxxx_Data]
CREATE TABLE #Gxxx_Data(
[id] [uniqueidentifier] NOT NULL,
[Status] [nvarchar](50) NULL,
[GxxxItem_ID] [int] NULL,
[Stats_Data] [varbinary](max) NULL)
INSERT INTO #Gxxx_Data([id],[Stats_Data])
SELECT NEWID(),CONVERT(varbinary(max),'fmod.ThreadContext.Inst_Cnt.99920 2040 # 1TSmov (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }') UNION ALL
SELECT NEWID(),CONVERT(varbinary(max),'fmod.ThreadContext.Inst_Cnt.99936 2040 # 1TSshl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }') UNION ALL
SELECT NEWID(),CONVERT(varbinary(max),'fmod.ThreadContext.Inst_Cnt.99952 2040 # 1TSmov (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }');
WITH MyCTE
AS
(
SELECT
id,
CONVERT(VARCHAR(max),[Stats_Data]) AS [Stats_Data],
CHARINDEX('(',CONVERT(VARCHAR(max),[Stats_Data]))-1 As FirstLeftParen
FROM #Gxxx_Data
)
SELECT
LEFT([Stats_Data],FirstLeftParen) AS MySubString,
RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))) As MyReversedAndTrimmed,
LEFT(RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))),CHARINDEX(CHAR(9),RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))))) As SubStringToTab,
REVERSE(LEFT(RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen))),CHARINDEX(CHAR(9),RTRIM(REVERSE(LEFT([Stats_Data],FirstLeftParen)))))) As SubStringToTab
FROM MyCTE
Lowell
June 10, 2015 at 7:39 am
Tom_Sacramento (6/10/2015)
He Jeff,The primary key is a GUID called ID - not my idea, but this is an inherited database...
What I need is a count of each of the 3 or 4 character OpCodes - they are usually found between position 40 and 73 on a given line. (I put together a list of opcodes in a table in case that is useful since there are only @12 different OpCodes - Add, And, ASL, ASR, Cmp, JmpI, Mov, SHL, SHR...)
The lines translate from varbinary to varchar like this:
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS shl (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
Note the first half of the line - all targets start with 'fmod.ThreadContext.Inst_Cnt', followed by more variable text, followed by 5 or more spaces then the OpCode
Thank you,
Tom
Thanks, Tom. Lowell did pretty much the same thing I would have. I'd give that a try. It's a generic solution that relies on the position of the TAB rather than on textual content, which could change.
I guess my only question would be, does the data ever contain a Mnemonic that does not include a left parenthesis?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 10, 2015 at 8:29 am
Thanks Jeff and Lowell,
Just to clarify the data, it looks like this in the table:
0x0A0D302064657070696B5365426F5464656B72614D73766F4D6D754E646E
From this I need to break it into manageable (VARCHAR) pieces, like this into a temp table:
fmod.weSamplerEuBypass.weNumEuBypassSendsApplied 0
fmod.weSamplerEuBypass.weNumEuBypassSendsExecuted 0
Then search each temp table record for one of the known OpCodes
Move to the next temp table record and repeat until the EOF
Jeff - you mentioned a SQLCLR splitter ... since the field is so large(64-100Mb) it sounds like the right tool due to limitations of varchar(max)
Thanks much
Tom in Sacramento - For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
June 10, 2015 at 9:53 am
What I need is a count of each of the 3 or 4 character OpCodes - they are usually found between position 40 and 73 on a given line. (I put together a list of opcodes in a table in case that is useful since there are only @12 different OpCodes
Are you trying to count number of occurances of some particular codes?
If theat is what you really want, you don't need to parse/split your values/
Based on your sample data example you can do just this:
declare @val varchar(max)
set @val = 'fmod.ThreadContext.Inst_Cnt.12120 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12152 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12120 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (1) r3.4<1>:ud r1.2<0;1,0>:ud { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12152 2040 # 1 TS [highlight=#ffff11]mov[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
fmod.ThreadContext.Inst_Cnt.12152 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (8) r5.0<1>:ud r0.0<8;8,1>:ud { Align1, Q1, NoMask, Compacted }
fmod.ThreadContext.Inst_Cnt.12136 2040 # 1 TS [highlight=#ffff11]shl[/highlight] (1) r4.0<1>:d r2.7<0;1,0>:ud 0x50005:uw { Align1, Q1 }
'
-- That will give you 100,000 lines
select top 10 @val = @val + @val
from sys.syscolumns
-- get count (just apply it to your column for each opcode you have)
SELECT (LEN(@val ) - LEN(REPLACE(@val, ']shl[', '')))/LEN(']shl[') AS Count_SHL
,(LEN(@val ) - LEN(REPLACE(@val, ']mov[', '')))/LEN(']mov[') AS Count_MOV
ops. small code correction
even better one:
-- get count (just apply it to your column)
SELECT OpCode, (LEN(@val ) - LEN(REPLACE(@val, ChkVal, '')))/LEN(ChkVal) AS Count
FROM (VALUES ('Add'), ('And'), ('ASL'), ('ASR'), ('Cmp'), ('JmpI'), ('Mov'), ('SHL'), ('SHR')) c(OpCode)
CROSS APPLY (SELECT ']' + OpCode + '[') f(ChkVal)
June 11, 2015 at 11:19 am
Hi Eugene,
I wish it were that easy but the first problem is that the data is in a very large varbinary field(64-100Mb) that contains 50-150,000 lines (delimited Char(10)) as hex:
0x232053696D436F72652E646C6C0D0A232076657273696F6E3A2031352D30342D323820623035393438382028723132383130290D0A23206275696C643A20202031352D30342D3239206C61625F63692052656C656173654943432833322D626974290D0A23204
From this big piece of data I need to break it into lines, that have standard segment markers. Since I can't get PatIndex to deal with Varbinary(max) I'm thinking that I will have to character-by-character read the field until I find the line break (0x0a) and then if the line starts with my target string then write that into a temp table that I can count the OpCodes (1 per line)
Hopefully that clarifies my dilemma
Thank you,
Tom
Tom in Sacramento - For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
June 11, 2015 at 6:39 pm
Tom_Sacramento (6/10/2015)
Thanks Jeff and Lowell,Just to clarify the data, it looks like this in the table:
0x0A0D302064657070696B5365426F5464656B72614D73766F4D6D754E646E
From this I need to break it into manageable (VARCHAR) pieces, like this into a temp table:
fmod.weSamplerEuBypass.weNumEuBypassSendsApplied 0
fmod.weSamplerEuBypass.weNumEuBypassSendsExecuted 0
Then search each temp table record for one of the known OpCodes
Move to the next temp table record and repeat until the EOF
Jeff - you mentioned a SQLCLR splitter ... since the field is so large(64-100Mb) it sounds like the right tool due to limitations of varchar(max)
Thanks much
Apologies for the delay. I'll dig it out and see if it will handle multi-character parameters.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 11, 2015 at 6:58 pm
Crud. I found it but it won't work because it only accepts a single character delimiter. I'll try something else.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 11, 2015 at 7:04 pm
Ah. wait a minute. I've apparently gotten two different threads mixed up. The individual internal "rows" in your large binary are delimited by CHAR(10), correct?
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply