July 31, 2015 at 7:38 am
The data I am processing is normal in reality but a bit messy in SQL world.
I have a table which contains a field to be parsed out possible filename and then its extension, sounds easy, right? no! here is some of the records for the field:
e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof
c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg
c:\users\asdf\music\the cranberrieso need to argue\02 i can't be with you.mp3
e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html
Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party; prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping
01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70; 02/05/72; 02/05/75; 02/06/64; 02/06/66; 02/06/72; 02/07/68; 02/07/75; 02/09/66; 02/09/68; 03/03/75; 03/03/79; 03/04/68; 03/04/74; 03/05/82; 03/07/54; 03/07/75; 03/08/73; 03/08/86; 03/10/65; 03/11/57; 03/12/79; 03/14/42; 04/02/65; 04/03/85; 04/04/60; 04/04/64; 04/04/69; 04/05/38; 04/05/46; 04/05/48; 04/06/60; 04/07/65; 04/10/53; 04/10/72; 04/11/63; 04/11/78; 04/12/64; 04/12/71; 04/18/81; 04/28/37; 05.03.1957; 05/01/66; 05/01/74; 05/02/81; 05/04/1955; 05/04/55; 05/04/75; 05/05/30; 05/06/50; 05/06/78; 05/08/64; 05/08/75; 05/10/66; 05/12/65; 05/15/68; 06/06/63; 06/06/64; 06/07/77; 06/10/66; 06/12/1923; 07/01/36; 07/04/86; 07/06/54; 07/07/56; 07/07/66; 07/09/67; 07/12/62; 07/17/70; 08/02/38; 08/02/69; 08/08/80; 08/10/74; 08/11/65; 09 AUG 1966; 09/02/01; 09/02/51; 09/02/64; 09/02/65; 09/03/86; 09/05/78; 09/05/86; 09/09/85; 09/12/84; 09/19/88; 1 Apr. 1978; 1 Aug. 2007;
Please note each line above indicates a single row for that field.
You can see not all the rows are filenames, my task is:
1. filtered out only records contain filename
2. for those records, further find out what fall in a black list that most likely will be updated from time to time, currently the list contains 3/4 letters extensions:
*.wmv
*.MPG*
*.mkv
*.mpeg
*.ogg
*.m4v
*.3gp
*.AVI*
*.jpeg
*.tiff
*.asf
*.wav
*.ogv
*.tif
*.flv
*.MP3
*.jpg
*.wma
*.oga
The table is generated hourly, weekly analysis is required and for a week it contains roughly 2-3 millions records.
I'm seeking your enlightening to see what would be the best approach to this, any suggestion is appreciated
July 31, 2015 at 7:40 am
what I come out is something like the following script and I am really not satisfied with it, let alone it processes 3 letter extension only and I need to also cover 4 letter extension.
select left(reverse(left(reverse(breach_content),charindex('.',reverse(breach_content))-1)), 3), *
from reporting_dailydlpdetail_20150730
where charindex('.',reverse(breach_content))-1 <> -1
and
left(reverse(left(reverse(breach_content),charindex('.',reverse(breach_content))-1)), 3)
in
('bmp',
'aac',
'mov',
'swf',
'wmv',
'MPG*',
'mkv',
'ogg',
'm4v',
'3gp',
'AVI*',
'asf',
'wav',
'ogv',
'tif',
'flv',
'MP3',
'jpg',
'wma',
'oga')
July 31, 2015 at 1:46 pm
Maybe this will give you some ideas/alternatives:
SELECT
breach_content,
RIGHT(breach_content, CASE WHEN SUBSTRING(breach_content, LEN(breach_content) - 3, 1) = '.' THEN 3 ELSE 4 END) AS extension
FROM (
SELECT 'e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof' AS breach_content UNION ALL
SELECT 'c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg' UNION ALL
SELECT 'c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3' UNION ALL
SELECT 'e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html' UNION ALL
SELECT 'Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;' UNION ALL
SELECT 'prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping' UNION ALL
SELECT '01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;'
) AS test_data
WHERE breach_content LIKE '%[a-z]:%\%.%' AND
'.' IN (SUBSTRING(breach_content, LEN(breach_content) - 3, 1), SUBSTRING(breach_content, LEN(breach_content) - 4, 1))
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
July 31, 2015 at 2:09 pm
ScottPletcher (7/31/2015)
Maybe this will give you some ideas/alternatives:
SELECT
breach_content,
RIGHT(breach_content, CASE WHEN SUBSTRING(breach_content, LEN(breach_content) - 3, 1) = '.' THEN 3 ELSE 4 END) AS extension
FROM (
SELECT 'e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof' AS breach_content UNION ALL
SELECT 'c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg' UNION ALL
SELECT 'c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3' UNION ALL
SELECT 'e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html' UNION ALL
SELECT 'Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;' UNION ALL
SELECT 'prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping' UNION ALL
SELECT '01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;'
) AS test_data
WHERE breach_content LIKE '%[a-z]:%\%.%' AND
'.' IN (SUBSTRING(breach_content, LEN(breach_content) - 3, 1), SUBSTRING(breach_content, LEN(breach_content) - 4, 1))
Much more neat than mine, thank you very much Scott
July 31, 2015 at 2:15 pm
Can you explain '%[a-z]:%\%.%'? My first time to use it, thanks. Basically the sample data is sample only, the real data could be randomly and I have no idea what that could be, but still need to filter to exclude those non-file
July 31, 2015 at 2:27 pm
halifaxdal (7/31/2015)
Can you explain '%[a-z]:%\%.%'? My first time to use it, thanks. Basically the sample data is sample only, the real data could be randomly and I have no idea what that could be, but still need to filter to exclude those non-file
Sure.
'%[a-z]:%\%.%'
% = any character(s), or none at all. So LIKE '%' would match anything.
[a-z] = a character from a to z, that is, any letter
: = exactly a colon
% = again, any character(s), or none at all
\ = exactly a backslash
% = "
. = exactly a dot
% = "
So, to put it all together, to be included in the result, the string must have:
a letter &
followed immediately by : &
followed by \ after any number of chars &
followed by . after any number of chars
Thus, these strings would match:
c:\emp\abc.1
d:\abc.template
user:\file1.ext --tricky, but r: is a letter directly followed by a colon, with ANY char(s) before it
If you want to prevent this last match, you'll need two separate conditions:
LIKE '[a-z]:%\%.%' --meaning the string must start with a letter and a colon
OR LIKE '%[^a-z][a-z]:%\%.%' --EDITed for clarity:meaning anywhere in the string but the char immediately before "letter:" must NOT be a-z
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
July 31, 2015 at 2:41 pm
ScottPletcher (7/31/2015)
halifaxdal (7/31/2015)
Can you explain '%[a-z]:%\%.%'? My first time to use it, thanks. Basically the sample data is sample only, the real data could be randomly and I have no idea what that could be, but still need to filter to exclude those non-fileSure.
'%[a-z]:%\%.%'
% = any character(s), or none at all. So LIKE '%' would match anything.
[a-z] = a character from a to z, that is, any letter
: = exactly a colon
% = again, any character(s), or none at all
\ = exactly a backslash
% = "
. = exactly a dot
% = "
So, to put it all together, to be included in the result, the string must have:
a letter &
followed immediately by : &
followed by \ after any number of chars &
followed by . after any number of chars
Thus, these strings would match:
c:\emp\abc.1
d:\abc.template
user:\file1.ext --tricky, but r: is a letter directly followed by a colon, with ANY char(s) before it
If you want to prevent this last match, you'll need two separate conditions:
LIKE '[a-z]:%\%.%' --meaning the string must start with a letter and a colon,
OR LIKE '%[^a-z][a-z]:%\%.%' --meaning anywhere in the char immediately before the letter: must NOT be a-z
Thanks Scott, I understand now.
July 31, 2015 at 7:39 pm
halifaxdal (7/31/2015)
Thanks Scott, I understand now.
Understand even more...
https://msdn.microsoft.com/en-us/library/ms179859.aspx
--Jeff Moden
Change is inevitable... Change for the better is not.
July 31, 2015 at 7:48 pm
Jeff Moden (7/31/2015)
halifaxdal (7/31/2015)
Thanks Scott, I understand now.Understand even more...
Thanks
August 3, 2015 at 11:20 am
Here is a solution that will handle any length of extension. I added a *.ssmsproj entry to test a longer extension.
SELECT *, LEFT(RIGHT(td.breach_content, r.filename_index-1), r.filename_index - r.extension_index - 1), RIGHT(td.breach_content, r.extension_index-1)
FROM (
VALUES('e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof')
, ('c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg')
, ('c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3')
, ('e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html')
, ('Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;')
, ('prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping')
, ('01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;')
, ('f:\SSMS\test_project.ssmsproj')
) AS td(breach_content)
CROSS APPLY (VALUES(CHARINDEX('.', REVERSE(breach_content)), CHARINDEX('\', REVERSE(breach_content)))) AS r(extension_index, filename_index)
WHERE r.extension_index > 0
ANDr.filename_index > r.extension_index
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
August 4, 2015 at 11:44 am
Thanks Drew, your script failed as it also picks up records like:
DyKE; GaY; K!{%^r; K'r; K([[}-r; K>;[|?[r; K[)r; K[-[#,[r; K[:R; K[;$[r; K[R; K[r; K^r; K{#r; K{&[R; K{&[r; K{:}>r; K{;r; K{>r; K|r; K}";R; K}$R; K}$r; K}%){"r; K}([R; K}*[r; K}<[R; K}>r; PEE; S%M; S&[m; S&}$M; S=M; S["['[(<M; S[#=m; S[%M; S[(m; S[){-@},[m; S[-{'m; S[.M; S[<[m; S[M; S[[m; S[m; S[{[m; S[|[M; S`[m; S{$[m; S{%[M; S{%m; S{;[M; S{@M; S}#[m; S}&&{'[M; S}([M; S}.m; k#[r; k)!R; k*[`[r; k-R; k.[r; k/R; k/[r; k:[r; k;[.[&r; k[,`[\})[R; k[:r; k[=R; k[>[R; k[@r; k[R; k[^{,[![r; k[r; k_r; k{*;r; k{,[R; k{-[;[)R; k{=[r; k{?R; k} [R; k}"[r; k}#r; k}$R; k}$[r; k}${.[r; k}<[\[R; k}?}(r; k}@/[r; k}@[R; pEE; s"[,{>M; s#m; s%m; s)[:{),[?m; s)[m; s-{?[?-{;M; s;.[m; s;M; s>m; s@[*[M; s[!}'m; s["(}*M; s[.M; s[<=[m; s[=[M; s[>M; s[M; s[m; s[}[m; s[~m; s\[m; s^m; s`M; s`[m; s{#[m; s{$[M; s{)}([m; s{.[m; s{>{*[M; s}'m; s}'} [<[&}?[M; s}([m; s})&[M; s}/}$M; s}>[m; s}@M; s}@[M
PEE; S&[M; S-m; S<[m; SM; S[&[m; S[([m; S[({'M; S[)M; S[=m; S[M; S[m; S[~[:[?M; S[~{(M; S[~|[m; S_{'m; Sm; S{$[m; S{'[m; S{,[M; S{.@}-[m; S{:!{-M; S{<m; S}"[M; S}-[m; S}>M; S}>[=[\[ m; S}>[M; nUe; nuE; pEE; s#[m; s#},M; s#};.;["#m; s$[m; s$m; s&[m; s'[,[m; s(M; s,[M; s,[m; s;[M; s;[{m; s@}<}![M; sM; s[ [m; s[![m; s[#[M; s[.m; s[/[m; s[<[M; s[?M; s[@}>,})[=m; s[M; s[m; s[~[m; s]m; s`}%M; sm; s{*[=m; s{,}.m; s{=["[m; s{@`[![m; s{@{?[M; s}*[.m
9mm; GAy; Gun; K [R; K!|`>[r; K#[r; K%[R; K%[r; K([,)}:{@R; K<R; K[!R; K['R; K[*r; K[R; K[][_[R; K[_}@{@R; K[`< [r; K[`[<[R; K[r; K[{{$-[r; K{"R; K{*[R; K{*r; K{,r; K{:[)][r; K{=R; K{>R; K}%[R; K},R; K}={-[r; NuE; PEE; S!M; S"[m; S$m; S%<[m; S%[m; S%}?m; S(M; S);M; S)[M; S=}%M; S=}:{"[M; S?@M; SM; S[%[M; S[*,[m; S[=[(M; S[?['m; S[M; S[^[m; S[m; Sm; S{"[:[m; S{#@}-M; S{&[,=[m; S{,[m; S{<M; S} }>m; S}$[|[-m; S}([M; S}--[m; S};M; S};[m; S}=[M; S}>M; gaY; k [r; k![r; k%[r; k'[)[R; k)[_{&[%)R; k*[r; k=r; k?R; k@\R; k[%r; k[.[R; k[;{@} ,)}'}=-[~}!R; k[=[:r; k[@[*R; k[R; k[\[r; k[r; k\[r; k{%[R; k{(R; k{([R; k{)[r; k{;}#}#R; k{>[~&{#r; k{>r; k{@$} [r; k}![R; k}![r; k},R; k}.R; k}='{)/[R; k}@R; k}@}, [R; nUE; pEE; peE; s =M; s!m; sM; s[$)[m; s['[<m; s[*[m; s[/[M; s[:[{[|[M; s[?{&m; s[M; s[m; sm; s{#[M; s{&M; s} M; s}*m; s}-[M; s}-[m; s}=[M
August 4, 2015 at 12:26 pm
halifaxdal (7/31/2015)
The data I am processing is normal in reality but a bit messy in SQL world.I have a table which contains a field to be parsed out possible filename and then its extension, sounds easy, right? no! here is some of the records for the field:
...
...
The table is generated hourly, weekly analysis is required and for a week it contains roughly 2-3 millions records.
I'm seeking your enlightening to see what would be the best approach to this, any suggestion is appreciated
The usage of LIKE comparison operator has already been suggested above. What I'm demonstrating below is an alternate approach of how to join your table with a set of blacklisted words and phrases.
First, I'll setup a table based on my assumption of how it basically works.
create table BreachLog
(
breach_content varchar(8000) not null
);
insert into BreachLog values
('e:\deploy\applications\intel management engine interface and serial over lan driver (sol) driver 7.1.2.1041v2\me_sw\mewmiprov\me\cim_schema\system\cim_computersystemdma.mof'),
('c:\$recycle.bin\s-1-5-21-3125639655-2069970247-2443061104-29869\$iqzvjd6.jpg'),
('c:\users\asdf\music\the cranberrieso need to argue\02 i can''t be with you.mp3'),
('e:\oracle\epm\docs\epm~1.111\wa_user\about_navigating_data_objects.html'),
('Marijuana; Suicide; ***; butt; don’t say anything; drug; kick; knife; knives; marijuana; murder; naked; opiate; party;'),
('prick; scam; sex; smoke; smoke a joint; smoking; sneak; speculation; stabbed; stoned; suck; tripping'),
('01 Jan. 1963; 01 SEP 1947; 01/01/1964; 01/01/58; 01/01/64; 01/01/65; 01/01/67; 01/01/68; 01/01/69; 01/01/71; 01/01/72; 01/01/73; 01/01/75; 01/01/78; 01/01/80; 01/01/82; 01/01/85; 01/02/62; 01/02/64; 01/02/66; 01/03/61; 01/03/64; 01/03/65; 01/04/66; 01/05/50; 01/05/72; 01/06/72; 01/06/76; 01/06/80; 01/07/71; 01/08/70; 01/09/79; 01/10/61; 01/10/76; 01/11/75; 01/12/73; 01/25/32; 02/01/72; 02/02/33; 02/02/63; 02/03/65; 02/03/84; 02/04/66; 02/05/69; 02/05/70;');
Next, I'll load a set of black listed words into a reference table. These 'words' are actually LIKE pattern matching expressions. You can tweak the members of this table however you need.
create table BlackList
(
black_word varchar(80) not null primary key
);
insert into BlackList values
('%.wmv'),('%.MPG%'),('%.mkv'),('%.mpeg'),('%.ogg'),('%.m4v'),
('%.3gp'),('%.AVI%'),('%.jpeg'),('%.tiff'),('%.asf'),('%.wav'),
('%.ogv'),('%.tif'),('%.flv'),('%.MP3'),('%.jpg'),('%.wma'),('%.oga');
Now you can simply inner join BreachLog with BlackList using the LIKE operator, and the members of the BlackList table can be maintained going forward using an application.
SELECT breach_content, black_word
FROM BreachLog
JOIN BlackList on BreachLog.breach_content LIKE BlackList.black_word;
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
August 4, 2015 at 1:09 pm
halifaxdal (8/4/2015)
Thanks Drew, your script failed as it also picks up records like:
I disagree. My script succeeded marvelously. It found records that you never would have found with the other approach, which is exactly what I claimed it would do. You can't expect code you get from the Internet to work perfectly without adapting it to your own specific situation.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply