August 20, 2014 at 11:03 am
Offhand, what's the best way to do this?
DelimitedSplit8k and PatternSplitLoop seem to have potential, but I'm just plain stuck on some things:
1. DelimitedSplit8k: the delimiter helps split the folder paths, but the pattern can be within the strings that result.
2. PatternSplitLoop: I would have to cross apply 16 times and have an awful WHERE clause to determine which of the four strings matched first.
Unless I'm missing something. Short example is below.
WITH testctes (string, pattern) AS (
SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein', 'his first' UNION ALL
SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato', 'hers first' UNION ALL
SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers', 'mine first' UNION ALL
SELECT 'more_stuff.yours\mine\hers\his_falafel', 'yours first'
)
SELECT string, pattern, ca.item, ca.itemnumber
FROM testctes
CROSS APPLY [dbo].[PatternSplitLoop] (string, '%his%') ca
August 20, 2014 at 11:43 am
Okay, now that's just a really cool puzzle. The following uses the "Swiss Army Knife" that is Jeff's DelimitedSplit8K function.
WITH cteText(string) AS (
SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL
SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL
SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL
SELECT 'more_stuff.yours\mine\hers\his_falafel'
),
cteSplits AS (
SELECT cteText.string, s3.ItemNumber, s3.Item,
ROW_NUMBER() OVER(PARTITION BY cteText.string ORDER BY cteText.string, s1.ItemNumber, s2.ItemNumber, s3.ItemNumber) position
FROM cteText
CROSS APPLY DelimitedSplit8K(string, '.') s1
CROSS APPLY DelimitedSplit8K(s1.item, '\') s2
CROSS APPLY DelimitedSplit8K(s2.item, '_') s3
),
cteFirsts(string, position) as (
SELECT string, MIN(position)
FROM cteSplits
WHERE Item IN ('his', 'hers', 'mine')
GROUP BY string
)
SELECT s.string, s.Item, s.position
FROM cteSplits s
INNER JOIN cteFirsts f ON f.string = s.string
and f.position = s.position
ORDER BY s.Item;
cteText defines the values you want to parse.
cteSplits uses Jeff's DelimitedSplit8K to split them out by your delimiters. You'll have to CROSS APPLY any other delimiters you need.
cteFirsts determines the first occurrence of any string you're looking for in each split set.
The query then joins cteSplits and cteFirsts together to get the string, matched item and position number.
I hope this is what you're looking for, but I have to admit that I had fun with it.
August 20, 2014 at 12:22 pm
Hi Ed,
Thanks, that looks like it will get me most of the way there. I made a dumb, though, and just realized that the column the strings are stored in is a VARCHAR(MAX), which is noted to slow down the function wonder that is DelimitedSplit8k.
I've got it running now, but it's going on the 10 minute mark. That snarky dev with his IF...CHARINDEX that goes on for a page might have this beat :crazy:
Thanks
August 20, 2014 at 12:57 pm
I'm not sure if this can work for you. Seems simple but I can't guarantee performance.
WITH cteText(string) AS (
SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL
SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL
SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL
SELECT 'more_stuff.yours\mine\hers\his_falafel'
)
SELECT string, pattern
FROM cteText
CROSS APPLY (SELECT TOP 1 pattern
FROM (VALUES( 'his', CHARINDEX('his', string)),
( 'hers', CHARINDEX('hers', string)),
( 'mine', CHARINDEX('mine', string)),
( 'yours', CHARINDEX('yours', string)))x( pattern, position)
ORDER BY position)y
August 20, 2014 at 1:31 pm
Luis, nice work.
sqldriver, I would give Luis's solution a try instead. It operates on the base strings themselves and doesn't split them out into their component parts. As you're using MAX data types (which don't like to be joined to, hence the slowdown of DelimitedSplit8K) you'll likely get better performance from Luis's solution. Using CHARINDEX on the base string instead of splitting them out is a better solution if you're not looking for isolated strings between delimiters.
August 20, 2014 at 1:38 pm
I just wanted to note that the performance of DelimitedSplit8K function won't be affected if it receives a varchar(max) as input. It will, however, truncate the string to a 8000 characters if the string is longer than that. The performance problem would occur if you change the definition of the function.
Truncation shouldn't be an issue if you're looking for the first appearance of a pattern.
August 20, 2014 at 1:49 pm
Good point, Luis, but your solution still performs better. I've tried to get the DelimitedSplit8K function to handle MAX data types, and they simply don't like being joined to at all. I've tested it extensively using MAX data types that are less than 8K in size with identical data and the real difference is the data type itself.
I did a performance comparison of the two approaches and you should definitely uses Luis's instead of mine. I created a physical table using a single Varchar(MAX) column with no indexing and no keys.
On 10,000 rows, mine took 2750 ms and Luis's took 139 ms.
Over 100,000 rows, mint took 33250 ms and Luis's took 538 ms.
The bottom line is that if you're not looking to match isolated strings (based on delimiter), Luis's solution is the clear winner.
August 20, 2014 at 1:50 pm
That runs fast, Luis, but I'm getting strange results. I would imagine it's something to do with the strings I'm actually looking for:
CROSS APPLY (SELECT TOP 1 pattern
FROM (VALUES( '_pim', CHARINDEX('[_]pim', PATH)),
( '_him', CHARINDEX('[_]him', PATH)),
( '.b.', CHARINDEX('.b.', PATH)),
( '.ib', CHARINDEX('.ib', PATH)),
( 'IM Conversation', CHARINDEX('IM Conversation', path))
)x( pattern, position)
I seem to get _him as the matched pattern, regardless of if it's first, or even in the string I'm searching. Any ideas?
Thanks
August 20, 2014 at 1:57 pm
Here's the whole query, if that helps. I would need some time to mask sample data well. These paths have a lot of client and employee identifying information in them.
WITH cteString ( ArtifactID, PATH )
AS ( SELECT ArtifactID ,
PATH
FROM [Document]
WHERE PATH IS NOT NULL
)
SELECT ArtifactID ,
PATH ,
y.pattern ,
CASE y.pattern
WHEN '_pim' THEN 'PublicIM'
WHEN '_him' THEN 'HubIMs'
WHEN '.b.' THEN 'B Email'
WHEN '.ib' THEN 'B Chats'
WHEN 'IB Conversation' THEN 'B Chats'
ELSE 'Check Subject and Record type etc.'
END AS [Datatype]
FROM cteString
CROSS APPLY ( SELECT TOP 1
pattern
FROM ( VALUES
( '_pim', CHARINDEX('[_]pim', PATH)),
( '_him', CHARINDEX('[_]him', PATH)),
( '.b.', CHARINDEX('.b.', PATH)),
( '.ib', CHARINDEX('.ib', PATH)),
( 'IM Conversation', CHARINDEX('IM Conversation',
path)) ) x ( pattern, position ) ORDER BY position
) y
August 20, 2014 at 2:03 pm
It might be because you don't need the brackets to escape the underscore.
The other problem is because if the pattern is not found, then you'll get a zero as the position. To prevent this, you can change the ORDER BY to something like this:
ORDER BY CASE WHEN position = 0 THEN 999999 ELSE position END
August 20, 2014 at 2:14 pm
I'd also remove the CASE (which doesn't seems to match the pattern search) and include it in the APPLY results.
WITH cteString ( ArtifactID, PATH )
AS ( SELECT ArtifactID ,
PATH
FROM [Document]
WHERE PATH IS NOT NULL
)
SELECT ArtifactID ,
PATH ,
y.pattern ,
y.Datatype
FROM cteString
CROSS APPLY ( SELECT TOP 1
pattern, Datatype
FROM ( VALUES
( 'PublicIM', '_pim', CHARINDEX('[_]pim', PATH)),
( 'HubIMs', '_him', CHARINDEX('[_]him', PATH)),
( 'B Email', '.b.', CHARINDEX('.b.', PATH)),
( 'B Chats', '.ib', CHARINDEX('.ib', PATH)),
( 'Bloomberg Chats', 'IB Conversation', CHARINDEX('IB Conversation',
path)) ) x (Datatype, pattern, position )
ORDER BY CASE WHEN position = 0 THEN 999999 ELSE position END
) y
August 20, 2014 at 2:15 pm
Ah, look at that. Even better is filtering out where position > 0. That rules and runs super fast.
Hope we never lose you to any Oracle forums :hehe:
Thanks
August 20, 2014 at 2:22 pm
I was just going to mention that. Here's a revised version to include strings with no matches at all using OUTER APPLY.
WITH cteString ( ArtifactID, PATH )
AS ( SELECT ArtifactID ,
PATH
FROM [Document]
WHERE PATH IS NOT NULL
)
SELECT ArtifactID ,
PATH ,
y.pattern ,
ISNULL( y.Datatype, 'Check Subject and Record type etc.') AS Datatype
FROM cteString
OUTER APPLY ( SELECT TOP 1
pattern, Datatype
FROM ( VALUES
( 'PublicIM', '_pim', CHARINDEX('_pim', PATH)),
( 'HubIMs', '_him', CHARINDEX('_him', PATH)),
( 'B Email', '.b.', CHARINDEX('.b.', PATH)),
( 'B Chats', '.ib', CHARINDEX('.ib', PATH)),
( 'Bloomberg Chats', 'IB Conversation', CHARINDEX('IB Conversation',
path)) ) x (Datatype, pattern, position )
WHERE position > 0
ORDER BY position
) y
EDIT: Oracle? :crazy::sick:
August 24, 2014 at 7:36 pm
I confess to a bit of lack of understanding here (meaning in what you expect your output to be), but since you mentioned PatternSplitLoop (presumably the one from my article) I felt obliged to reply in some fashion.
Are you looking for a solution something like this? This lists either his, hers or mine, whichever appears first (as a word) in the string. A word being any string bounded by non-letters.
WITH cteText(string) AS (
SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL
SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL
SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL
SELECT 'more_stuff.yours\mine\hers\his_falafel'
)
SELECT string, item
FROM
(
SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)
FROM cteText a
CROSS APPLY dbo.PatternSplitCM(string, '[a-z]') b
WHERE [Matched]=1 AND item IN ('his', 'hers', 'mine')
) a
WHERE rn=1;
Two points here though:
- If you're referring to the PatternSplitLoop function I think you are, use PatternSplitCM as it is much faster.
- PatternSplitCM will probably suffer the same performance degradation using VARCHAR(MAX) that DelimitedSplit8K has.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
August 24, 2014 at 8:05 pm
To get around the performance hit on the VARCHAR(MAX), you may be able to create a MAX version of the splitter you end up using and process your data something like this:
WITH cteText(string) AS (
SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL
SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL
SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL
SELECT 'more_stuff.yours\mine\hers\his_falafel'
)
SELECT string, item
FROM
(
SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)
FROM cteText a
CROSS APPLY dbo.PatternSplitCM(string, '[a-z]') b
WHERE LEN(string) <= 8000 AND [Matched]=1 AND item IN ('his', 'hers', 'mine')
) a
WHERE rn=1
UNION ALL
SELECT string, item
FROM
(
SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)
FROM cteText a
CROSS APPLY dbo.PatternSplitMAX(string, '[a-z]') b
WHERE LEN(string) > 8000 AND [Matched]=1 AND item IN ('his', 'hers', 'mine')
) a
WHERE rn=1;
Emphasis on the "maybe."
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 15 posts - 1 through 15 (of 19 total)
You must be logged in to reply to this topic. Login to reply