Determine which pattern comes first in a string

Question

Determine which pattern comes first in a string

sqldriver

SSChampion

Points: 10440
More actions
August 20, 2014 at 11:03 am

#294850

Offhand, what's the best way to do this?
DelimitedSplit8k and PatternSplitLoop seem to have potential, but I'm just plain stuck on some things:
1. DelimitedSplit8k: the delimiter helps split the folder paths, but the pattern can be within the strings that result.
2. PatternSplitLoop: I would have to cross apply 16 times and have an awful WHERE clause to determine which of the four strings matched first.
Unless I'm missing something. Short example is below.
WITH testctes (string, pattern) AS (
SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein', 'his first' UNION ALL
SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato', 'hers first' UNION ALL
SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers', 'mine first' UNION ALL
SELECT 'more_stuff.yours\mine\hers\his_falafel', 'yours first'
)
SELECT string, pattern, ca.item, ca.itemnumber
FROM testctes
CROSS APPLY [dbo].[PatternSplitLoop] (string, '%his%') ca

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

Ed Wagner SSC Guru Points: 287026 More actions · Answer 1

Okay, now that's just a really cool puzzle. The following uses the "Swiss Army Knife" that is Jeff's DelimitedSplit8K function.

WITH cteText(string) AS (

SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

SELECT 'more_stuff.yours\mine\hers\his_falafel'

),

cteSplits AS (

SELECT cteText.string, s3.ItemNumber, s3.Item,

ROW_NUMBER() OVER(PARTITION BY cteText.string ORDER BY cteText.string, s1.ItemNumber, s2.ItemNumber, s3.ItemNumber) position

FROM cteText

CROSS APPLY DelimitedSplit8K(string, '.') s1

CROSS APPLY DelimitedSplit8K(s1.item, '\') s2

CROSS APPLY DelimitedSplit8K(s2.item, '_') s3

),

cteFirsts(string, position) as (

SELECT string, MIN(position)

FROM cteSplits

WHERE Item IN ('his', 'hers', 'mine')

GROUP BY string

)

SELECT s.string, s.Item, s.position

FROM cteSplits s

INNER JOIN cteFirsts f ON f.string = s.string

and f.position = s.position

ORDER BY s.Item;

cteText defines the values you want to parse.

cteSplits uses Jeff's DelimitedSplit8K to split them out by your delimiters. You'll have to CROSS APPLY any other delimiters you need.

cteFirsts determines the first occurrence of any string you're looking for in each split set.

The query then joins cteSplits and cteFirsts together to get the string, matched item and position number.

I hope this is what you're looking for, but I have to admit that I had fun with it.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

sqldriver SSChampion Points: 10440 More actions · Answer 2

Hi Ed,

Thanks, that looks like it will get me most of the way there. I made a dumb, though, and just realized that the column the strings are stored in is a VARCHAR(MAX), which is noted to slow down the ~~function~~ wonder that is DelimitedSplit8k.

I've got it running now, but it's going on the 10 minute mark. That snarky dev with his IF...CHARINDEX that goes on for a page might have this beat :crazy:

Thanks

Luis Cazares SSC Guru Points: 183706 More actions · Answer 3

I'm not sure if this can work for you. Seems simple but I can't guarantee performance.

WITH cteText(string) AS (

SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

SELECT 'more_stuff.yours\mine\hers\his_falafel'

)

SELECT string, pattern

FROM cteText

CROSS APPLY (SELECT TOP 1 pattern

FROM (VALUES( 'his', CHARINDEX('his', string)),

( 'hers', CHARINDEX('hers', string)),

( 'mine', CHARINDEX('mine', string)),

( 'yours', CHARINDEX('yours', string)))x( pattern, position)

ORDER BY position)y

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Ed Wagner SSC Guru Points: 287026 More actions · Answer 4

Luis, nice work.

sqldriver, I would give Luis's solution a try instead. It operates on the base strings themselves and doesn't split them out into their component parts. As you're using MAX data types (which don't like to be joined to, hence the slowdown of DelimitedSplit8K) you'll likely get better performance from Luis's solution. Using CHARINDEX on the base string instead of splitting them out is a better solution if you're not looking for isolated strings between delimiters.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Luis Cazares SSC Guru Points: 183706 More actions · Answer 5

I just wanted to note that the performance of DelimitedSplit8K function won't be affected if it receives a varchar(max) as input. It will, however, truncate the string to a 8000 characters if the string is longer than that. The performance problem would occur if you change the definition of the function.

Truncation shouldn't be an issue if you're looking for the first appearance of a pattern.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Ed Wagner SSC Guru Points: 287026 More actions · Answer 6

Good point, Luis, but your solution still performs better. I've tried to get the DelimitedSplit8K function to handle MAX data types, and they simply don't like being joined to at all. I've tested it extensively using MAX data types that are less than 8K in size with identical data and the real difference is the data type itself.

I did a performance comparison of the two approaches and you should definitely uses Luis's instead of mine. I created a physical table using a single Varchar(MAX) column with no indexing and no keys.

On 10,000 rows, mine took 2750 ms and Luis's took 139 ms.

Over 100,000 rows, mint took 33250 ms and Luis's took 538 ms.

The bottom line is that if you're not looking to match isolated strings (based on delimiter), Luis's solution is the clear winner.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

sqldriver SSChampion Points: 10440 More actions · Answer 7

That runs fast, Luis, but I'm getting strange results. I would imagine it's something to do with the strings I'm actually looking for:

CROSS APPLY (SELECT TOP 1 pattern

FROM (VALUES( '_pim', CHARINDEX('[_]pim', PATH)),

( '_him', CHARINDEX('[_]him', PATH)),

( '.b.', CHARINDEX('.b.', PATH)),

( '.ib', CHARINDEX('.ib', PATH)),

( 'IM Conversation', CHARINDEX('IM Conversation', path))

)x( pattern, position)

I seem to get _him as the matched pattern, regardless of if it's first, or even in the string I'm searching. Any ideas?

Thanks

sqldriver SSChampion Points: 10440 More actions · Answer 8

Here's the whole query, if that helps. I would need some time to mask sample data well. These paths have a lot of client and employee identifying information in them.

WITH cteString ( ArtifactID, PATH )

AS ( SELECT ArtifactID ,

PATH

FROM [Document]

WHERE PATH IS NOT NULL

)

SELECT ArtifactID ,

PATH ,

y.pattern ,

CASE y.pattern

WHEN '_pim' THEN 'PublicIM'

WHEN '_him' THEN 'HubIMs'

WHEN '.b.' THEN 'B Email'

WHEN '.ib' THEN 'B Chats'

WHEN 'IB Conversation' THEN 'B Chats'

ELSE 'Check Subject and Record type etc.'

END AS [Datatype]

FROM cteString

CROSS APPLY ( SELECT TOP 1

pattern

FROM ( VALUES

( '_pim', CHARINDEX('[_]pim', PATH)),

( '_him', CHARINDEX('[_]him', PATH)),

( '.b.', CHARINDEX('.b.', PATH)),

( '.ib', CHARINDEX('.ib', PATH)),

( 'IM Conversation', CHARINDEX('IM Conversation',

path)) ) x ( pattern, position ) ORDER BY position

) y

Luis Cazares SSC Guru Points: 183706 More actions · Answer 9

It might be because you don't need the brackets to escape the underscore.

The other problem is because if the pattern is not found, then you'll get a zero as the position. To prevent this, you can change the ORDER BY to something like this:

ORDER BY CASE WHEN position = 0 THEN 999999 ELSE position END

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Luis Cazares SSC Guru Points: 183706 More actions · Answer 10

I'd also remove the CASE (which doesn't seems to match the pattern search) and include it in the APPLY results.

WITH cteString ( ArtifactID, PATH )

AS ( SELECT ArtifactID ,

PATH

FROM [Document]

WHERE PATH IS NOT NULL

)

SELECT ArtifactID ,

PATH ,

y.pattern ,

y.Datatype

FROM cteString

CROSS APPLY ( SELECT TOP 1

pattern, Datatype

FROM ( VALUES

( 'PublicIM', '_pim', CHARINDEX('[_]pim', PATH)),

( 'HubIMs', '_him', CHARINDEX('[_]him', PATH)),

( 'B Email', '.b.', CHARINDEX('.b.', PATH)),

( 'B Chats', '.ib', CHARINDEX('.ib', PATH)),

( 'Bloomberg Chats', 'IB Conversation', CHARINDEX('IB Conversation',

path)) ) x (Datatype, pattern, position )

ORDER BY CASE WHEN position = 0 THEN 999999 ELSE position END

) y

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

sqldriver SSChampion Points: 10440 More actions · Answer 11

Ah, look at that. Even better is filtering out where position > 0. That rules and runs super fast.

Hope we never lose you to any Oracle forums :hehe:

Thanks

Luis Cazares SSC Guru Points: 183706 More actions · Answer 12

I was just going to mention that. Here's a revised version to include strings with no matches at all using OUTER APPLY.

WITH cteString ( ArtifactID, PATH )

AS ( SELECT ArtifactID ,

PATH

FROM [Document]

WHERE PATH IS NOT NULL

)

SELECT ArtifactID ,

PATH ,

y.pattern ,

ISNULL( y.Datatype, 'Check Subject and Record type etc.') AS Datatype

FROM cteString

OUTER APPLY ( SELECT TOP 1

pattern, Datatype

FROM ( VALUES

( 'PublicIM', '_pim', CHARINDEX('_pim', PATH)),

( 'HubIMs', '_him', CHARINDEX('_him', PATH)),

( 'B Email', '.b.', CHARINDEX('.b.', PATH)),

( 'B Chats', '.ib', CHARINDEX('.ib', PATH)),

( 'Bloomberg Chats', 'IB Conversation', CHARINDEX('IB Conversation',

path)) ) x (Datatype, pattern, position )

WHERE position > 0

ORDER BY position

) y

EDIT: Oracle? :crazy::sick:

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Dwain Camps SSC Guru Points: 86978 More actions · Answer 13

I confess to a bit of lack of understanding here (meaning in what you expect your output to be), but since you mentioned PatternSplitLoop (presumably the one from my article) I felt obliged to reply in some fashion.

Are you looking for a solution something like this? This lists either his, hers or mine, whichever appears first (as a word) in the string. A word being any string bounded by non-letters.

WITH cteText(string) AS (

SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

SELECT 'more_stuff.yours\mine\hers\his_falafel'

)

SELECT string, item

FROM

(

SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)

FROM cteText a

CROSS APPLY dbo.PatternSplitCM(string, '[a-z]') b

WHERE [Matched]=1 AND item IN ('his', 'hers', 'mine')

) a

WHERE rn=1;

Two points here though:

- If you're referring to the PatternSplitLoop function I think you are, use PatternSplitCM as it is much faster.

- PatternSplitCM will probably suffer the same performance degradation using VARCHAR(MAX) that DelimitedSplit8K has.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 14

To get around the performance hit on the VARCHAR(MAX), you may be able to create a MAX version of the splitter you end up using and process your data something like this:

WITH cteText(string) AS (

SELECT 'oh_look_at_this.thing.his\hers_stuff\mine.craft\yours_protein' UNION ALL

SELECT 'i.am.a._thing.hers\his_things\mine.refrigerator\yours_potato' UNION ALL

SELECT 'path_like.things_mine\hers.some_else\his_garbage\yours_sneakers' UNION ALL

SELECT 'more_stuff.yours\mine\hers\his_falafel'

)

SELECT string, item

FROM

(

SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)

FROM cteText a

CROSS APPLY dbo.PatternSplitCM(string, '[a-z]') b

WHERE LEN(string) <= 8000 AND [Matched]=1 AND item IN ('his', 'hers', 'mine')

) a

WHERE rn=1

UNION ALL

SELECT string, item

FROM

(

SELECT string, item, rn=ROW_NUMBER() OVER (PARTITION BY string ORDER BY ItemNumber)

FROM cteText a

CROSS APPLY dbo.PatternSplitMAX(string, '[a-z]') b

WHERE LEN(string) > 8000 AND [Matched]=1 AND item IN ('his', 'hers', 'mine')

) a

WHERE rn=1;

Emphasis on the "maybe."

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St