Pattern matching (positive across records, negative within a field)

Question

Pattern matching (positive across records, negative within a field)

aaron.reese

SSChampion

Points: 13498
More actions
January 6, 2015 at 2:52 am

#294054

Hi Guys,
I am doing some analysis on our customer base and their payment profiles. I have generated two profile strings, one for whether the balance of an account has gone up or down and one for the size of the balance in relation to the normal invoice amount for the customer. So (for example) the balance movement string will look like this:
UUUDUUUDUUUD-D00 Where U = Up, D = Down, - = no change and 0 = no change and no balance
I want to analyse these strings in two ways. The first is that I want to find customers with a similar pattern: in the example below the first and last patterns are the same, just one out of sync but should be considered the same
Movement Multiple CountRecords
UUUDUUUDUUUD1230123012301175
------------0000000000001163
UDUUUDUUUDUU3012301230121082
The second type of analysis is to find customers whose pattern has changed: in the examples above the patterns are repeated and therefore 'normal' in the records below the patterns have changed in that the first part does not match the second part.
Movement Multiple CountRecords
UUDUUUDUUUUU-----------07
UDUUUDUUUUUU------------7
Can anyone suggest a good way to approach this without either a cursor or a hidden REBAR. The challenge as I see it is that I have to interrogate every string to find out if there is a repeating pattern and if so where it starts and how long it is (heuristic because some strings will start with a repeating pattern and then the pattern may change or deteriorate) and then compare the string for N groups of repeating characters to see if and when it changes and I can't think of an efficient method to do this in SQL because it is not a set based operation
Any help would be much appreciated
Aaron

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Nevyn SSChampion Points: 14210 More actions · Answer 1

Are there any constraints at all on how many changes can be considered a pattern, or to help identify the start/end of a new pattern iteration?

Failing that (or maybe even better than that), do the patterns all begin/end within a billing cycle that can be identified with a date field?

Failing that, are there standard patterns that you are looking for, or are you looking to match any pattern with any pattern and find all patterns?

In short, please make the problem smaller for us

aaron.reese SSChampion Points: 13498 More actions · Answer 2

@Nevyn

Unfortunately that is the simplified version 😀

Let's concentrate on the first part: finding records with the same repeating patterns but in different time phases.

A manual scan of the data shows that the most common payment pattern is UUUD where D represents the payment of 4 weeks invoices. The problem is that Customer A pays in weeks 4,8,12 and Customer B pays in weeks 1,5,9,11 The patterns are the same but the strings are different. I need to identify both strings as having the repeating UUUD pattern, identify the number of times it repeats and confirm that the straggler characters at the beginning and end of the string match the end and beginning of the previous and subsequent cycles.

Some customers pay every two weeks, some pay every five, some pay every month so effectively on a 4-4-5 cycle (hence why I need a minimum of 26 weeks data!)

There are also a large number of customers where the pattern is inconsistent or changes (e.g. they were running every 4 weeks for 3 cycles and now they have not made a payment for 8 weeks and are in arrears)

The aim is to separate customers who are in arrears but it is part on their normal payment cycle so are of no concern from those who either have no pattern or the pattern has changed, indicating that they may have a cash flow problem and need additional assistance (this is social housing rent)

The aim is to profile the customers that have gotten into trouble and check the rest of the cohort for similar customers so that any deliquent behaviour can trigger corrective action earlier and focus limited resources where they will do the most good.

I think that what I need is Ukkonen's Suffix Tree Algorithm

www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring[/url]

There are examples available in C# and Javascript but I haven't had time to digest the theory or the code and might struggle to convert it to T-SQL: even if I do then it is a multi pass operation on the string to convert it into n branches with m nodes and then more passes to traverse the tree to find the most repeated and longest repeated patterns' I have about 85K records if I use 13 weeks of history and probably 120K if I use 26 weeks of history - thats a lot of data on a RBAR type query (user defined function)

If C# versions already exist, could it be written as a CLR and called directly from the query? - that may be substantially quicker than doing it in SQL

ScottPletcher SSC Guru Points: 101118 More actions · Answer 3

First thought for first task:

Compare one customer's pattern to the concatenated pattern of another customer(s). This should alleviate the off-by-n issue. Naturally you could also reduce the repeating bytes of the second pattern to fewer than the entire string:

SELECT *

FROM (

SELECT '123012301230' AS pattern

) AS t1

CROSS JOIN (

SELECT '301230123012' AS pattern

) AS t2

WHERE

CHARINDEX(t1.pattern, t2.pattern + t2.pattern) > 0

SELECT *

FROM (

SELECT '123012301230' AS pattern

) AS t1

CROSS JOIN (

SELECT '301230123012' AS pattern

) AS t2

WHERE

CHARINDEX(t1.pattern, t2.pattern + LEFT(t2.pattern, 4)) > 0

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Dwain Camps SSC Guru Points: 86978 More actions · Answer 4

Very interesting problem there Aaron! Allow me to offer what I can here.

My interpretation of this is that it reminds me of Itzik Ben-Gan's subsequences challenge: http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-3, albeit it is necessary to convert your strings into rows to do what I'm about to do. I'm thinking that you may have already had to go through some steps to construct the character strings, so you would not need to do that if you had the raw data as rows.

Let's suppose you have a set of fixed patterns that you want to search for in the data (@Pattern), then you can use the methods described in that article to identify when subsequences of characters appear consecutively.

DECLARE @Pattern VARCHAR(10) = 'UUUD';

WITH SampleData (id, pattern) AS

(

SELECT 1, 'UUUDUUUDUUUD'

UNION ALL SELECT 2, 'UDUUUDUUUDUU'

UNION ALL SELECT 3, 'UUDUUUDUUUDU'

UNION ALL SELECT 4, 'UUUDUUUDUUUD'

),

Tally (n) AS

(

SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns

),

Pattern (n, p) AS

(

SELECT n, SUBSTRING(@Pattern, n, 1)

FROM Tally

WHERE n <= LEN(@Pattern)

)

SELECT id, startpos=MIN(b.n), endpos=MAX(b.n)

FROM SampleData a

CROSS APPLY

(

SELECT n, s=SUBSTRING(pattern, n, 1)

FROM Tally

WHERE n <= LEN(pattern)

) b

JOIN Pattern c

ON b.s = c.p

GROUP BY id, b.n - c.n

HAVING COUNT(*) = LEN(@Pattern)

ORDER BY ID, startpos;

In the above, I've used Peter Larsson's brilliantly simple approach even though mine was slightly faster (but more complex) to illustrate what I'm doing. These are the results:

id startpos endpos

1 1 4

1 5 8

1 9 12

2 3 6

2 7 10

3 4 7

3 8 11

4 2 5

4 6 9

This tells you that id=1 has 3 matches to the pattern and id=2 has 2.

You may also notice that because your strings are 12 characters and the pattern is 4 characters, the most matches you can get is 3 and if you get 2 it is highly likely that the pattern is repeating (although you'd need to check leading/trailing characters to be sure).

Clearly this is not a complete solution but perhaps it is relevant enough to give you a starting point. I'm also not sure whether you can establish what patterns you want to look for in advance.

I'm not sure if you can use this for the second part of your analysis, but perhaps.

Edit: Expanded the sample data to 4 ids.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Nevyn SSChampion Points: 14210 More actions · Answer 5

aaron.reese (1/6/2015)
@nevyn
Unfortunately that is the simplified version 😀
Let's concentrate on the first part: finding records with the same repeating patterns but in different time phases.
A manual scan of the data shows that the most common payment pattern is UUUD where D represents the payment of 4 weeks invoices. The problem is that Customer A pays in weeks 4,8,12 and Customer B pays in weeks 1,5,9,11 The patterns are the same but the strings are different. I need to identify both strings as having the repeating UUUD pattern, identify the number of times it repeats and confirm that the straggler characters at the beginning and end of the string match the end and beginning of the previous and subsequent cycles.
Some customers pay every two weeks, some pay every five, some pay every month so effectively on a 4-4-5 cycle (hence why I need a minimum of 26 weeks data!)
There are also a large number of customers where the pattern is inconsistent or changes (e.g. they were running every 4 weeks for 3 cycles and now they have not made a payment for 8 weeks and are in arrears)
The aim is to separate customers who are in arrears but it is part on their normal payment cycle so are of no concern from those who either have no pattern or the pattern has changed, indicating that they may have a cash flow problem and need additional assistance (this is social housing rent)
The aim is to profile the customers that have gotten into trouble and check the rest of the cohort for similar customers so that any deliquent behaviour can trigger corrective action earlier and focus limited resources where they will do the most good.
I think that what I need is Ukkonen's Suffix Tree Algorithm
www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring[/url]
There are examples available in C# and Javascript but I haven't had time to digest the theory or the code and might struggle to convert it to T-SQL: even if I do then it is a multi pass operation on the string to convert it into n branches with m nodes and then more passes to traverse the tree to find the most repeated and longest repeated patterns' I have about 85K records if I use 13 weeks of history and probably 120K if I use 26 weeks of history - thats a lot of data on a RBAR type query (user defined function)
If C# versions already exist, could it be written as a CLR and called directly from the query? - that may be substantially quicker than doing it in SQL

Ok, did you want to start with queries that look specifically for the UUUD pattern, or do you need to look for any pattern?

Dwain Camps SSC Guru Points: 86978 More actions · Answer 6

Nevyn (1/7/2015)
aaron.reese (1/6/2015)
@nevyn
Unfortunately that is the simplified version 😀
Let's concentrate on the first part: finding records with the same repeating patterns but in different time phases.
A manual scan of the data shows that the most common payment pattern is UUUD where D represents the payment of 4 weeks invoices. The problem is that Customer A pays in weeks 4,8,12 and Customer B pays in weeks 1,5,9,11 The patterns are the same but the strings are different. I need to identify both strings as having the repeating UUUD pattern, identify the number of times it repeats and confirm that the straggler characters at the beginning and end of the string match the end and beginning of the previous and subsequent cycles.
Some customers pay every two weeks, some pay every five, some pay every month so effectively on a 4-4-5 cycle (hence why I need a minimum of 26 weeks data!)
There are also a large number of customers where the pattern is inconsistent or changes (e.g. they were running every 4 weeks for 3 cycles and now they have not made a payment for 8 weeks and are in arrears)
The aim is to separate customers who are in arrears but it is part on their normal payment cycle so are of no concern from those who either have no pattern or the pattern has changed, indicating that they may have a cash flow problem and need additional assistance (this is social housing rent)
The aim is to profile the customers that have gotten into trouble and check the rest of the cohort for similar customers so that any deliquent behaviour can trigger corrective action earlier and focus limited resources where they will do the most good.
I think that what I need is Ukkonen's Suffix Tree Algorithm
www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring[/url]
There are examples available in C# and Javascript but I haven't had time to digest the theory or the code and might struggle to convert it to T-SQL: even if I do then it is a multi pass operation on the string to convert it into n branches with m nodes and then more passes to traverse the tree to find the most repeated and longest repeated patterns' I have about 85K records if I use 13 weeks of history and probably 120K if I use 26 weeks of history - thats a lot of data on a RBAR type query (user defined function)
If C# versions already exist, could it be written as a CLR and called directly from the query? - that may be substantially quicker than doing it in SQL
Ok, did you want to start with queries that look specifically for the UUUD pattern, or do you need to look for any pattern?

My solution can look for any list of patterns:

WITH SampleData (custid, pattern) AS

(

SELECT 1, 'UUUDUUUDUUUD'

UNION ALL SELECT 2, 'UDUUUDUUUDUU'

UNION ALL SELECT 3, 'UUDUUUDUUUDU'

UNION ALL SELECT 4, 'DUUUDUUUDUUU'

),

Patterns (patternid, pattern) AS

(

SELECT 1, 'UUUD'

UNION ALL SELECT 2, 'DUUU'

),

Tally (n) AS

(

SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns

),

Pattern AS

(

SELECT patternid, n, p=SUBSTRING(pattern, n, 1), pattern

FROM Patterns

CROSS APPLY Tally

WHERE n <= LEN(pattern)

)

SELECT custid, patternid, startpos=MIN(b.n), endpos=MAX(b.n)

FROM SampleData a

CROSS APPLY

(

SELECT n, s=SUBSTRING(a.pattern, n, 1)

FROM Tally

WHERE n <= LEN(a.pattern)

) b

JOIN Pattern c

ON b.s = c.p

GROUP BY custid, patternid, b.n - c.n

HAVING COUNT(*) = LEN(MAX(c.pattern))

ORDER BY custid, patternid, startpos;

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 7

Aaron,

High five to you for coming up with the most interesting challenge (at least to me) so far for the new year! So interesting in fact I couldn't let it go. So let's make the code I gave you a little more usable and show you how to create a set you can analyze. I've tried to document what I did with comments so will abstain from further explanation here.

I really hope you get back to me and let me know if any of this effort was not in vain!

DECLARE @Periods INT = 12; -- Probably 26 in your case

DECLARE @SampleData TABLE

(

custid INT PRIMARY KEY

,pattern VARCHAR(100)

);

-- Payment patterns

INSERT INTO @SampleData

-- A "perfectly cyclic" match to patternid = 1

SELECT 1, 'UUUDUUUDUUUD'

-- Three "cyclic" matches to patternid = 1

UNION ALL SELECT 2, 'UDUUUDUUUDUU'

UNION ALL SELECT 3, 'UUDUUUDUUUDU'

UNION ALL SELECT 4, 'DUUUDUUUDUUU'

-- A "cyclic" match to patternid = 2

UNION ALL SELECT 5, 'UUUUDUUUUDUU'

-- Matches patternid=1 but there was a delay on the second payment (2 weeks)

UNION ALL SELECT 6, 'UUUDUUUUUDUU'

-- A customer that matches none of the patterns

UNION ALL SELECT 7, 'UDUDUDUDUDUD'

;

DECLARE @Patterns TABLE

(

patternid INT PRIMARY KEY

,pattern VARCHAR(100)

);

-- Patterns to search for

INSERT INTO @Patterns

SELECT 1, 'UUUD'

UNION ALL SELECT 2, 'UUUUD';

DECLARE @PaymentAnalysis TABLE

(

custid INT NOT NULL

,patternid INT NOT NULL

,startpos INT NOT NULL

,endpos INT NOT NULL

,GapToPrior INT NOT NULL

,GapToNext INT NOT NULL

,LeadingMatch INT NULL

,TrailingMatch INT NULL

,PRIMARY KEY (custid, patternid, startpos)

);

-- Retrieve analysis set

WITH Tally (n) AS

(

SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns

),

-- This CTE isn't needed if you simply deduce the U or D from the change in payment balance

-- across rows that represent periods (i.e., throw away the profile string)

PaymentPatterns AS

(

SELECT custid, pattern, n, p=SUBSTRING(pattern, n, 1)

FROM @SampleData

CROSS APPLY Tally

WHERE n <= LEN(pattern)

),

SearchPatterns AS

(

SELECT patternid, pattern, n, s=SUBSTRING(pattern, n, 1)

FROM @Patterns

CROSS APPLY Tally

WHERE n <= LEN(pattern)

)

INSERT INTO @PaymentAnalysis

SELECT custid, patternid, startpos, endpos

-- Returns 1 if there is a gap between the prior matching string and this one

,GapToPrior = CASE LAG(endpos, 1, 0) OVER (PARTITION BY custid, patternid ORDER BY startpos)

WHEN startpos - 1 THEN 0

WHEN 0 THEN 0

ELSE 1

END

-- Returns 1 if there is a gap between the next matching string and this one

,GapToNext = CASE LEAD(startpos, 1, 0) OVER (PARTITION BY custid, patternid ORDER BY startpos)

WHEN endpos + 1 THEN 0

WHEN 0 THEN 0

ELSE 1

END

-- Match leading characters of payment pattern vs. trailing characters of search pattern

-- Note: All rows NULL if "perfectly cyclic" meaning 3 matches on a 4 char search string in a 12 char payment pattern

,LeadingMatch = CASE MIN(startpos) OVER (PARTITION BY custid, patternid)

-- When first match at position one, no need to check against trailing pattern chars

WHEN 1

THEN NULL

-- Match trailing search characters up to position one for the first startpos row

WHEN startpos

THEN CASE LEFT(PaymentPattern, startpos-1)

WHEN RIGHT(SearchPattern, startpos-1) THEN 1

ELSE 0

END

-- Ignore other rows

ELSE NULL

END

-- Match trailing characters of payment pattern vs. leading characters of search pattern

-- Note: All rows NULL if "perfectly cyclic" meaning 3 matches on a 4 char search string in a 12 char payment pattern

,TrailingMatch = CASE MAX(endpos) OVER (PARTITION BY custid, patternid)

-- When last match ends at the end of the payment pattern, no need to check against leading pattern chars

WHEN LEN(PaymentPattern)

THEN NULL

-- Match leading search characters against trailing payment characters

WHEN endpos

THEN CASE RIGHT(PaymentPattern, LEN(PaymentPattern)-(endpos+1))

WHEN LEFT(SearchPattern, LEN(PaymentPattern)-(endpos+1)) THEN 1

ELSE 0

END

-- Ignore other rows

ELSE NULL

END

FROM

(

SELECT custid, patternid, startpos=MIN(a.n), endpos=MAX(a.n)

,PaymentPattern = MAX(a.pattern)

,SearchPattern = MAX(b.pattern)

FROM PaymentPatterns a

JOIN SearchPatterns b

ON a.p = b.s -- payment pattern char = search pattern char

GROUP BY custid, patternid, a.n - b.n

HAVING COUNT(*) = LEN(MAX(b.pattern))

) a;

-- Here's the payment analysis set

SELECT *

FROM @PaymentAnalysis

ORDER BY custid, patternid, startpos;

-- List "perfectly cyclic" matches (i.e., on a 4 char search string in a 12 character pattern there are 3 rows)

SELECT custid, patternid

FROM @PaymentAnalysis a

--WHERE patternid = 1

GROUP BY custid, patternid

HAVING COUNT(*) = 1. * @Periods / (SELECT LEN(b.pattern) FROM @Patterns b WHERE a.patternid = b.patternid);

-- List "cyclic" matches, meaning leading or trailing characters match the search pattern

-- but the match count is one less than periods / length of the search pattern

SELECT custid, patternid

FROM @PaymentAnalysis a

--WHERE patternid = 1

GROUP BY custid, patternid

HAVING COUNT(*) = CEILING(1. * @Periods / (SELECT LEN(b.pattern) FROM @Patterns b WHERE a.patternid = b.patternid)) - 1 AND

MAX(GapToPrior) = 0 AND MAX(GapToNext) = 0 AND

MAX(ISNULL(LeadingMatch, 1)) = 1 AND MAX(ISNULL(TrailingMatch, 1)) = 1;

-- List customers matching one or more of the patterns but who have changed their pattern

-- Note: This returns customer 6 twice because he matches both patterns at least once

SELECT custid, patternid

FROM @PaymentAnalysis a

--WHERE patternid = 1

GROUP BY custid, patternid

HAVING COUNT(*) < 1. * @Periods / (SELECT LEN(b.pattern) FROM @Patterns b WHERE a.patternid = b.patternid) AND

(MAX(GapToPrior) = 1 OR MAX(GapToNext) = 1 OR

MAX(ISNULL(LeadingMatch, 1)) = 0 OR MAX(ISNULL(TrailingMatch, 1)) = 0);

-- List customers not matching any of the search patterns

SELECT custid

FROM @SampleData

EXCEPT

SELECT custid

FROM @PaymentAnalysis;

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Nevyn SSChampion Points: 14210 More actions · Answer 8

dwain.c (1/7/2015)
My solution can look for any list of patterns:

Right, but you need the list.

I was more asking if he needed the code to identify patterns then look for them, or simply look for one or more specified ones (as his does).

Of course, you could quickly build out a table of possible patterns given certain reasonable restrictions on the length of the pattern. Just trying to figure out if such a step is necessary.

aaron.reese SSChampion Points: 13498 More actions · Answer 9

Dwain, Nevyn,

Wow Dwain that is some complex code! I have only scanned it and like Nevyn says, it requires the user to seed it with a set of predefined patterns. In this instance, that will probably suffice, but it is not a generic solution. I am working with a limited number of characters (U,D,0,-) and so can be definition only have a limited number of patterns.

I still think that a CLR is the only real solution - but my C# skills are not good enough :crazy:

For production I have basically punched holes in the set by running a series of pattern like '%UUD%' queries and stored the resulting set with a corhort description. It's not pretty but it gets the job done for the time being...

Dwain Camps SSC Guru Points: 86978 More actions · Answer 10

aaron.reese (1/8/2015)
Dwain, Nevyn,
Wow Dwain that is some complex code! I have only scanned it and like Nevyn says, it requires the user to seed it with a set of predefined patterns. In this instance, that will probably suffice, but it is not a generic solution. I am working with a limited number of characters (U,D,0,-) and so can be definition only have a limited number of patterns.
I still think that a CLR is the only real solution - but my C# skills are not good enough :crazy:
For production I have basically punched holes in the set by running a series of pattern like '%UUD%' queries and stored the resulting set with a corhort description. It's not pretty but it gets the job done for the time being...

Or you can just use something like this to generate the seed patterns:

DECLARE @SampleData TABLE

(

custid INT PRIMARY KEY

,pattern VARCHAR(100)

);

-- Payment patterns

INSERT INTO @SampleData

-- A "perfectly cyclic" match to patternid = 1

SELECT 1, 'UUUDUUUDUUUD'

-- Three "cyclic" matches to patternid = 1

UNION ALL SELECT 2, 'UDUUUDUUUDUU'

UNION ALL SELECT 3, 'UUDUUUDUUUDU'

UNION ALL SELECT 4, 'DUUUDUUUDUUU'

-- A "cyclic" match to patternid = 2

UNION ALL SELECT 5, 'UUUUDUUUUDUU'

-- Matches patternid=1 but there was a delay on the second payment (2 weeks)

UNION ALL SELECT 6, 'UUUDUUUUUDUU'

-- A customer that matches none of the patterns

UNION ALL SELECT 7, 'UDUDUDUDUDUD'

;

SELECT patternid=ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

,pattern

FROM

(

SELECT pattern=LEFT(pattern, 4)

FROM @SampleData

UNION ALL

SELECT LEFT(pattern, 5)

FROM @SampleData

) a

GROUP BY pattern;

Obviously you may want to exclude some of these by looking at the above results and deciding which patterns it makes sense to search for. You already mentioned that you were interested in the UUUD pattern and one that would work for a 4-4-5 monthly payment cycle, so I assumed that meant you did have some idea of what you were interested in identifying.

And with respect to the complexity, keep in mind that the main method of matching the subsequences developed by Peter Larsson is actually quite simple and also quite fast (his code is amazingly terse). Most of the work overall in what I did is in:

- Splitting your profiles back to individual rows (which I believe is probably not required if you work directly with your real, underlying data)

- Adding the logic to calculate GapToPrior, GapToNext, LeadingMatch and TrailingMatch in the analysis set (none of which is going to slow it down very much)

- Then constructing the queries to pattern match against the analysis set, for which I only provided some basic examples.

In the end of course I'm happy you got something that works for your case. I found the problem quite interesting and as I said before the most interesting I've seen for the new year.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St