Pattern matching (positive across records, negative within a field)

  • Hi Guys,

    I am doing some analysis on our customer base and their payment profiles. I have generated two profile strings, one for whether the balance of an account has gone up or down and one for the size of the balance in relation to the normal invoice amount for the customer. So (for example) the balance movement string will look like this:

    UUUDUUUDUUUD-D00 Where U = Up, D = Down, - = no change and 0 = no change and no balance

    I want to analyse these strings in two ways. The first is that I want to find customers with a similar pattern: in the example below the first and last patterns are the same, just one out of sync but should be considered the same

    Movement Multiple CountRecords

    UUUDUUUDUUUD1230123012301175

    ------------0000000000001163

    UDUUUDUUUDUU3012301230121082

    The second type of analysis is to find customers whose pattern has changed: in the examples above the patterns are repeated and therefore 'normal' in the records below the patterns have changed in that the first part does not match the second part.

    Movement Multiple CountRecords

    UUDUUUDUUUUU-----------07

    UDUUUDUUUUUU------------7

    Can anyone suggest a good way to approach this without either a cursor or a hidden REBAR. The challenge as I see it is that I have to interrogate every string to find out if there is a repeating pattern and if so where it starts and how long it is (heuristic because some strings will start with a repeating pattern and then the pattern may change or deteriorate) and then compare the string for N groups of repeating characters to see if and when it changes and I can't think of an efficient method to do this in SQL because it is not a set based operation

    Any help would be much appreciated

    Aaron

  • Are there any constraints at all on how many changes can be considered a pattern, or to help identify the start/end of a new pattern iteration?

    Failing that (or maybe even better than that), do the patterns all begin/end within a billing cycle that can be identified with a date field?

    Failing that, are there standard patterns that you are looking for, or are you looking to match any pattern with any pattern and find all patterns?

    In short, please make the problem smaller for us

  • @Nevyn

    Unfortunately that is the simplified version 😀

    Let's concentrate on the first part: finding records with the same repeating patterns but in different time phases.

    A manual scan of the data shows that the most common payment pattern is UUUD where D represents the payment of 4 weeks invoices. The problem is that Customer A pays in weeks 4,8,12 and Customer B pays in weeks 1,5,9,11 The patterns are the same but the strings are different. I need to identify both strings as having the repeating UUUD pattern, identify the number of times it repeats and confirm that the straggler characters at the beginning and end of the string match the end and beginning of the previous and subsequent cycles.

    Some customers pay every two weeks, some pay every five, some pay every month so effectively on a 4-4-5 cycle (hence why I need a minimum of 26 weeks data!)

    There are also a large number of customers where the pattern is inconsistent or changes (e.g. they were running every 4 weeks for 3 cycles and now they have not made a payment for 8 weeks and are in arrears)

    The aim is to separate customers who are in arrears but it is part on their normal payment cycle so are of no concern from those who either have no pattern or the pattern has changed, indicating that they may have a cash flow problem and need additional assistance (this is social housing rent)

    The aim is to profile the customers that have gotten into trouble and check the rest of the cohort for similar customers so that any deliquent behaviour can trigger corrective action earlier and focus limited resources where they will do the most good.

    I think that what I need is Ukkonen's Suffix Tree Algorithm

    www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring[/url]

    There are examples available in C# and Javascript but I haven't had time to digest the theory or the code and might struggle to convert it to T-SQL: even if I do then it is a multi pass operation on the string to convert it into n branches with m nodes and then more passes to traverse the tree to find the most repeated and longest repeated patterns' I have about 85K records if I use 13 weeks of history and probably 120K if I use 26 weeks of history - thats a lot of data on a RBAR type query (user defined function)

    If C# versions already exist, could it be written as a CLR and called directly from the query? - that may be substantially quicker than doing it in SQL

  • First thought for first task:

    Compare one customer's pattern to the concatenated pattern of another customer(s). This should alleviate the off-by-n issue. Naturally you could also reduce the repeating bytes of the second pattern to fewer than the entire string:

    SELECT *

    FROM (

    SELECT '123012301230' AS pattern

    ) AS t1

    CROSS JOIN (

    SELECT '301230123012' AS pattern

    ) AS t2

    WHERE

    CHARINDEX(t1.pattern, t2.pattern + t2.pattern) > 0

    SELECT *

    FROM (

    SELECT '123012301230' AS pattern

    ) AS t1

    CROSS JOIN (

    SELECT '301230123012' AS pattern

    ) AS t2

    WHERE

    CHARINDEX(t1.pattern, t2.pattern + LEFT(t2.pattern, 4)) > 0

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Very interesting problem there Aaron! Allow me to offer what I can here.

    My interpretation of this is that it reminds me of Itzik Ben-Gan's subsequences challenge: http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-3, albeit it is necessary to convert your strings into rows to do what I'm about to do. I'm thinking that you may have already had to go through some steps to construct the character strings, so you would not need to do that if you had the raw data as rows.

    Let's suppose you have a set of fixed patterns that you want to search for in the data (@Pattern), then you can use the methods described in that article to identify when subsequences of characters appear consecutively.

    DECLARE @Pattern VARCHAR(10) = 'UUUD';

    WITH SampleData (id, pattern) AS

    (

    SELECT 1, 'UUUDUUUDUUUD'

    UNION ALL SELECT 2, 'UDUUUDUUUDUU'

    UNION ALL SELECT 3, 'UUDUUUDUUUDU'

    UNION ALL SELECT 4, 'UUUDUUUDUUUD'

    ),

    Tally (n) AS

    (

    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns

    ),

    Pattern (n, p) AS

    (

    SELECT n, SUBSTRING(@Pattern, n, 1)

    FROM Tally

    WHERE n <= LEN(@Pattern)

    )

    SELECT id, startpos=MIN(b.n), endpos=MAX(b.n)

    FROM SampleData a

    CROSS APPLY

    (

    SELECT n, s=SUBSTRING(pattern, n, 1)

    FROM Tally

    WHERE n <= LEN(pattern)

    ) b

    JOIN Pattern c

    ON b.s = c.p

    GROUP BY id, b.n - c.n

    HAVING COUNT(*) = LEN(@Pattern)

    ORDER BY ID, startpos;

    In the above, I've used Peter Larsson's brilliantly simple approach even though mine was slightly faster (but more complex) to illustrate what I'm doing. These are the results:

    id startpos endpos

    1 1 4

    1 5 8

    1 9 12

    2 3 6

    2 7 10

    3 4 7

    3 8 11

    4 2 5

    4 6 9

    This tells you that id=1 has 3 matches to the pattern and id=2 has 2.

    You may also notice that because your strings are 12 characters and the pattern is 4 characters, the most matches you can get is 3 and if you get 2 it is highly likely that the pattern is repeating (although you'd need to check leading/trailing characters to be sure).

    Clearly this is not a complete solution but perhaps it is relevant enough to give you a starting point. I'm also not sure whether you can establish what patterns you want to look for in advance.

    I'm not sure if you can use this for the second part of your analysis, but perhaps.

    Edit: Expanded the sample data to 4 ids.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • aaron.reese (1/6/2015)


    @nevyn

    Unfortunately that is the simplified version 😀

    Let's concentrate on the first part: finding records with the same repeating patterns but in different time phases.

    A manual scan of the data shows that the most common payment pattern is UUUD where D represents the payment of 4 weeks invoices. The problem is that Customer A pays in weeks 4,8,12 and Customer B pays in weeks 1,5,9,11 The patterns are the same but the strings are different. I need to identify both strings as having the repeating UUUD pattern, identify the number of times it repeats and confirm that the straggler characters at the beginning and end of the string match the end and beginning of the previous and subsequent cycles.

    Some customers pay every two weeks, some pay every five, some pay every month so effectively on a 4-4-5 cycle (hence why I need a minimum of 26 weeks data!)

    There are also a large number of customers where the pattern is inconsistent or changes (e.g. they were running every 4 weeks for 3 cycles and now they have not made a payment for 8 weeks and are in arrears)

    The aim is to separate customers who are in arrears but it is part on their normal payment cycle so are of no concern from those who either have no pattern or the pattern has changed, indicating that they may have a cash flow problem and need additional assistance (this is social housing rent)

    The aim is to profile the customers that have gotten into trouble and check the rest of the cohort for similar customers so that any deliquent behaviour can trigger corrective action earlier and focus limited resources where they will do the most good.

    I think that what I need is Ukkonen's Suffix Tree Algorithm

    www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring[/url]

    There are examples available in C# and Javascript but I haven't had time to digest the theory or the code and might struggle to convert it to T-SQL: even if I do then it is a multi pass operation on the string to convert it into n branches with m nodes and then more passes to traverse the tree to find the most repeated and longest repeated patterns' I have about 85K records if I use 13 weeks of history and probably 120K if I use 26 weeks of history - thats a lot of data on a RBAR type query (user defined function)

    If C# versions already exist, could it be written as a CLR and called directly from the query? - that may be substantially quicker than doing it in SQL

    Ok, did you want to start with queries that look specifically for the UUUD pattern, or do you need to look for any pattern?

  • Nevyn (1/7/2015)


    aaron.reese (1/6/2015)


    @nevyn

    Unfortunately that is the simplified version 😀

    Let's concentrate on the first part: finding records with the same repeating patterns but in different time phases.

    A manual scan of the data shows that the most common payment pattern is UUUD where D represents the payment of 4 weeks invoices. The problem is that Customer A pays in weeks 4,8,12 and Customer B pays in weeks 1,5,9,11 The patterns are the same but the strings are different. I need to identify both strings as having the repeating UUUD pattern, identify the number of times it repeats and confirm that the straggler characters at the beginning and end of the string match the end and beginning of the previous and subsequent cycles.

    Some customers pay every two weeks, some pay every five, some pay every month so effectively on a 4-4-5 cycle (hence why I need a minimum of 26 weeks data!)

    There are also a large number of customers where the pattern is inconsistent or changes (e.g. they were running every 4 weeks for 3 cycles and now they have not made a payment for 8 weeks and are in arrears)

    The aim is to separate customers who are in arrears but it is part on their normal payment cycle so are of no concern from those who either have no pattern or the pattern has changed, indicating that they may have a cash flow problem and need additional assistance (this is social housing rent)

    The aim is to profile the customers that have gotten into trouble and check the rest of the cohort for similar customers so that any deliquent behaviour can trigger corrective action earlier and focus limited resources where they will do the most good.

    I think that what I need is Ukkonen's Suffix Tree Algorithm

    www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring[/url]

    There are examples available in C# and Javascript but I haven't had time to digest the theory or the code and might struggle to convert it to T-SQL: even if I do then it is a multi pass operation on the string to convert it into n branches with m nodes and then more passes to traverse the tree to find the most repeated and longest repeated patterns' I have about 85K records if I use 13 weeks of history and probably 120K if I use 26 weeks of history - thats a lot of data on a RBAR type query (user defined function)

    If C# versions already exist, could it be written as a CLR and called directly from the query? - that may be substantially quicker than doing it in SQL

    Ok, did you want to start with queries that look specifically for the UUUD pattern, or do you need to look for any pattern?

    My solution can look for any list of patterns:

    WITH SampleData (custid, pattern) AS

    (

    SELECT 1, 'UUUDUUUDUUUD'

    UNION ALL SELECT 2, 'UDUUUDUUUDUU'

    UNION ALL SELECT 3, 'UUDUUUDUUUDU'

    UNION ALL SELECT 4, 'DUUUDUUUDUUU'

    ),

    Patterns (patternid, pattern) AS

    (

    SELECT 1, 'UUUD'

    UNION ALL SELECT 2, 'DUUU'

    ),

    Tally (n) AS

    (

    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns

    ),

    Pattern AS

    (

    SELECT patternid, n, p=SUBSTRING(pattern, n, 1), pattern

    FROM Patterns

    CROSS APPLY Tally

    WHERE n <= LEN(pattern)

    )

    SELECT custid, patternid, startpos=MIN(b.n), endpos=MAX(b.n)

    FROM SampleData a

    CROSS APPLY

    (

    SELECT n, s=SUBSTRING(a.pattern, n, 1)

    FROM Tally

    WHERE n <= LEN(a.pattern)

    ) b

    JOIN Pattern c

    ON b.s = c.p

    GROUP BY custid, patternid, b.n - c.n

    HAVING COUNT(*) = LEN(MAX(c.pattern))

    ORDER BY custid, patternid, startpos;


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Aaron,

    High five to you for coming up with the most interesting challenge (at least to me) so far for the new year! So interesting in fact I couldn't let it go. So let's make the code I gave you a little more usable and show you how to create a set you can analyze. I've tried to document what I did with comments so will abstain from further explanation here.

    I really hope you get back to me and let me know if any of this effort was not in vain!

    DECLARE @Periods INT = 12; -- Probably 26 in your case

    DECLARE @SampleData TABLE

    (

    custid INT PRIMARY KEY

    ,pattern VARCHAR(100)

    );

    -- Payment patterns

    INSERT INTO @SampleData

    -- A "perfectly cyclic" match to patternid = 1

    SELECT 1, 'UUUDUUUDUUUD'

    -- Three "cyclic" matches to patternid = 1

    UNION ALL SELECT 2, 'UDUUUDUUUDUU'

    UNION ALL SELECT 3, 'UUDUUUDUUUDU'

    UNION ALL SELECT 4, 'DUUUDUUUDUUU'

    -- A "cyclic" match to patternid = 2

    UNION ALL SELECT 5, 'UUUUDUUUUDUU'

    -- Matches patternid=1 but there was a delay on the second payment (2 weeks)

    UNION ALL SELECT 6, 'UUUDUUUUUDUU'

    -- A customer that matches none of the patterns

    UNION ALL SELECT 7, 'UDUDUDUDUDUD'

    ;

    DECLARE @Patterns TABLE

    (

    patternid INT PRIMARY KEY

    ,pattern VARCHAR(100)

    );

    -- Patterns to search for

    INSERT INTO @Patterns

    SELECT 1, 'UUUD'

    UNION ALL SELECT 2, 'UUUUD';

    DECLARE @PaymentAnalysis TABLE

    (

    custid INT NOT NULL

    ,patternid INT NOT NULL

    ,startpos INT NOT NULL

    ,endpos INT NOT NULL

    ,GapToPrior INT NOT NULL

    ,GapToNext INT NOT NULL

    ,LeadingMatch INT NULL

    ,TrailingMatch INT NULL

    ,PRIMARY KEY (custid, patternid, startpos)

    );

    -- Retrieve analysis set

    WITH Tally (n) AS

    (

    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns

    ),

    -- This CTE isn't needed if you simply deduce the U or D from the change in payment balance

    -- across rows that represent periods (i.e., throw away the profile string)

    PaymentPatterns AS

    (

    SELECT custid, pattern, n, p=SUBSTRING(pattern, n, 1)

    FROM @SampleData

    CROSS APPLY Tally

    WHERE n <= LEN(pattern)

    ),

    SearchPatterns AS

    (

    SELECT patternid, pattern, n, s=SUBSTRING(pattern, n, 1)

    FROM @Patterns

    CROSS APPLY Tally

    WHERE n <= LEN(pattern)

    )

    INSERT INTO @PaymentAnalysis

    SELECT custid, patternid, startpos, endpos

    -- Returns 1 if there is a gap between the prior matching string and this one

    ,GapToPrior = CASE LAG(endpos, 1, 0) OVER (PARTITION BY custid, patternid ORDER BY startpos)

    WHEN startpos - 1 THEN 0

    WHEN 0 THEN 0

    ELSE 1

    END

    -- Returns 1 if there is a gap between the next matching string and this one

    ,GapToNext = CASE LEAD(startpos, 1, 0) OVER (PARTITION BY custid, patternid ORDER BY startpos)

    WHEN endpos + 1 THEN 0

    WHEN 0 THEN 0

    ELSE 1

    END

    -- Match leading characters of payment pattern vs. trailing characters of search pattern

    -- Note: All rows NULL if "perfectly cyclic" meaning 3 matches on a 4 char search string in a 12 char payment pattern

    ,LeadingMatch = CASE MIN(startpos) OVER (PARTITION BY custid, patternid)

    -- When first match at position one, no need to check against trailing pattern chars

    WHEN 1

    THEN NULL

    -- Match trailing search characters up to position one for the first startpos row

    WHEN startpos

    THEN CASE LEFT(PaymentPattern, startpos-1)

    WHEN RIGHT(SearchPattern, startpos-1) THEN 1

    ELSE 0

    END

    -- Ignore other rows

    ELSE NULL

    END

    -- Match trailing characters of payment pattern vs. leading characters of search pattern

    -- Note: All rows NULL if "perfectly cyclic" meaning 3 matches on a 4 char search string in a 12 char payment pattern

    ,TrailingMatch = CASE MAX(endpos) OVER (PARTITION BY custid, patternid)

    -- When last match ends at the end of the payment pattern, no need to check against leading pattern chars

    WHEN LEN(PaymentPattern)

    THEN NULL

    -- Match leading search characters against trailing payment characters

    WHEN endpos

    THEN CASE RIGHT(PaymentPattern, LEN(PaymentPattern)-(endpos+1))

    WHEN LEFT(SearchPattern, LEN(PaymentPattern)-(endpos+1)) THEN 1

    ELSE 0

    END

    -- Ignore other rows

    ELSE NULL

    END

    FROM

    (

    SELECT custid, patternid, startpos=MIN(a.n), endpos=MAX(a.n)

    ,PaymentPattern = MAX(a.pattern)

    ,SearchPattern = MAX(b.pattern)

    FROM PaymentPatterns a

    JOIN SearchPatterns b

    ON a.p = b.s -- payment pattern char = search pattern char

    GROUP BY custid, patternid, a.n - b.n

    HAVING COUNT(*) = LEN(MAX(b.pattern))

    ) a;

    -- Here's the payment analysis set

    SELECT *

    FROM @PaymentAnalysis

    ORDER BY custid, patternid, startpos;

    -- List "perfectly cyclic" matches (i.e., on a 4 char search string in a 12 character pattern there are 3 rows)

    SELECT custid, patternid

    FROM @PaymentAnalysis a

    --WHERE patternid = 1

    GROUP BY custid, patternid

    HAVING COUNT(*) = 1. * @Periods / (SELECT LEN(b.pattern) FROM @Patterns b WHERE a.patternid = b.patternid);

    -- List "cyclic" matches, meaning leading or trailing characters match the search pattern

    -- but the match count is one less than periods / length of the search pattern

    SELECT custid, patternid

    FROM @PaymentAnalysis a

    --WHERE patternid = 1

    GROUP BY custid, patternid

    HAVING COUNT(*) = CEILING(1. * @Periods / (SELECT LEN(b.pattern) FROM @Patterns b WHERE a.patternid = b.patternid)) - 1 AND

    MAX(GapToPrior) = 0 AND MAX(GapToNext) = 0 AND

    MAX(ISNULL(LeadingMatch, 1)) = 1 AND MAX(ISNULL(TrailingMatch, 1)) = 1;

    -- List customers matching one or more of the patterns but who have changed their pattern

    -- Note: This returns customer 6 twice because he matches both patterns at least once

    SELECT custid, patternid

    FROM @PaymentAnalysis a

    --WHERE patternid = 1

    GROUP BY custid, patternid

    HAVING COUNT(*) < 1. * @Periods / (SELECT LEN(b.pattern) FROM @Patterns b WHERE a.patternid = b.patternid) AND

    (MAX(GapToPrior) = 1 OR MAX(GapToNext) = 1 OR

    MAX(ISNULL(LeadingMatch, 1)) = 0 OR MAX(ISNULL(TrailingMatch, 1)) = 0);

    -- List customers not matching any of the search patterns

    SELECT custid

    FROM @SampleData

    EXCEPT

    SELECT custid

    FROM @PaymentAnalysis;


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (1/7/2015)


    My solution can look for any list of patterns:

    Right, but you need the list.

    I was more asking if he needed the code to identify patterns then look for them, or simply look for one or more specified ones (as his does).

    Of course, you could quickly build out a table of possible patterns given certain reasonable restrictions on the length of the pattern. Just trying to figure out if such a step is necessary.

  • Dwain, Nevyn,

    Wow Dwain that is some complex code! I have only scanned it and like Nevyn says, it requires the user to seed it with a set of predefined patterns. In this instance, that will probably suffice, but it is not a generic solution. I am working with a limited number of characters (U,D,0,-) and so can be definition only have a limited number of patterns.

    I still think that a CLR is the only real solution - but my C# skills are not good enough :crazy:

    For production I have basically punched holes in the set by running a series of pattern like '%UUD%' queries and stored the resulting set with a corhort description. It's not pretty but it gets the job done for the time being...

  • aaron.reese (1/8/2015)


    Dwain, Nevyn,

    Wow Dwain that is some complex code! I have only scanned it and like Nevyn says, it requires the user to seed it with a set of predefined patterns. In this instance, that will probably suffice, but it is not a generic solution. I am working with a limited number of characters (U,D,0,-) and so can be definition only have a limited number of patterns.

    I still think that a CLR is the only real solution - but my C# skills are not good enough :crazy:

    For production I have basically punched holes in the set by running a series of pattern like '%UUD%' queries and stored the resulting set with a corhort description. It's not pretty but it gets the job done for the time being...

    Or you can just use something like this to generate the seed patterns:

    DECLARE @SampleData TABLE

    (

    custid INT PRIMARY KEY

    ,pattern VARCHAR(100)

    );

    -- Payment patterns

    INSERT INTO @SampleData

    -- A "perfectly cyclic" match to patternid = 1

    SELECT 1, 'UUUDUUUDUUUD'

    -- Three "cyclic" matches to patternid = 1

    UNION ALL SELECT 2, 'UDUUUDUUUDUU'

    UNION ALL SELECT 3, 'UUDUUUDUUUDU'

    UNION ALL SELECT 4, 'DUUUDUUUDUUU'

    -- A "cyclic" match to patternid = 2

    UNION ALL SELECT 5, 'UUUUDUUUUDUU'

    -- Matches patternid=1 but there was a delay on the second payment (2 weeks)

    UNION ALL SELECT 6, 'UUUDUUUUUDUU'

    -- A customer that matches none of the patterns

    UNION ALL SELECT 7, 'UDUDUDUDUDUD'

    ;

    SELECT patternid=ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    ,pattern

    FROM

    (

    SELECT pattern=LEFT(pattern, 4)

    FROM @SampleData

    UNION ALL

    SELECT LEFT(pattern, 5)

    FROM @SampleData

    ) a

    GROUP BY pattern;

    Obviously you may want to exclude some of these by looking at the above results and deciding which patterns it makes sense to search for. You already mentioned that you were interested in the UUUD pattern and one that would work for a 4-4-5 monthly payment cycle, so I assumed that meant you did have some idea of what you were interested in identifying.

    And with respect to the complexity, keep in mind that the main method of matching the subsequences developed by Peter Larsson is actually quite simple and also quite fast (his code is amazingly terse). Most of the work overall in what I did is in:

    - Splitting your profiles back to individual rows (which I believe is probably not required if you work directly with your real, underlying data)

    - Adding the logic to calculate GapToPrior, GapToNext, LeadingMatch and TrailingMatch in the analysis set (none of which is going to slow it down very much)

    - Then constructing the queries to pattern match against the analysis set, for which I only provided some basic examples.

    In the end of course I'm happy you got something that works for your case. I found the problem quite interesting and as I said before the most interesting I've seen for the new year.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply