How to search for a pattern of consecutive records?

  • Happy New Year!

    I tried to find a solution for an unusual problem. I have following data:

    1cata

    2catb

    3catc

    4cata

    5catc

    1doga

    2dogc

    3dogb

    4dogd

    4doga

    5dogc

    6dogc

    1panthera

    2panthera

    3pantherb

    4pantherc

    5panthera

    6pantherc

    7pantherb

    8panthera

    9pantherc

    10pantherb

    11pantherd

    I have to find last record of a consecutive pattern. Pattern can be any length from 1 - n.

    Assuming I'm looking for a pattern like:

    a c b

    As an answer I need to get following records:

    3dogb

    7pantherb

    10pantherb

    Assuming I'm looking for a pattern like:

    a c b d

    As an answer I need to get:

    4dogd

    11pantherd

    I spent some time looking for solutions. Honestly - I can't find one for now. Maybe it is just post Christmas / New Year brake slow down 🙂

    Help me please,

    Sebastian

  • First you need to identify the prior n rows for each set (dog, cat, etc.) for each entry. That can be done by using row_number and partitioning by the dog/cat/panther column, ordering by the number column.

    Then, use the FOR XML PATH trick to concatenate the abc column into a string for each row.

    Then compare that to the pattern.

    The first step could be done in a CTE or temp table, then the second step would be done by cross applying that to itself using the length of the pattern in a TOP statement.

    However, this kind of thing is something that won't perform well on any sizeable amount of data.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks,

    Although that is the problem... Amount of data is huge. Not in gigabytes but in records. I tried doing something like this with a recursive stored procedure.

    Due to number of records I've been wondering about as simple solution as it could possibly be.

    What if... I’d add additional column in source data. By default it would be NULL. In stored procedure I would add very simple loop going from the beginning to the end of the table. That loop will store concatenated n consecutive values in a string. In every row I will add current value and remove first one. So I will have n-length pattern in new column. Resetting the string every time cat changes to dog or panther etc.

    Than – all I have to do is just filter for rows with column 4 equal to the pattern.

    1cataa

    2catbab

    3catcabc

    4catabca

    5catccac

    1dogaa

    2dogcac

    3dogbacb

    4dogdcbd

    4dogabda

    5dogcdac

    6dogcacc

    1pantheraa

    2pantheraaa

    3pantherbaab

    4panthercabc

    5pantherabca

    6pantherccac

    7pantherbacb

    8pantheracba

    9panthercbac

    10pantherbacb

    11pantherdcbd

    Do you think it will be time / resources consuming task? I’m afraid it may generate huge mess in db…

    Thanks,

    Sebastian

  • That would work. Do it in a temp table instead of the real table, and query the temp table. That way, you don't mess up the database.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Yep, you are right about the temp table.

    Do you think it will not be performance killer?

  • This works for me on the sample data. Performance might be a problem on large datasets.

    CREATE TABLE #tbl1(number INT , animal VARCHAR(10) , grade CHAR(1) )

    INSERT INTO #tbl1

    select 1,'cat','a'

    UNION ALL select 2,'cat','b'

    UNION ALL select 3,'cat','c'

    UNION ALL select 4,'cat','a'

    UNION ALL select 5,'cat','c'

    UNION ALL select 1,'dog','a'

    UNION ALL select 2,'dog','c'

    UNION ALL select 3,'dog','b'

    UNION ALL select 4,'dog','d'

    UNION ALL select 4,'dog','a'

    UNION ALL select 5,'dog','c'

    UNION ALL select 6,'dog','c'

    UNION ALL select 1,'panther','a'

    UNION ALL select 2,'panther','a'

    UNION ALL select 3,'panther','b'

    UNION ALL select 4,'panther','c'

    UNION ALL select 5,'panther','a'

    UNION ALL select 6,'panther','c'

    UNION ALL select 7,'panther','b'

    UNION ALL select 8,'panther','a'

    UNION ALL select 9,'panther','c'

    UNION ALL select 10,'panther','b'

    UNION ALL select 11,'panther','d'

    DECLARE @SEARCH VARCHAR(100)

    SELECT @SEARCH = --'a c b'

    'a c b d'

    ;WITH

    CTE_ROWID

    AS

    (

    SELECT

    ROW_NUMBER() OVER (PARTITION BY animal ORDER BY animal) as RW

    , *

    FROM #tbl1

    )

    ,CTE_CONCAT

    AS

    (

    SELECT

    CURR.RWAS RW

    ,CURR.NumberAS ORIGNUMBER

    ,CURR.gradeAS ORIGGRADE

    ,CURR.animal

    ,CAST(CURR.grade AS NVARCHAR(200)) as grade

    FROM

    CTE_ROWID CURR

    UNION ALL

    SELECT

    PREV.RW

    ,CURR.ORIGNUMBERAS ORIGNUMBER

    ,CURR.ORIGGRADEAS ORIGGRADE

    ,CURR.animal

    ,CAST(PREV.grade + ' ' + CURR.grade AS NVARCHAR(200)) as grade

    FROM

    CTE_CONCAT CURR

    INNER JOIN CTE_ROWID PREVON CURR.animal = prev.animal

    and CURR.RW = PREV.RW + 1

    )

    SELECT

    CTE_CONCAT.ORIGNUMBERas number

    ,CTE_CONCAT.animal

    ,CTE_CONCAT.ORIGGRADEas grade

    FROM CTE_CONCAT

    WHERE

    grade = @SEARCH

    drop table #tbl1

    If you are using a temp table you might be able to trim the results to just the length of the pattern you are searching for. You can also add the level of recursion to the CTE based query and add it to the where clause.

  • Thanks guys for your help.

    Yesterday I tried to solve that problem. With your suggestions I was moving forward. But... I got additional requirement now. The requirement is:

    pattern may contain ranges e.g.

    a [a-c] b d

    Since I have all alphabet in the use it can be even:

    [a-c] [b-c] [g-l] [y-z]

    Argh...

    But it is not a problem... It is a challenge :w00t:

  • Sebastian,

    In the sample data you included this:

    1 dog a

    2 dog c

    3 dog b

    4 dog d

    4 dog a

    5 dog c

    6 dog c

    Note, the 4 for dog d and a is repeated. Is this a typo, or is does this row numbering actually repeat? The reason I'm asking is order control, and if it repeats, you can't be sure by any ordering if "4 dog d" or "4 dog a" comes first except dumb luck and it'll need another external control.

    I may have a reasonably fast method to deal with this otherwise, however, I'll need to know that first.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Good point,

    It is an error. Sorry for that. I made the sample data manually.

  • First, tertiusdp needs to be thanked for making consumable sample data.

    Now, for the code. I modified the sample data to remove the typo:

    IF OBJECT_ID('tempdb..#tbl1') IS NOT NULL

    DROP TABLE #tbl1

    IF OBJECT_ID('tempdb..#tmp') IS NOT NULL

    DROP TABLE #tmp

    IF OBJECT_ID('tempdb..#tmp2') IS NOT NULL

    DROP TABLE #tmp2

    CREATE TABLE #tbl1(number INT , animal VARCHAR(10) , grade CHAR(1) )

    INSERT INTO #tbl1

    select 1,'cat','a'

    UNION ALL select 2,'cat','b'

    UNION ALL select 3,'cat','c'

    UNION ALL select 4,'cat','a'

    UNION ALL select 5,'cat','c'

    UNION ALL select 1,'dog','a'

    UNION ALL select 2,'dog','c'

    UNION ALL select 3,'dog','b'

    UNION ALL select 4,'dog','d'

    UNION ALL select 5,'dog','a'

    UNION ALL select 6,'dog','c'

    UNION ALL select 7,'dog','c'

    UNION ALL select 1,'panther','a'

    UNION ALL select 2,'panther','a'

    UNION ALL select 3,'panther','b'

    UNION ALL select 4,'panther','c'

    UNION ALL select 5,'panther','a'

    UNION ALL select 6,'panther','c'

    UNION ALL select 7,'panther','b'

    UNION ALL select 8,'panther','a'

    UNION ALL select 9,'panther','c'

    UNION ALL select 10,'panther','b'

    UNION ALL select 11,'panther','d'

    This first component simply pivots the pattern field sideways.

    This article will help you understand what it's doing if you require it:

    http://www.sqlservercentral.com/articles/FOR+XML+PATH/70203/

    SELECTDISTINCT animal,

    REPLACE((SELECTgrade AS 'data()'

    FROM#tbl1 AS t2

    WHEREt1.animal = t2.animal

    ORDER BY number

    FOR XML PATH('')

    ), ' ', '') AS Pattern

    INTO

    #tmp

    FROM

    #tbl1 AS t1

    SELECT * FROM #tmp

    Next, we need to locate the positions where your pattern occurs. You'll have to use @PatLen as an input instead of deriving it. This uses a Tally table. You can find an explanation of that in my signature, but it's basically just a stored list of Numbers you can use instead of a counter loop, and runs much faster.

    What we're doing is locating when the pattern exists in each subset of data (in this case, 4 characters, starting from each position), finding that pattern, then subtracting 1. This is the row position of the END of the pattern chain.

    DECLARE @pattern VARCHAR(100),

    @PatLen INT

    SET @pattern = 'a[a-c]bd'

    --SET @PatLen = LEN( @pattern)

    SET @PatLen = 4

    SELECT

    t.Animal,

    tl.N + @PatLen -1 AS RowNumToFind

    INTO

    #tmp2

    FROM

    #tmp AS t,

    tempdb..Tally AS tl

    WHERE

    tl.N < LEN( t.Pattern)

    AND SUBSTRING( t.Pattern, tl.N, @PatLen) like @Pattern

    Finally, apply rownumbers to each partition of animal, and reference our previously located pattern ends.

    SELECT

    drv.*

    FROM

    (SELECT

    *,

    ROW_NUMBER() OVER ( PARTITION BY tb.Animal ORDER BY tb.number) AS RowNum

    FROM

    #tbl1 AS tb

    ) AS drv

    JOIN

    #tmp2 AS t2

    ONdrv.Animal = t2.Animal

    AND drv.RowNum = t2.RowNumToFind

    I broke it up to make it easier to see each component. You could subquery the entire lot into a single query, however, I would recommend splitting at the #tmp2 portion either way, to save yourself extra overhead.

    Let me know if that does what you need.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Craig,

    It works. I have to make some tweaks. E.g. pattern comes from SELECT from another table and I have to add some more conditions but it does what it should. I'm going to test it on larger amount of data. It has to be available almost instantly.

    It is part of a process that selects n following rows and then gives a kind of histogram.

    Fun, fun, fun 🙂

    Thank all of you guys. It was and still is interesting challenge. Senores, I owe you cerveza. 😉

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply