Data Islands and Gaps - How To

  • shaunna (5/22/2012)


    The records are not always one minute apart, they can actually be time stamped fairly irregularly because the polling servers that record the data can have delays before writing to the database. So now it's getting tricky.

    Rather than just being able to partition and count if there is more than ten in a row, I have to know if there is more than a ten minute time difference from the first timestamp and the last timestamp of a consecutive set of records that had greater than 95% utilization (for a given interface of course).

    I was afraid you were going to say that. I think this eliminates the possibilty of doing it with a single table scan. Will look into it if Jeff doesn't beat me to it.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • shaunna (5/22/2012)


    OK, after getting it in place and testing some data something came to light that makes this more challenging (I think:)).

    The records are not always one minute apart, they can actually be time stamped fairly irregularly because the polling servers that record the data can have delays before writing to the database. So now it's getting tricky.

    Rather than just being able to partition and count if there is more than ten in a row, I have to know if there is more than a ten minute time difference from the first timestamp and the last timestamp of a consecutive set of records that had greater than 95% utilization (for a given interface of course).

    I think my brain might explode.

    A small mod to Jeffs query

    WITH

    cteGroup AS

    (

    SELECT InterfaceName, TheDateTime, UtilizationPercent,

    ContiguousGroup =

    ROW_NUMBER() OVER

    (PARTITION BY InterfaceName, CASE WHEN UtilizationPercent > 95 THEN 1 ELSE 0 END

    ORDER BY TheDateTime)

    - ROW_NUMBER() OVER

    (ORDER BY InterFaceName,TheDateTime)

    FROM @t

    )

    SELECT InterfaceName,

    IntervalStart = MIN(TheDateTime),

    IntervalEnd = MAX(TheDateTime)

    FROM cteGroup

    WHERE UtilizationPercent > 95

    GROUP BY InterfaceName,ContiguousGroup

    HAVING DATEDIFF(Minute,MIN(TheDateTime),MAX(TheDateTime)) > 10;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Thank you, the last solution is working great!

  • Hopefully even though this is a reply to an old post; I hope someone may read it and assist me in understanding what I am doing wrong to find the data islands. What do I not understand correctly about how this is supposed to work.

    I have a lot of tags (almost 2 million) in a table that are identified by a prefix + a number. I need to find the groups (data islands) with a starting prefix + number to prefix + number. I tried to use the code shown below. The table has each tag defined by a TagNumber varchar(38), Prefix varchar(7), Number numeric(38,0). When I run the query, it is not even close to what I expected or need.

    using the samle data, this is what is expected:

    start_range end_range

    -------------------------------------- --------------------------------------

    840003008335608 840003008335613

    840003108335614 840003108335617

    91BDO2751 91BDO2753

    91CQO2754 91CQO2756

    91AEO2757 91AEO2760

    and this is what I get:

    start_range end_range

    -------------------------------------- --------------------------------------

    840003008335608 840003108335617

    91BDO2751 840003108335617

    840003008335608 91AEO2760

    91BDO2751 91AEO2760

    When I remove the Partition By Prefix I get:

    start_range end_range

    -------------------------------------- --------------------------------------

    840003008335608 840003108335617

    91BDO2751 91AEO2760

    Neither of the above is correct. What is wrong with the query I have built please?

    *********************************************

    For the sake of a similar sample I have the following:

    DECLARE @tags TABLE(TagNumber varchar(38), Prefix varchar(7),Number numeric(38,0))

    INSERT INTO @tags(TagNumber, Prefix,Number)

    VALUES('840003008335608','8400030',8335608)

    ,('840003008335609','8400030',8335609)

    ,('840003008335610','8400030',8335610)

    ,('840003008335611','8400030',8335611)

    ,('840003008335612','8400030',8335612)

    ,('840003008335613','8400030',8335613)

    ,('840003108335614','8400031',8335614)

    ,('840003108335615','8400031',8335615)

    ,('840003108335616','8400031',8335616)

    ,('840003108335617','8400031',8335617)

    ,('91BDO2751','91BDO',2751)

    ,('91BDO2752','91BDO',2752)

    ,('91BDO2753','91BDO',2753)

    ,('91CQO2754','91CQO',2754)

    ,('91CQO2755','91CQO',2755)

    ,('91CQO2756','91CQO',2756)

    ,('91AEO2757','91AEO',2757)

    ,('91AEO2758','91AEO',2758)

    ,('91AEO2759','91AEO',2759)

    ,('91AEO2760','91AEO',2760);

    WITH StartingPoints AS

    (

    SELECT A.TagNumber, ROW_NUMBER() OVER(PARTITION BY Prefix ORDER BY A.TagNumber) AS rownum

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT *

    FROM @tags AS B

    WHERE B.Number = A.Number - 1)

    ),

    EndingPoints AS

    (

    SELECT A.TagNumber, ROW_NUMBER() OVER(PARTITION BY Prefix ORDER BY A.TagNumber) AS rownum

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT *

    FROM @tags AS B

    WHERE B.Number = A.Number + 1)

    )

    SELECT

    S.TagNumber AS start_range

    , E.TagNumber AS end_range

    FROM StartingPoints AS S

    JOIN EndingPoints AS E ON E.rownum = S.rownum

    ORDER BY 2;

  • In the future, you'd probably be best served by just opening a new thread with your question.

    For now, though, here's my best explanation of why your query isn't doing what you want.

    Your first CTE is grabbing all the rows from @tags for which there is not a row with the immediately preceding number. Effectively, you're grabbing the lowest value for each consecutive set of values for Number.

    In the sample data, there are two sets of consecutive numbers, starting with 8335608 and 2751 respectively. The first CTE returns those two rows.

    The second CTE does the same as the first, only it grabs the final number of each consecutive set of numbers. Again, there are two sets of consecutive numbers in the sample data, ending with 8335617 and 2760 respectively. The second CTE returns those two rows.

    All of those rows in the two CTEs will have a rownum of 1, since they are each the only rows in their result set with that prefix.

    So, when you do an inner join of the two CTEs using rownum, you just get all 4 combinations of the rows from each CTE.

    The main problem with your code is that it's not prefix-aware. You're just grabbing the beginning and end of each consecutive set of numbers, no matter what prefix they're associated with.

    If you make sure you grab the first and last value for each consecutive set per prefix in the CTEs, and then also add that to the join condition in the final query, you should get the results you want.

    Something like this should give you ideas to play with:

    DECLARE @tags TABLE(TagNumber varchar(38), Prefix varchar(7),Number numeric(38,0))

    INSERT INTO @tags(TagNumber, Prefix,Number)

    VALUES('840003008335608','8400030',8335608)

    ,('840003008335609','8400030',8335609)

    ,('840003008335610','8400030',8335610)

    ,('840003008335611','8400030',8335611)

    ,('840003008335612','8400030',8335612)

    ,('840003008335613','8400030',8335613)

    ,('840003108335614','8400031',8335614)

    ,('840003108335615','8400031',8335615)

    ,('840003108335616','8400031',8335616)

    ,('840003108335617','8400031',8335617)

    ,('91BDO2751','91BDO',2751)

    ,('91BDO2752','91BDO',2752)

    ,('91BDO2753','91BDO',2753)

    ,('91CQO2754','91CQO',2754)

    ,('91CQO2755','91CQO',2755)

    ,('91CQO2756','91CQO',2756)

    ,('91AEO2757','91AEO',2757)

    ,('91AEO2758','91AEO',2758)

    ,('91AEO2759','91AEO',2759)

    ,('91AEO2760','91AEO',2760);

    --In the sample data, each prefix just has one

    --block of consecutive numbers, so there are no

    --gaps/islands to account for. Because of that,

    --this would even work for the sample data,

    --but won't work if there are gaps/islands

    SELECT Prefix,

    Start_Range=MIN(TagNumber),

    End_Range=MAX(TagNumber)

    FROM @tags

    GROUP BY Prefix;

    --If we assume that the actual data can have gaps/islands

    --in the numbers for each prefix, then something like

    --this would work, although it's not particularly efficient.

    WITH GroupedTags AS (

    SELECT RN=ROW_NUMBER() OVER (PARTITION BY Prefix ORDER BY Number asc),

    RelativeToBase=Number-MIN(Number) OVER (PARTITION BY Prefix),

    TagNumber,

    Prefix

    FROM @tags)

    SELECT Prefix,

    Start_Range=MIN(TagNumber),

    End_Range=MAX(TagNumber)

    FROM GroupedTags

    GROUP BY Prefix, (RelativeToBase-RN)

    ORDER BY Prefix ASC, Start_Range ASC;

    --A version of your original code,

    --now made prefix-aware.

    WITH StartingPoints AS

    (

    SELECT A.TagNumber, ROW_NUMBER() OVER(PARTITION BY Prefix ORDER BY A.TagNumber) AS rownum,

    Prefix

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT *

    FROM @tags AS B

    WHERE B.Number = A.Number - 1

    AND B.Prefix=A.Prefix)

    ),

    EndingPoints AS

    (

    SELECT A.TagNumber, ROW_NUMBER() OVER(PARTITION BY Prefix ORDER BY A.TagNumber) AS rownum,

    Prefix

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT *

    FROM @tags AS B

    WHERE B.Number = A.Number + 1

    AND B.Prefix=A.Prefix)

    )

    SELECT

    S.TagNumber AS start_range

    , E.TagNumber AS end_range

    FROM StartingPoints AS S

    JOIN EndingPoints AS E ON E.rownum = S.rownum

    AND S.Prefix=E.Prefix

    ORDER BY end_range;

    --It's generally preferred to specify column names in the ORDER BY,

    --and not just the ordinal position, even though that's possible,

    --so I changed that bit. It just makes the code easier to read and a bit more robust

    --(in case you add columns to the result set, for example)

    Cheers!

  • Ed,

    Applying the final query from the previous posts, to your situation, yields this:

    DECLARE @tags AS TABLE (

    TagNumber varchar(38),

    Prefix varchar(7),

    Number numeric(38,0)

    );

    INSERT INTO @tags (TagNumber, Prefix, Number)

    VALUES

    ('840003008335608','8400030',8335608)

    ,('840003008335609','8400030',8335609)

    ,('840003008335610','8400030',8335610)

    ,('840003008335611','8400030',8335611)

    ,('840003008335612','8400030',8335612)

    ,('840003008335613','8400030',8335613)

    ,('840003108335614','8400031',8335614)

    ,('840003108335615','8400031',8335615)

    ,('840003108335616','8400031',8335616)

    ,('840003108335617','8400031',8335617)

    ,('91BDO2751','91BDO',2751)

    ,('91BDO2752','91BDO',2752)

    ,('91BDO2753','91BDO',2753)

    ,('91CQO2754','91CQO',2754)

    ,('91CQO2755','91CQO',2755)

    ,('91CQO2756','91CQO',2756)

    ,('91AEO2757','91AEO',2757)

    ,('91AEO2758','91AEO',2758)

    ,('91AEO2759','91AEO',2759)

    ,('91AEO2760','91AEO',2760);

    WITH cteGroup AS (

    SELECT T1.*,

    ConsecutiveGroup = ROW_NUMBER() OVER(PARTITION BY T1.Prefix,

    CASE WHEN EXISTS (SELECT 1 FROM @tags AS T2 WHERE T2.Prefix = T1.Prefix AND T1.Number IN (T2.Number - 1, T2.Number + 1)) THEN 1 ELSE 0 END

    ORDER BY T1.Number) - ROW_NUMBER() OVER(ORDER BY T1.Prefix, T1.Number)

    FROM @tags AS T1

    )

    SELECT MIN(TagNumber) AS Start_Range,

    MAX(TagNumber) AS End_Range

    FROM cteGroup

    GROUP BY Prefix, ConsecutiveGroup

    ORDER BY MIN(TagNumber)

    Test and let me know...

    EDIT: forgot to include ConsecutiveGroup in the GROUP BY.

    Steve (aka sgmunson) πŸ™‚ πŸ™‚ πŸ™‚
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson (7/10/2015)


    Ed,

    Applying the final query from the previous posts, to your situation, yields this:

    DECLARE @tags AS TABLE (

    TagNumber varchar(38),

    Prefix varchar(7),

    Number numeric(38,0)

    );

    INSERT INTO @tags (TagNumber, Prefix, Number)

    VALUES

    ('840003008335608','8400030',8335608)

    ,('840003008335609','8400030',8335609)

    ,('840003008335610','8400030',8335610)

    ,('840003008335611','8400030',8335611)

    ,('840003008335612','8400030',8335612)

    ,('840003008335613','8400030',8335613)

    ,('840003108335614','8400031',8335614)

    ,('840003108335615','8400031',8335615)

    ,('840003108335616','8400031',8335616)

    ,('840003108335617','8400031',8335617)

    ,('91BDO2751','91BDO',2751)

    ,('91BDO2752','91BDO',2752)

    ,('91BDO2753','91BDO',2753)

    ,('91CQO2754','91CQO',2754)

    ,('91CQO2755','91CQO',2755)

    ,('91CQO2756','91CQO',2756)

    ,('91AEO2757','91AEO',2757)

    ,('91AEO2758','91AEO',2758)

    ,('91AEO2759','91AEO',2759)

    ,('91AEO2760','91AEO',2760);

    WITH cteGroup AS (

    SELECT T1.*,

    ConsecutiveGroup = ROW_NUMBER() OVER(PARTITION BY T1.Prefix,

    CASE WHEN EXISTS (SELECT 1 FROM @tags AS T2 WHERE T2.Prefix = T1.Prefix AND T1.Number IN (T2.Number - 1, T2.Number + 1)) THEN 1 ELSE 0 END

    ORDER BY T1.Number) - ROW_NUMBER() OVER(ORDER BY T1.Prefix, T1.Number)

    FROM @tags AS T1

    )

    SELECT MIN(TagNumber) AS Start_Range,

    MAX(TagNumber) AS End_Range

    FROM cteGroup

    GROUP BY Prefix, ConsecutiveGroup

    ORDER BY MIN(TagNumber)

    Test and let me know...

    EDIT: forgot to include ConsecutiveGroup in the GROUP BY.

    Alas, this won't work if there's more than one consecutive group for a prefix, as some updated sample data shows. Also, I realized that my brain had gone on autopilot in my previous reply, and there's absolutely no reason I had to do the MIN aggregation. Just got it in my head that I would do it that way, and all reason went out the window! :blush:

    On the brighter side, that makes it a much more efficient query.

    Here's the updated sample data and code:

    DECLARE @tags TABLE(TagNumber varchar(38), Prefix varchar(7),Number numeric(38,0))

    INSERT INTO @tags(TagNumber, Prefix,Number)

    VALUES('840003008335608','8400030',8335608)

    ,('840003008335609','8400030',8335609)

    ,('840003008335610','8400030',8335610)

    ,('840003008335611','8400030',8335611)

    ,('840003008335612','8400030',8335612)

    ,('840003008335613','8400030',8335613)

    ,('840003108335614','8400031',8335614)

    ,('840003108335615','8400031',8335615)

    ,('840003108335616','8400031',8335616)

    ,('840003108335617','8400031',8335617)

    ,('840003108335619','8400031',8335619) --This row and the next one were added to give another consecutive group for one prefix

    ,('840003108335620','8400031',8335620)

    ,('91BDO2751','91BDO',2751)

    ,('91BDO2752','91BDO',2752)

    ,('91BDO2753','91BDO',2753)

    ,('91CQO2754','91CQO',2754)

    ,('91CQO2755','91CQO',2755)

    ,('91CQO2756','91CQO',2756)

    ,('91AEO2757','91AEO',2757)

    ,('91AEO2758','91AEO',2758)

    ,('91AEO2759','91AEO',2759)

    ,('91AEO2760','91AEO',2760);

    --In the original sample data, each prefix just has oneΒ 

    --block of consecutive numbers, so there are no

    --gaps/islands to account for. Because of that,

    --this would even work for the sample data,

    --but won't work now that there are gaps/islands

    SELECT Β Prefix,

    Start_Range=MIN(TagNumber),

    End_Range=MAX(TagNumber)

    FROM @tags

    GROUP BY Prefix;

    --This approach also won't work now

    WITH cteGroup AS (

    SELECT T1.*,

    ConsecutiveGroup = ROW_NUMBER() OVER(PARTITION BY T1.Prefix,

    CASE WHEN EXISTS (SELECT 1 FROM @tags AS T2 WHERE T2.Prefix = T1.Prefix AND T1.Number IN (T2.Number - 1, T2.Number + 1)) THEN 1 ELSE 0 END

    ORDER BY T1.Number) - ROW_NUMBER() OVER(ORDER BY T1.Prefix, T1.Number)

    FROM @tags AS T1

    )

    SELECT MIN(TagNumber) AS Start_Range,

    MAX(TagNumber) AS End_Range

    FROM cteGroup

    GROUP BY Prefix, ConsecutiveGroup

    ORDER BY MIN(TagNumber);

    --This is now many times more efficient, because I

    --realized a silly error I made before.

    --The MIN aggregation was completely unnecessary

    --I can just compare RN to the value for Number

    --There's no reason to compare current Number to the MIN

    WITH GroupedTags AS (

    SELECT RN=ROW_NUMBER() OVER (PARTITION BY Prefix ORDER BY Number asc),

    Number,

    TagNumber,

    Prefix

    FROM @tags)

    SELECT Prefix,

    Start_Range=MIN(TagNumber),

    End_Range=MAX(TagNumber)

    FROM GroupedTags

    GROUP BY Prefix, (RN-Number)

    ORDER BY Prefix ASC, Start_Range ASC;

    --A version of your original code,

    --now made prefix-aware.

    WITH StartingPoints AS

    (

    SELECT A.TagNumber, ROW_NUMBER() OVER(PARTITION BY Prefix ORDER BY A.TagNumber) AS rownum,

    Prefix

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT *

    FROM @tags AS B

    WHERE B.Number = A.Number - 1

    AND B.Prefix=A.Prefix)

    ),

    EndingPoints AS

    (

    SELECT A.TagNumber, ROW_NUMBER() OVER(PARTITION BY Prefix ORDER BY A.TagNumber) AS rownum,

    Prefix

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT *

    FROM @tags AS B

    WHERE B.Number = A.Number + 1

    AND B.Prefix=A.Prefix)

    )

    SELECT

    S.TagNumber AS start_range

    , E.TagNumber AS end_range

    FROM StartingPoints AS S

    JOIN EndingPoints AS E ON E.rownum = S.rownum

    AND S.Prefix=E.Prefix

    ORDER BY end_range;Β 

    --It's generally preferred to specify column names in the ORDER BY,Β 

    --and not just the ordinal position, even though that's possible,

    --so I changed that bit. It just makes the code easier to read and a bit more robust

    --(in case you add columns to the result set, for example)

    Cheers!

  • Jacob,

    Thanks for finding the boo-boo in my code. I didn't quite have the right condition specified, and with that method, the condition becoming correct would be ridiculouly complex, and possibly impossible to test for. Anyway, I really like to pretty up working code, so I took your code and tested it against my sample data, and it failed first time through, so I started digging, and discovered I had mucked up the added sample data. I then fixed the sample data and finished the prettying...

    DECLARE @tags AS TABLE (

    TagNumber varchar(38),

    Prefix varchar(7),

    Number numeric(38,0)

    );

    INSERT INTO @tags (TagNumber, Prefix, Number)

    VALUES

    ('840003008335608','8400030',8335608)

    ,('840003008335609','8400030',8335609)

    ,('840003008335610','8400030',8335610)

    ,('840003008335611','8400030',8335611)

    ,('840003008335612','8400030',8335612)

    ,('840003008335613','8400030',8335613)

    ,('840003108335614','8400031',8335614)

    ,('840003108335615','8400031',8335615)

    ,('840003108335616','8400031',8335616)

    ,('840003108335617','8400031',8335617)

    ,('840003108335619','8400031',8335619)-- Added for testing

    ,('840003108335620','8400031',8335620)-- Added for testing

    ,('91BDO2751','91BDO',2751)

    ,('91BDO2752','91BDO',2752)

    ,('91BDO2753','91BDO',2753)

    ,('91CQO2754','91CQO',2754)

    ,('91CQO2755','91CQO',2755)

    ,('91CQO2756','91CQO',2756)

    ,('91AEO2757','91AEO',2757)

    ,('91AEO2758','91AEO',2758)

    ,('91AEO2759','91AEO',2759)

    ,('91AEO2760','91AEO',2760);

    WITH StartingPoints AS (

    SELECT A.TagNumber,

    ROW_NUMBER() OVER(PARTITION BY A.Prefix ORDER BY A.TagNumber) AS RowNum, A.Prefix

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT 1

    FROM @tags AS B

    WHERE B.Prefix = A.Prefix

    AND B.Number = A.Number - 1

    )

    ),

    EndingPoints AS (

    SELECT A.TagNumber,

    ROW_NUMBER() OVER(PARTITION BY Prefix ORDER BY A.TagNumber) AS RowNum, Prefix

    FROM @tags AS A

    WHERE NOT EXISTS (

    SELECT 1

    FROM @tags AS B

    WHERE B.Number = A.Number + 1

    AND B.Prefix = A.Prefix

    )

    )

    SELECT S.TagNumber AS Start_Range, E.TagNumber AS End_Range

    FROM StartingPoints AS S

    INNER JOIN EndingPoints AS E

    ON S.Prefix = E.Prefix

    AND S.RowNum = E.RowNum

    ORDER BY Start_Range, End_Range;

    I liked your code because I had first thought I might need to do it this way, and then had this flash of brilliance, that turned out to be kinda dim in the long run.

    Steve (aka sgmunson) πŸ™‚ πŸ™‚ πŸ™‚
    Rent Servers for Income (picks and shovels strategy)

  • Thank you both so much for the assistance. I will look the code over to see if that resolves my problem as well as improves my understanding of the use of the CTE with the Row_Number Over(). I kept playing with the queries while waiting for a response and found that I could get the tags which had the prefix containing letters; however, whenever it tried to get the islands where the tagNumber was all numbers (like 840003008335608) the results always come back with a list of individual rows where the start_range was equal to the end_range.

    But, I will try both of your examples to see if it will solve my problem.

  • Steve,

    No problem! A second set of eyes is always nice to have, especially towards the end of a Friday πŸ™‚

    Of the two working queries I posted, it's still a bit funny to me how much more efficient the first one got when I removed the superfluous MIN. It seems to be getting it all in one scan, which is nice, and I like how concise it looks.

    Having said that, it's also not quite as obvious what it's doing as the fixed (and prettified, thank you!) version of the OP's code.

    Cheers!

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply