How to locate 3 bad points in 35 minutes, all while avoiding RBAR

I was hoping that someone could help me avoid RBAR (row-by-agonizing-row processing) in my problem. See the code below...

    --===== If the Point table already exists, drop it
    IF OBJECT_ID('TempDB..#PointTable','U') IS NOT NULL
       DROP TABLE #PointTable

    --===== Create the Point table
    CREATE TABLE #PointTable
    (
       ID        INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
       PointName VARCHAR(50),
       DateValue DATETIME,
       InRange   VARCHAR(1) -- 'Y' if the recorded point is good,
                            -- or 'N' if the recorded point is bad
    )

    --===== Insert the data into the Point table.
    --      For each point, there SHOULD be a record about every 15 minutes.
    INSERT INTO #PointTable
       (PointName, DateValue, InRange)
    SELECT 'A', '2004-05-23T14:15:01','Y' UNION ALL
    SELECT 'A', '2004-05-23T14:30:10','N' UNION ALL
    -- Note that there is a missing record here.
    SELECT 'A', '2004-05-23T15:00:05','N' UNION ALL
    SELECT 'A', '2004-05-23T15:14:32','Y' UNION ALL
    SELECT 'A', '2004-05-23T15:29:25','Y' UNION ALL
    SELECT 'A', '2004-05-23T15:44:45','N' UNION ALL
    SELECT 'A', '2004-05-23T15:59:01','Y' UNION ALL
    SELECT 'A', '2004-05-23T16:15:10','N' UNION ALL
    SELECT 'A', '2004-05-23T16:30:12','Y' UNION ALL
    SELECT 'A', '2004-05-23T16:46:05','N' UNION ALL
    SELECT 'B', '2004-05-23T14:15:32','N' UNION ALL
    SELECT 'B', '2004-05-23T14:31:25','N' UNION ALL
    SELECT 'B', '2004-05-23T14:45:45','Y' UNION ALL
    SELECT 'B', '2004-05-23T15:00:05','Y' UNION ALL
    SELECT 'C', '2004-05-23T14:15:32','N' UNION ALL
    SELECT 'C', '2004-05-23T14:30:25','N' UNION ALL
    SELECT 'C', '2004-05-23T14:45:45','N' UNION ALL
    SELECT 'C', '2004-05-23T15:00:59','Y'

    I need to locate point failures: that is, get all PointNames having at least 3 'bad' records in any 35-minute window. For the data above, only point 'C' is bad. 'B' has only 2 bad records in total. 'A' has several bad records, but never 3 of them inside any 35-minute window.

    I say 35 minutes because the real requirement is '3 consecutive bad points'. I know that points are recorded about every 15 minutes, but that does not mean I can rely on DateValue being accurate to the millisecond, and there might be missing data. Two 15-minute gaps plus a few minutes of slack come to about 35 minutes, so I figure that if the table contains 3 bad recorded points within 35 minutes, I have concrete proof that there is a failure.

    The current "solution" I have uses nested cursors, but since I could have hundreds of thousands of records, I wanted to see if there was a set-based solution.

    Tip: http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    This well-written article probably would be useful in solving the problem, but I am too new to writing stored procedures to understand it well enough.

    Thanks! 😀

    Paul DB

    select *
    from #PointTable PT1
    inner join #PointTable PT2 -- Second failure
       on  PT1.PointName = PT2.PointName
       and PT2.DateValue between dateadd(minute, 14, PT1.DateValue)
                             and dateadd(minute, 16, PT1.DateValue)
    inner join #PointTable PT3 -- Third failure
       on  PT1.PointName = PT3.PointName
       and PT3.DateValue between dateadd(minute, 14, PT2.DateValue)
                             and dateadd(minute, 16, PT2.DateValue)
    where PT1.InRange = 'N'
      and PT2.InRange = 'N'
      and PT3.InRange = 'N'

    The dateadd range gives it some margin for error on the 15-minute rule.

    On a large number of rows, having dateadd in the join might slow the whole thing down too much. You could overcome that either by adding a pair of computed columns, one for each end of the dateadd range, and indexing those, or by creating an indexed view with those columns in it.

    Even with the dateadd in the join, that should be faster than nested cursors.
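
    For example, a rough sketch of the computed-column version (the column and index names here are just illustrative):

    --===== Persisted computed columns for the two ends of the dateadd range,
    --      so the joins can seek an index instead of computing DATEADD
    --      against every candidate row.
    ALTER TABLE #PointTable ADD WindowStart AS DATEADD(MINUTE, 14, DateValue) PERSISTED;
    ALTER TABLE #PointTable ADD WindowEnd   AS DATEADD(MINUTE, 16, DateValue) PERSISTED;

    CREATE INDEX IX_PointTable_Window
       ON #PointTable (PointName, WindowStart, WindowEnd);

    The joins would then test PT2.DateValue between PT1.WindowStart and PT1.WindowEnd, with both bounds precomputed and indexed.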

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks for the reply, GSquared. This is much better than what I came up with. :w00t:

    A related question: Isn't "X between Y and Z" a double triangular join, which causes RBAR?

    Referring to http://www.sqlservercentral.com/articles/T-SQL/61539/

    Paul DB

  • Here is another solution sent in a private message.

    SELECT a.PointName
    FROM #PointTable a
    WHERE
       ( SELECT Count(*)
         FROM #PointTable aa
         WHERE aa.PointName = a.PointName
           AND aa.DateValue BETWEEN a.DateValue AND DATEADD(MINUTE,35,a.DateValue)
           AND aa.InRange = 'N'
       ) >= 3
    GROUP BY a.PointName

    Paul DB

  • No, that's not a triangular join. It's a linear join. No RBAR in it.

    I just tested it on a copy of your table, with an index on DateValue, PointName and InRange, and 10-thousand rows in the table. It used an index scan for the first copy of the table in the select (PT1), and then index seeks for the other two (PT2 and PT3).

    With that, it took 15 milliseconds to resolve, and it did find the right rows. Because of the triple join, it's a little heavy on scans (14) but not bad on reads (74), at least with the data I generated for it. Your mileage will vary.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • The second solution, the one from the PM, is a triangular join.

    Ran it on the same table and data that I just ran my solution on: 172 milliseconds to resolve, 10019 scans, 40313 logical reads.

    So, it takes more than 10 times longer, and it's pounding the IO like a hyperactive bongo player. If the data were on disk instead of in cache, it would be a mess.

    Both solutions were tested multiple times, to make sure they were operating on cache instead of disk and all that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • How can I tell the difference between a triangular join and a linear join?

    The closest I have been able to come to understanding this is by reading Jeff Moden's article http://www.sqlservercentral.com/articles/Performance+Tuning/62278/, which states... {emphasis added}

    Tuning Opportunity?

    Let's put that in terms that everyone would appreciate... because there is no triangular join in either code, we can make the assumption that the code is nearly linear in performance. For the Microsoft example, it would take 0.313/17*1000000 milliseconds or 18.4 seconds to update a million rows. Not bad... until you realize that the performance enabled version does it in 0.015/17*1000000 milliseconds or only 0.882 seconds.

    So it seems that all joins that are not triangular joins behave linearly. Is that right?

    Thanks for your comments, GSquared.

    Paul DB

    A triangular join is, generally (and loosely) speaking, an inline query that references the outer query in such a way that it has to re-read all or most of the outer table for each row.

    For example, the one here has to query the whole rest of the table for each row because of the correlated sub-query. That's why its scan and read counts were so high.

    A linear process doesn't have the correlated sub-query in it that forces multiple scans.

    All the triple-join has to do is join the table to itself, then join it again. That does require a few scans, but only a few, and a few reads, but only a few. Even though it's joining on a range instead of an equality, it's still not that much work for the system.
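
    For contrast, the classic shape of a triangular join is a correlated aggregate like this (a made-up running count against your #PointTable; for each row, the sub-query re-reads every earlier row, so the work grows as roughly N-squared-over-2):

    -- Hidden RBAR: row 1 reads 1 row, row 2 reads 2 rows, ... row N reads N rows.
    select a.ID,
           (select count(*)
            from #PointTable b
            where b.ID <= a.ID) as RowsSoFar
    from #PointTable a;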

    If you aren't quite sure if a query is hidden RBAR or not, use "set statistics io on" at the beginning of the command, and look for scans and reads that are incredibly high for the number of rows being returned. Take into account that joins add scans and reads, but not huge numbers.
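
    For example (the Messages-tab output sketched in the comment is roughly what SSMS prints; your numbers will differ):

    set statistics io on;

    select PT1.PointName            -- the query you want to measure goes here
    from #PointTable PT1
    where PT1.InRange = 'N';

    set statistics io off;

    -- The Messages tab then shows something like:
    -- Table '#PointTable'. Scan count 1, logical reads 2, physical reads 0, ...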

    Did that clarify anything, or just make it muddier?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • That clarified lots of things. 😀 I did not know how to "set statistics io on" and therefore was never seeing performance messages.

    By the way... What query did you use to generate "10-thousand rows" in the Points table for your performance test? I ask because such a query would be useful in illustrating how I can properly test a query when I do not have many rows of real data in my current table.

    Said differently: I know how to create a temp table with rows I've explicitly added via [SELECT ... UNION ALL SELECT ...]. But I don't know how to add thousands of records with fake yet unique data to that temp table. I'm guessing a tally table is used with incrementing DateValue values and random InRange values, but I can't quite understand it enough to form the query. 😛

    Paul DB

  • I didn't keep the specific script I used to populate the test data.

    Generally, I use a Numbers table (same as a Tally table, just a different name) if I want sequential data, or need a specific number of rows. I use checksum and newID if I want random data.

    For example:

    create table #T (
       ID int identity primary key,
       RandomNumber float,
       RandomLetter char(1),
       SequentialTime datetime);

    declare @letters char(26);
    select @letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

    insert into #T (RandomNumber, RandomLetter, SequentialTime)
    select checksum(newid()),                -- random number
           substring(@letters,               -- random letter
              abs(checksum(newid()))%26 + 1, -- random position between 1 and 26
              1),
           dateadd(minute, number, '1/1/2008') -- sequential minutes
    from dbo.Numbers
    where number between 0 and 1000;         -- specific number of rows

    With the general principles in this, you can generate just about any amount of random or semi-random data in whatever format you need.
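
    To tie that back to your Points table, here's a rough sketch along the same lines (the one-in-ten bad-reading ratio is just made up, and the per-point spacing is looser than real data, but it's fine for a performance test):

    insert into #PointTable (PointName, DateValue, InRange)
    select char(abs(checksum(newid()))%3 + 65),    -- random point name 'A', 'B', or 'C'
           dateadd(minute, number*15, '1/1/2008'), -- a reading about every 15 minutes
           case when abs(checksum(newid()))%10 = 0 -- roughly 1 reading in 10 is bad
                then 'N' else 'Y' end
    from dbo.Numbers
    where number between 0 and 9999;               -- 10-thousand rows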

    My version of the Numbers table goes from 0 to 10-thousand. I have a SmallNumbers table that goes from 0 to 1-thousand, and I used to use a BigNumbers table for 0 to 1-million, but I've found that cross-joining Numbers to itself and using the Row_Number function is faster for numbers over 10-thousand, all the way up to the max value of the Int data type (about 2.1 billion).
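
    The cross-join version looks like this (a sketch; two copies of a 10-thousand-row Numbers table cover about 100-million combinations, and TOP trims that to what you need):

    select top (1000000)
           row_number() over (order by n1.number) as number
    from dbo.Numbers n1
    cross join dbo.Numbers n2;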

    Some people start their Numbers/Tally table at 1, I start mine at 0. It's just a matter of knowing where you started it, so that your calculations from it start at the right place.

    I use 0 because when I'm adding dates/minutes/hours/whatever using dateadd, I like to start with my actual starting date/time, and that means dateadd zero for the first row. It's just more convenient for me. Other people do it other ways. They're all equally valid mathematically and for SQL, so it's a personal preference thing.
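
    That way, row zero lands exactly on the starting time:

    select dateadd(minute, 0, '1/1/2008'); -- returns 2008-01-01 00:00:00.000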

    Does that help?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    GSquared (9/11/2008)

    Does that help?

    Yes. :Wow:

    Paul DB

  • Good. 🙂

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
