September 9, 2008 at 10:22 am
I was hoping that someone could help me avoid RBAR in my problem. See the code below...
--===== If the Point table already exists, drop it
IF OBJECT_ID('TempDB..#PointTable','U') IS NOT NULL
DROP TABLE #PointTable
--===== Create the Point table
CREATE TABLE #PointTable
(
ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
PointName VARCHAR(50),
DateValue DATETIME,
InRange VARCHAR(1))
-- InRange is 'Y' if the recorded point is good
-- or 'N' if the recorded point is bad
--===== Insert the data into the Point table
--===== For each point, there SHOULD be a record about every 15 minutes.
INSERT INTO #PointTable
( PointName, DateValue, InRange)
SELECT 'A', '2004-05-23T14:15:01','Y' UNION ALL
SELECT 'A', '2004-05-23T14:30:10','N' UNION ALL
-- Note that there is a missing record here.
SELECT 'A', '2004-05-23T15:00:05','N' UNION ALL
SELECT 'A', '2004-05-23T15:14:32','Y' UNION ALL
SELECT 'A', '2004-05-23T15:29:25','Y' UNION ALL
SELECT 'A', '2004-05-23T15:44:45','N' UNION ALL
SELECT 'A', '2004-05-23T15:59:01','Y' UNION ALL
SELECT 'A', '2004-05-23T16:15:10','N' UNION ALL
SELECT 'A', '2004-05-23T16:30:12','Y' UNION ALL
SELECT 'A', '2004-05-23T16:46:05','N' UNION ALL
SELECT 'B', '2004-05-23T14:15:32','N' UNION ALL
SELECT 'B', '2004-05-23T14:31:25','N' UNION ALL
SELECT 'B', '2004-05-23T14:45:45','Y' UNION ALL
SELECT 'B', '2004-05-23T15:00:05','Y' UNION ALL
SELECT 'C', '2004-05-23T14:15:32','N' UNION ALL
SELECT 'C', '2004-05-23T14:30:25','N' UNION ALL
SELECT 'C', '2004-05-23T14:45:45','N' UNION ALL
SELECT 'C', '2004-05-23T15:00:59','Y'
I need to locate point failures: that is, get all PointNames having at least 3 'bad' records in any 35-minute window. For the data above, only point 'C' is bad. 'B' has only 2 bad records within 35 minutes, and 'A' has several bad points, but never 3 of them within a 35-minute window.
I say 35 minutes, because the real requirement is '3 consecutive bad points'. I know that points are recorded about every 15 minutes, but that does not mean that I can rely on the DateValue to be accurate to the millisecond, and there might be missing data. So I figure that if the table contains 3 bad recorded points within 35 minutes, I have concrete proof that there is a failure.
The current "solution" I have is using nested cursors, but since I could have hundreds of thousands records, I wanted to see it there was a set-based solution.
Tip: http://www.sqlservercentral.com/articles/Advanced+Querying/61716/
This well-written article probably would be useful in solving the problem, but I am too new to writing stored procedures to understand it well enough.
Thanks! 😀
Paul DB
September 9, 2008 at 12:26 pm
select *
from #PointTable PT1
inner join #PointTable PT2 -- Second failure
    on PT1.PointName = PT2.PointName
    and PT2.DateValue between dateadd(minute, 14, PT1.DateValue)
                          and dateadd(minute, 16, PT1.DateValue)
inner join #PointTable PT3 -- Third failure
    on PT1.PointName = PT3.PointName
    and PT3.DateValue between dateadd(minute, 14, PT2.DateValue)
                          and dateadd(minute, 16, PT2.DateValue)
where PT1.InRange = 'N'
    and PT2.InRange = 'N'
    and PT3.InRange = 'N'
The dateadd range gives it some margin for error on the 15-minute rule.
On a large number of rows, having dateadd in the join might slow the whole thing down too much. You could overcome this by either adding a pair of computed columns, one for each end of the dateadd range, and indexing those, or by creating an indexed view with those columns in it.
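For instance, here's a rough sketch of the computed-column idea (the column and index names are just placeholders I made up, and I haven't tested this against your data):

alter table #PointTable add WindowStart as dateadd(minute, 14, DateValue) persisted;
alter table #PointTable add WindowEnd   as dateadd(minute, 16, DateValue) persisted;
create index IX_PointTable_Window
    on #PointTable (PointName, WindowStart, WindowEnd) include (InRange);

The join condition can then be written as PT2.DateValue between PT1.WindowStart and PT1.WindowEnd, so the dateadd work is stored once per row instead of being recomputed inside the join.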
Even with the dateadd in the join, that should be faster than nested cursors.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
September 9, 2008 at 2:54 pm
Thanks for the reply, GSquared. This is much better than what I came up with. :w00t:
A related question: Isn't "X between Y and Z" a double triangular join, which causes RBAR?
Referring to http://www.sqlservercentral.com/articles/T-SQL/61539/
Paul DB
September 10, 2008 at 9:27 am
Here is another solution sent in a private message.
SELECT a.PointName
FROM #PointTable a
WHERE
( SELECT Count(*)
FROM #PointTable aa
WHERE
aa.PointName = a.PointName
AND aa.DateValue BETWEEN a.DateValue AND DATEADD(MINUTE,35,a.DateValue)
AND aa.InRange = 'N'
) >= 3
GROUP BY a.PointName
Paul DB
September 10, 2008 at 11:38 am
No, that's not a triangular join. It's a linear join. No RBAR in it.
I just tested it on a copy of your table with an index on DateValue, PointName and InRange, with 10-thousand rows in the table. It used an index scan for the first copy of the table in the select (PT1), and then index seeks for the other two (PT2 and PT3).
With that, it took 15 milliseconds to resolve, and did find the right rows. Because of the triple join, it's a little heavy on scans (14), but not bad on reads (74). At least, with the data I generated for it. Your mileage will vary.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
September 10, 2008 at 11:43 am
The second solution, the one from the PM, is a triangular join.
Ran it on the same table and data that I just ran my solution on: 172 milliseconds to resolve, 10019 scans, 40313 logical reads.
So, it takes more than 10 times longer, and it's pounding the IO like a hyperactive bongo player. If the data were on disk instead of in cache, it would be a mess.
Both solutions were tested multiple times, to make sure they were operating on cache instead of disk and all that.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
September 11, 2008 at 7:58 am
How can I tell the difference between a triangular join and a linear join?
The closest I have been able to come to understanding this is from reading Jeff Moden's article http://www.sqlservercentral.com/articles/Performance+Tuning/62278/, which states... {emphasis added}
Tuning Opportunity?
Let's put that in terms that everyone would appreciate... because there is no triangular join in either code, we can make the assumption that the code is nearly linear in performance. For the Microsoft example, it would take 0.313/17*1000000 milliseconds or 18.4 seconds to update a million rows. Not bad... until you realize that the performance enabled version does it in 0.015/17*1000000 milliseconds or only 0.882 seconds.
So it seems that all joins that are not triangular joins behave linearly. Is that right?
Thanks for your comments, GSquared.
Paul DB
September 11, 2008 at 11:36 am
A triangular join is generally (loosely speaking) an inline query that references the outer query in such a way that it has to query all or most of the table for each outer row.
For example, the one here has to query the whole rest of the table for each row, because of the sub-query. That's why its scan and read counts were so high.
A linear process doesn't have the correlated sub-query in it that forces multiple scans.
All the triple-join has to do is join the table to itself, then join it again. That does require a few scans, but only a few, and a few reads, but only a few. Even though it's joining on a range instead of an equality, it's still not that much work for the system.
If you aren't quite sure if a query is hidden RBAR or not, use "set statistics io on" at the beginning of the command, and look for scans and reads that are incredibly high for the number of rows being returned. Take into account that joins add scans and reads, but not huge numbers.
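For example, wrapping the query under test like this will make the Messages tab report the scan and read counts for each table (the query here is just a stand-in):

set statistics io on;

select count(*)        -- put the query you're testing here
from #PointTable
where InRange = 'N';

set statistics io off;

A correlated sub-query that hits the table once per outer row shows up right away as a scan count close to the number of rows, which is exactly what happened with the second solution above.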
Did that clarify anything, or just make it muddier?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
September 11, 2008 at 2:12 pm
That clarified lots of things. 😀 I did not know how to "set statistics io on" and therefore was never seeing performance messages.
By the way... What query did you use to generate "10-thousand rows" in the Points table for your performance test? I ask because such a query would be useful in illustrating how I can properly test a query when I do not have many rows of real data in my current table.
Said differently: I know how to create a temp table with rows I've explicitly added via [SELECT ... UNION ALL SELECT ...]. But I don't know how to add thousands of records with fake yet unique data to that temp table. I'm guessing a tally table is used with incrementing DateValue values and random InRange values, but I can't quite understand it enough to form the query. 😛
Paul DB
September 11, 2008 at 3:06 pm
I didn't keep the specific script I used to populate the test data.
Generally, I use a Numbers table (same as a Tally table, just a different name) if I want sequential data, or need a specific number of rows. I use checksum and newID if I want random data.
For example:
create table #T (
    ID int identity primary key,
    RandomNumber float,
    RandomLetter char(1),
    SequentialTime datetime);

declare @letters char(26);
select @letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

insert into #T (RandomNumber, RandomLetter, SequentialTime)
select checksum(newid()),                         -- random number
       substring(@letters,                        -- random letter
                 abs(checksum(newid())) % 26 + 1, -- random number between 1 and 26
                 1),
       dateadd(minute, number, '1/1/2008')        -- sequential minutes
from dbo.Numbers
where number between 0 and 1000;                  -- specific number of rows
With the general principles in this, you can generate just about any amount of random or semi-random data in whatever format you need.
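For instance, here's one way you might adapt the same pattern to the #PointTable layout from this thread (the row count, point names, and failure rate below are arbitrary picks for illustration, not the exact data I timed):

insert into #PointTable (PointName, DateValue, InRange)
select char(65 + number % 5),                           -- points 'A' through 'E'
       dateadd(minute, 15 * (number / 5), '1/1/2008'),  -- one reading per point every 15 minutes
       case when abs(checksum(newid())) % 4 = 0
            then 'N' else 'Y' end                       -- roughly 1 in 4 readings bad
from dbo.Numbers
where number between 0 and 9999;                        -- 10-thousand rows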
My version of the Numbers table goes from 0 to 10-thousand. I have a SmallNumbers table that goes from 0 to 1-thousand, and I used to use a BigNumbers table for 0 to 1-million, but I've found that cross-joining Numbers to itself and using the Row_Number function is faster for numbers over 10-thousand, up to the max value for the Int data type (about 2.1 billion).
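A quick sketch of that cross-join trick, in case you ever need more rows than the Numbers table holds (the TOP value is just an example):

select top (1000000)
       row_number() over (order by (select null)) - 1 as number -- zero-based, like my Numbers table
from dbo.Numbers a
cross join dbo.Numbers b;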
Some people start their Numbers/Tally table at 1, I start mine at 0. It's just a matter of knowing where you started it, so that your calculations from it start at the right place.
I use 0 because when I'm adding dates/minutes/hours/whatever using dateadd, I like to start with my actual starting date/time, and that means dateadd zero for the first row. It's just more convenient for me. Other people do it other ways. They're all equally valid mathematically and for SQL, so it's a personal preference thing.
Does that help?
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
September 11, 2008 at 4:27 pm
GSquared (9/11/2008)
Does that help?
Yes. :Wow:
Paul DB
September 12, 2008 at 8:18 am
Good. 🙂
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon