Queue processing -- MAX, correlated subquery and self-joins: which is best?

  • All-

    I've been working on a system where we basically process a queue repeatedly.  An item gets added to the queue and is processed in a batch with a success or failure stored in the table.  This same item may be processed in multiple batches either because it's changed or the previous batch failed for that item (don't worry about the changed logic for purposes of this question).  The table looks something like:

    DROP TABLE #Queue

    CREATE TABLE #Queue(Batch Int, Item Int, Success Bit)

    INSERT INTO #Queue VALUES(1, 1, 1) -- item one succeeds the first time
    INSERT INTO #Queue VALUES(2, 1, 0) -- but fails the second
    INSERT INTO #Queue VALUES(1, 2, 1) -- item two succeeds in both batches
    INSERT INTO #Queue VALUES(2, 2, 1)
    INSERT INTO #Queue VALUES(1, 3, 0) -- item three fails then succeeds
    INSERT INTO #Queue VALUES(2, 3, 1)
    INSERT INTO #Queue VALUES(1, 4, 0) -- item four fails in both batches
    INSERT INTO #Queue VALUES(2, 4, 0)

    How can I find all the items in which the MOST RECENT batch job FAILED without resorting to a correlated subquery?  Basically here is the query I'm using:

    SELECT *
      FROM #Queue q WHERE Success = 0
        AND NOT EXISTS -- where there's not a more recent batch with success
        (SELECT * FROM #Queue WHERE Batch > q.Batch AND Success = 1)

    This query shows that items 1 and 4 failed in the most recent run and need to be resubmitted.

    It really *FEELS* like I should be able to do this using a self-join and a MAX -- how can I do this?  Am I wrong to be prejudiced against correlated subqueries?  I'm under the impression they're generally inferior in performance.

    Thanks,

    Mark Denner

     

  • Mark,

    I would definitely use a correlated subquery for this:

    SELECT *
    FROM #Queue q
    WHERE Success = 0
    AND Batch =
        (SELECT MAX(Batch)
         FROM #Queue q1
         WHERE q1.Item = q.Item)

    Correlated subqueries _can be_ inferior in performance -- but in many cases, they can be BETTER performers, and in cases like this one, there is simply no other way to write the query.

    --
    Adam Machanic
    whoisactive

  • Hi,

    I have a couple of ideas but I need a bit of clarification.

    If, in your example, you miss out the line

    INSERT INTO #Queue VALUES(2, 3, 1)

    i.e. item 3 fails in batch 1 and is not included in batch 2 at all, then do you want the results to be:

    batch       item        Success
    ----------- ----------- -------
    2           1           0
    1           3           0
    2           4           0

    I ask because your query misses out the second line, which doesn’t sound right.

    About Subqueries:

    Whether they perform better or worse than other queries (predictably) depends on the query and the data.  I have noticed, however, that subqueries seem to restrict the optimiser’s options, so it can miss a good plan.

    In most cases you can rewrite a correlated subquery as a join, although this can be difficult and make the code more obscure.

    I usually use a join instead of EXISTS, which gives a good improvement in performance maybe one time in five and is rarely slower.
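    As a rough sketch of that rewrite against your #Queue table (just an illustration, not your actual query): to find failed rows that later succeeded for the same item, the two forms look like this.  Note the DISTINCT in the join version, which guards against duplicate rows when an item succeeds in more than one later batch.

    -- EXISTS version: failed rows with a later successful batch for the same item
    SELECT q1.*
    FROM #Queue q1
    WHERE q1.Success = 0
    AND EXISTS (SELECT * FROM #Queue q2
                WHERE q2.Item = q1.Item AND q2.Batch > q1.Batch AND q2.Success = 1)

    -- Join version of the same thing
    SELECT DISTINCT q1.Batch, q1.Item, q1.Success
    FROM #Queue q1
    JOIN #Queue q2
      ON q2.Item = q1.Item AND q2.Batch > q1.Batch AND q2.Success = 1
    WHERE q1.Success = 0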

    You can also do the same for NOT EXISTS, but it is less likely to improve performance and is much more obscure to read.

    This is my attempt to change your subquery to a join.  I think it does the same, but there isn't really enough data to test it.

    select Q1.batch, Q1.item, Q1.Success, Q2.batch, Q2.item, Q2.Success
    from #Queue Q1          -- all failures
    left join #Queue Q2     -- all later successful batches
    on Q1.batch < Q2.batch
    and Q2.success = 1
    where Q1.Success = 0
    and Q2.batch is NULL    -- equivalent of NOT EXISTS

     

    If  you do want all three results as listed above then you could try:

     

    select Q1.batch, Q1.item, Q1.Success, Q2.batch, Q2.item, Q2.Success
    from #Queue Q1
    left join #Queue Q2
    on Q1.item = Q2.item
    and Q1.batch < Q2.batch
    where Q1.Success = 0
    and Q2.batch is NULL

    The main difference is the join on Q1.item = Q2.item (your query would need AND Item = q.Item in the subquery to be equivalent).

     

    You might also try this:

     

    select Q1.batch, Q1.item, Q1.Success
    from #Queue Q1
    join (select max(batch) as batch, item from #Queue group by item) Q2
    on Q1.item = Q2.item
    and Q1.batch = Q2.batch
    where Q1.success = 0

     

    This uses a derived table (Q2) to work out the last batch that each item appears in, and then selects those which are failures.

     

    If you have a lot of batches then your subquery will probably be faster, as it can stop processing the subquery as soon as it finds a success row, whereas my alternatives do a lot of work before eliminating rows.
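    One more option, if you happen to be on SQL Server 2005 or later (an untested sketch against the same #Queue table): ROW_NUMBER() can pick out each item's most recent batch in a single pass, avoiding both the correlated subquery and the self-join.

    SELECT Batch, Item, Success
    FROM (SELECT Batch, Item, Success,
                 ROW_NUMBER() OVER (PARTITION BY Item ORDER BY Batch DESC) AS rn
          FROM #Queue) AS latest
    WHERE rn = 1
    AND Success = 0   -- against the sample data: items 1 and 4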

     

    Good luck!

     

     

  • Douglas-

    You're spot on -- thanks!  You are correct, my query is incorrect: it should have included something like AND Item = q.Item in the subquery.  The real query is quite complex and I left this out by accident when simplifying it for this example.
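    For anyone following along, the corrected version of the original query looks like this.  One caveat: an item that failed in several batches with no later success will appear once per failed batch (item 4 shows up twice against the sample data); Douglas's derived-table version returns only the latest row per item.

    SELECT *
      FROM #Queue q WHERE Success = 0
        AND NOT EXISTS -- no more recent successful batch for this item
        (SELECT * FROM #Queue WHERE Item = q.Item AND Batch > q.Batch AND Success = 1)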

    I've tried variations on your second query but failed to include the "Q2.batch is NULL".  Now that it's sitting in front of me it's perfectly obvious, but I've been trying to figure this out off and on for a week.  If there's such a thing as karma then yours has gotten a boost for relieving me of this frustration!

    Thanks,

    Mark Denner
