Queue processing -- MAX, correlated subquery and self-joins: which is best?

  • All-

    I've been working on a system where we basically process a queue repeatedly.  An item gets added to the queue and is processed in a batch with a success or failure stored in the table.  This same item may be processed in multiple batches either because it's changed or the previous batch failed for that item (don't worry about the changed logic for purposes of this question).  The table looks something like:

    DROP TABLE #Queue

    CREATE TABLE #Queue(Batch Int, Item Int, Success Bit)

    INSERT INTO #Queue VALUES(1, 1, 1) -- item one succeeds the first time
    INSERT INTO #Queue VALUES(2, 1, 0) -- but fails the second
    INSERT INTO #Queue VALUES(1, 2, 1) -- item two succeeds in both batches
    INSERT INTO #Queue VALUES(2, 2, 1)
    INSERT INTO #Queue VALUES(1, 3, 0) -- item three fails then succeeds
    INSERT INTO #Queue VALUES(2, 3, 1)
    INSERT INTO #Queue VALUES(1, 4, 0) -- item four fails in both batches
    INSERT INTO #Queue VALUES(2, 4, 0)

    How can I find all the items in which the MOST RECENT batch job FAILED without resorting to a correlated subquery?  Basically here is the query I'm using:

    SELECT *
      FROM #Queue q WHERE Success = 0
        AND NOT EXISTS -- where there's not a more recent batch with success
        (SELECT * FROM #Queue WHERE Batch > q.Batch AND Success = 1)

    This query shows that items 1 and 4 failed in the most recent run and need to be resubmitted.

    It really *FEELS* like I should be able to do this using a self-join and a MAX -- how can I do this?  Am I wrong to be prejudiced against correlated subqueries?  I'm under the impression they're generally inferior in performance.

    Thanks,

    Mark Denner

     

  • Mark,

    I would definitely use a correlated subquery for this:

    SELECT *
    FROM #Queue q
    WHERE Success = 0
    AND Batch =
        (SELECT MAX(Batch)
         FROM #Queue q1
         WHERE q1.Item = q.Item)

    Correlated subqueries _can be_ inferior in performance -- but in many cases, they can be BETTER performers, and in cases like this one, there is simply no other way to write the query.

    --
    Adam Machanic
    whoisactive

  • Hi,

    I have a couple of ideas but I need a bit of clarification.

    If, in your example, you miss out the line

    INSERT INTO #Queue VALUES(2, 3, 1)

    i.e. item 3 fails in batch 1 and is not included in batch 2 at all, then do you want the results to be:

    batch       item        Success
    ----------- ----------- -------
    2           1           0
    1           3           0
    2           4           0

    I ask because your query misses out the second line, which doesn’t sound right.

    About Subqueries:

    Whether they perform better or worse than other queries (predictably) depends on the query and the data.  I have noticed, however, that subqueries seem to restrict the optimiser’s options, so it can miss a good plan.

    In most cases you can rewrite a correlated subquery as a join, although this can be difficult and make the code more obscure.

    I usually use a join instead of EXISTS, which gives a good improvement in performance maybe one time in five and is rarely slower.
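    As a rough sketch of that rewrite against your #Queue table (just an illustration, not your actual query): to find failed rows that later succeeded for the same item, the two forms look like this.  Note the DISTINCT in the join version, which guards against duplicate rows when an item succeeds in more than one later batch.

    -- EXISTS version: failed rows with a later successful batch for the same item
    SELECT q1.*
    FROM #Queue q1
    WHERE q1.Success = 0
    AND EXISTS (SELECT * FROM #Queue q2
                WHERE q2.Item = q1.Item AND q2.Batch > q1.Batch AND q2.Success = 1)

    -- Join version of the same thing
    SELECT DISTINCT q1.Batch, q1.Item, q1.Success
    FROM #Queue q1
    JOIN #Queue q2
      ON q2.Item = q1.Item AND q2.Batch > q1.Batch AND q2.Success = 1
    WHERE q1.Success = 0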

    You can also do the same for NOT EXISTS, but it is less likely to improve performance and is much more obscure to read.

    This is my attempt to change your subquery to a join.  I think it does the same, but there isn't really enough data to test it.

    select Q1.batch, Q1.item, Q1.Success, Q2.batch, Q2.item, Q2.Success
    from #Queue Q1          -- all failures
    left join #Queue Q2     -- all later successful batches
    on Q1.batch < Q2.batch
    and Q2.success = 1
    where Q1.Success = 0
    and Q2.batch is NULL    -- equivalent of NOT EXISTS

     

    If  you do want all three results as listed above then you could try:

     

    select Q1.batch, Q1.item, Q1.Success, Q2.batch, Q2.item, Q2.Success
    from #Queue Q1
    left join #Queue Q2
    on Q1.item = Q2.item
    and Q1.batch < Q2.batch
    where Q1.Success = 0
    and Q2.batch is NULL

    The main difference is the join on Q1.item = Q2.item (your query would need AND Item = q.Item in the subquery to be equivalent).

     

    You might also try this:

     

    select Q1.batch, Q1.item, Q1.Success
    from #Queue Q1
    join (select max(batch) as batch, item from #Queue group by item) Q2
    on Q1.item = Q2.item
    and Q1.batch = Q2.batch
    where Q1.success = 0

     

    This uses a derived table (Q2) to work out the last batch that each item appears in, and then selects those which are failures.

     

    If you have a lot of batches then your subquery will probably be faster, as it can stop processing the subquery as soon as it finds a success row, whereas my alternatives do a lot of work before eliminating rows.
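    One more option, if you happen to be on SQL Server 2005 or later (an untested sketch against the same #Queue table): ROW_NUMBER() can pick out each item's most recent batch in a single pass, avoiding both the correlated subquery and the self-join.

    SELECT Batch, Item, Success
    FROM (SELECT Batch, Item, Success,
                 ROW_NUMBER() OVER (PARTITION BY Item ORDER BY Batch DESC) AS rn
          FROM #Queue) AS latest
    WHERE rn = 1
    AND Success = 0   -- against the sample data: items 1 and 4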

     

    Good luck!

     

     

  • Douglas-

    You're spot on -- thanks!  You are correct, my query is incorrect: it should have included something like AND Item = q.Item in the subquery.  The real query is quite complex and I left this out by accident when simplifying it for this example.
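    For anyone following along, the corrected version of the original query looks like this.  One caveat: an item that failed in several batches with no later success will appear once per failed batch (item 4 shows up twice against the sample data); Douglas's derived-table version returns only the latest row per item.

    SELECT *
      FROM #Queue q WHERE Success = 0
        AND NOT EXISTS -- no more recent successful batch for this item
        (SELECT * FROM #Queue WHERE Item = q.Item AND Batch > q.Batch AND Success = 1)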

    I've tried variations on your second query but failed to include the "Q2.batch is NULL".  Now that it's sitting in front of me it's perfectly obvious, but I've been trying to figure this out off and on for a week.  If there's such a thing as karma then yours has gotten a boost for relieving me of this frustration!

    Thanks,

    Mark Denner
