Slow unions

  • Hello there, this is my first post here, so please excuse any etiquette mistakes. This may already have been answered elsewhere, but I can't find any specific references to the problem in these forums.

    I frequently have to query a large table (50M+ records) that is badly indexed in order to identify a very small subset of records. This table has rows of data that may or may not have a matching row immediately following that extends the data in the original row eg:

    Row 1: ID 1, Record type "T", Other data

    Row 2: ID 2, Record type "C", Some other data that belongs with row 1

    Row 3: ID 3, Record type "T", Some other data

    Row 4: ID 4, Record type "T", More data

    Row 5: ID 5, Record type "C", More data that belongs with row 4

    Row ...

    I can run an initial query that returns the T records by their IDs (nice and quick as the ID column is the primary clustered index). I then run a query that uses the IDs +1, looking for C records like this:

    select *

    from db..table with (nolock)

    where id in (944770735,944770759,948267730,948267745)

    select *

    from db..table with (nolock)

    where id in (944770736,944770760,948267731,948267746)

    and RecordType='C'

    The problem is, I want all these records in one result set so I use a union between the two and order by the ID. At this point the query starts to churn (even trying to get the execution plan takes forever), while the original queries take less than a second to run.

    The estimated execution plan shows two separate index seeks and a merge of the final selections to produce the result (shouldn't this be as fast as doing the queries separately as above?).

    Am I missing some really important but simple error that I am making here? Is the query engine merging the whole table to itself before looking up the IDs? I'm not sure where to look, so any help would be really useful.

    Thanks in advance,

    Andrew

  • Hello

    Two quick suggestions.

    Firstly, merging the rows:

    select a.*, '#' AS '#', b.*

    from db..table a with (nolock)

    LEFT JOIN db..table b ON b.id = a.id + 1 and b.RecordType='C'

    where b.id in (944770735,944770759,948267730,948267745)

    Secondly - why not put your ID list into a temp table (with the recordtype), and get everything in one pass by joining to it?

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Blimey, that was quick. Thanks for that.

    I've tried both approaches and often use the temp table approach as I have to update the result set, so the join isn't so easy to use there.

    However, I was hoping to find some nice elegant set based way of returning the required rows that I could use in an update that didn't trash the server every time I ran it. I just can't see why the union takes so long.

    I'll keep using the temp tables for the moment as it seems to be the fastest and most reliable way of achieving the desired result and I never have to fetch more than a few hundred rows, so it works niceley.

    I'll keep on reading about how to optimise my queries ...

    Thanks for the help.

  • You're welcome.

    Did you mean storing your intermediate result in a temp table, or using it as source for your id's? There might be some mileage in this approach:

    DROP TABLE #temp

    CREATE TABLE #temp (id int, RecordType VARCHAR(1))

    -- gather T rows

    INSERT INTO #temp (id, RecordType)

    SELECT 944770735, 'T' UNION ALL

    SELECT 944770759, 'T' UNION ALL

    SELECT 948267730, 'T' UNION ALL

    SELECT 948267745, 'T'

    -- create C rows

    INSERT INTO #temp (id, RecordType)

    SELECT id+1, 'C'

    FROM #temp

    -- run query

    SELECT a.*

    FROM table a with (nolock)

    INNER JOIN #temp b ON b.id = a.id AND b.RecordType = a.RecordType

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I generally use a "find and keep" approach - I find the initial records (T records as they have more information to select on) and keep them in a temp table. Then I find the next ID in the main table by selecting the ID+1 from the temp table and check to see if it's a C record. If it is, it gets appended to the temp table. This then gives me the complete set of records.

    If I need to update the records (as I usually do), I join the temp table back to the main table using the ID and and make the update on the main table.

    I have to do lots of these kinds of operations (our system is stupidly complicated) and I'm always looking for ways of using set based queries to do these things.

    Thanks again for all your help

  • If you are using a UNION instead of a UNION ALL that is a performance hit as well. UNION returns distinct rows of data so you have to deal with all of the comparisons. If you're sure that the data is already distinct (should be based on the ID as you mentioned) then using UNION ALL should be as fast as the total for each individual query.

    ST

  • andrew.rendall (4/14/2009)


    I generally use a "find and keep" approach - I find the initial records (T records as they have more information to select on) and keep them in a temp table. Then I find the next ID in the main table by selecting the ID+1 from the temp table and check to see if it's a C record. If it is, it gets appended to the temp table. This then gives me the complete set of records.

    If I need to update the records (as I usually do), I join the temp table back to the main table using the ID and and make the update on the main table.

    I have to do lots of these kinds of operations (our system is stupidly complicated) and I'm always looking for ways of using set based queries to do these things.

    Thanks again for all your help

    How do you deliver your list of ID's to the first query - the one which obtains T records? Is it a table or a comma-delimited list as you show above?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • This is where it gets complicated. I tried to simplify the question so as not to confuse the issue (the union).

    The actual query uses dates, account codes (decimal), text and anything else we can get hold of to narrow down the search across many different columns. Some are indexed, some are not. We obviously try to use the indexed columns first before reverting to unindexed data.

    We have system ID and account ID columns (from other tables) that are indexed decimals and we use them a lot, then we tend to narrow things down by date and lastly by text information which by necessity requires left() and right() functions to use (I won't explain why). These rather round-about queries give us the IDs of the rows that we want to work with the "T" records.

    We then need to find any matching "C" records and the whole lot gets updated. Simple really ... 🙂

    Hopefully by the end of the year we will have migrated a lot of our operations to a newly designed system that is just as complicated, but better structured. Hopefully this issue won't be a problem again in the future.

    As I mentioned, I will probably continue to use the temp table method for extracting the information we need.

    Thanks.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply