a query using covering index, merge join, hash join

  • Hi all,

    Please note the real question is at the end.

    I have following query, both pto and ph has about 30million rows. The query initially run very slow (3 mins). So I added two index on pto, ph respectively.

    SELECT

    MAX(ph.txn_date_time)

    FROM

    pto AS pto WITH (NOLOCK)

    INNER JOIN ph AS ph WITH (NOLOCK) ON ph.receipt_id = pto.receipt_id

    WHERE

    pto.subtype = 'ff'

    AND pto.Units_No > 0

    AND ph.branch_id = 5

    CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]

    (

    [SUBTYPE] ASC,

    [Units_No] ASC,

    [RECEIPT_ID] ASC

    )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]

    CREATE NONCLUSTERED INDEX [IX_ph_branchReceiptTxn] ON [dbo].[ph]

    (

    [BRANCH_ID] ASC,

    [RECEIPT_ID] ASC,

    [TXN_DATE_TIME] ASC

    )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]

    Now the query runs in 350ms. Great. The execution plan is also very simple, it uses the created index from the two tables and did a Hash join on the receipt_id column then a Stream Aggregate to do the MAX(ph.txn_date_time). So every column in the query is covered by the two added index.

    The question is why it used a Hash join on the receipt_id column? I mean since RECEIPT_ID in both indexes are sorted the optimizer should have used a merge join. To figure out why I changed the first index to below (put RECEIPT_ID before Units_No).

    CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]

    (

    [SUBTYPE] ASC,

    [RECEIPT_ID] ASC,

    [Units_No] ASC

    )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]

    And now I see the Merge join on the RECEIPT_ID column. The query also runs in 170ms. Now obviously the optimizer think the RECEIPT_ID in both indexes are sorted so a merge join is used. But I don't understand why in the first case it doesn't think so?

    Thanks,

    Nian

  • nzhang6666 (8/23/2012)


    Hi all,

    Please note the real question is at the end.

    I have following query, both pto and ph has about 30million rows. The query initially run very slow (3 mins). So I added two index on pto, ph respectively.

    SELECT

    MAX(ph.txn_date_time)

    FROM

    pto AS pto WITH (NOLOCK)

    INNER JOIN ph AS ph WITH (NOLOCK) ON ph.receipt_id = pto.receipt_id

    WHERE

    pto.subtype = 'ff'

    AND pto.Units_No > 0

    AND ph.branch_id = 5

    CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]

    (

    [SUBTYPE] ASC,

    [Units_No] ASC,

    [RECEIPT_ID] ASC

    )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]

    CREATE NONCLUSTERED INDEX [IX_ph_branchReceiptTxn] ON [dbo].[ph]

    (

    [BRANCH_ID] ASC,

    [RECEIPT_ID] ASC,

    [TXN_DATE_TIME] ASC

    )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]

    Now the query runs in 350ms. Great. The execution plan is also very simple, it uses the created index from the two tables and did a Hash join on the receipt_id column then a Stream Aggregate to do the MAX(ph.txn_date_time). So every column in the query is covered by the two added index.

    The question is why it used a Hash join on the receipt_id column? I mean since RECEIPT_ID in both indexes are sorted the optimizer should have used a merge join. To figure out why I changed the first index to below (put RECEIPT_ID before Units_No).

    CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]

    (

    [SUBTYPE] ASC,

    [RECEIPT_ID] ASC,

    [Units_No] ASC

    )WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]

    And now I see the Merge join on the RECEIPT_ID column. The query also runs in 170ms. Now obviously the optimizer think the RECEIPT_ID in both indexes are sorted so a merge join is used. But I don't understand why in the first case it doesn't think so?

    Thanks,

    Nian

    Look at the data and compare the output from the following queries:

    select top (100)

    [SUBTYPE],

    [RECEIPT_ID],

    [Units_No]

    from

    pto

    order by

    [SUBTYPE],

    [RECEIPT_ID],

    [Units_No];

    select top (100)

    [SUBTYPE],

    [RECEIPT_ID],

    [Units_No]

    from

    pto

    order by

    [SUBTYPE],

    [Units_No],

    [RECEIPT_ID];

  • Hi Lynn,

    Just compared the results and they are same.

    I don't understand why you suggest doing this. I mean even the data is different but they are still ordered. As far as I know "A Merge Join occurs on tables where the join columns are presorted." is it?

    Nian

  • nzhang6666 (8/23/2012)


    Hi Lynn,

    Just compared the results and they are same.

    I don't understand why you suggest doing this. I mean even the data is different but they are still ordered. As far as I know "A Merge Join occurs on tables where the join columns are presorted." is it?

    Nian

    Perhaps not enough values were returned. Look at the ORDER BY's, each matches the one of the indexes you created. Come up with a small set of rapidly changing data for all three columns and see what each one would return.

    Look at the order of the columns between the two indexes:

    [SUBTYPE], [Units_No], [RECEIPT_ID]

    [SUBTYPE], [RECEIPT_ID], [Units_No]

    The data in the first could look like:

    ff,1,101

    ff,1,103

    ff,2,104

    ff,2,102

    ff,3,102

    The same data in the second:

    ff,101,1

    ff,102,2

    ff,102,3

    ff,103,1

    ff,104,2

    See a difference?

  • Hi Lynn,

    I got the point. The receipt_id can not guaranteed to be ordered, if it's not the first column in the ordering columns. In this case the subtype is fixed, so receipt_id is guaranteed to be ordered when it is the second column.

    Thank you.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply