a query using covering index, merge join, hash join

Question

a query using covering index, merge join, hash join

nzhang6666

SSCrazy

Points: 2217
More actions
August 23, 2012 at 8:08 am

#266398

Hi all,
Please note the real question is at the end.
I have following query, both pto and ph has about 30million rows. The query initially run very slow (3 mins). So I added two index on pto, ph respectively.
SELECT
MAX(ph.txn_date_time)
FROM
pto AS pto WITH (NOLOCK)
INNER JOIN ph AS ph WITH (NOLOCK) ON ph.receipt_id = pto.receipt_id
WHERE
pto.subtype = 'ff'
AND pto.Units_No > 0
AND ph.branch_id = 5
CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]
(
[SUBTYPE] ASC,
[Units_No] ASC,
[RECEIPT_ID] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
CREATE NONCLUSTERED INDEX [IX_ph_branchReceiptTxn] ON [dbo].[ph]
(
[BRANCH_ID] ASC,
[RECEIPT_ID] ASC,
[TXN_DATE_TIME] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
Now the query runs in 350ms. Great. The execution plan is also very simple, it uses the created index from the two tables and did a Hash join on the receipt_id column then a Stream Aggregate to do the MAX(ph.txn_date_time). So every column in the query is covered by the two added index.
The question is why it used a Hash join on the receipt_id column? I mean since RECEIPT_ID in both indexes are sorted the optimizer should have used a merge join. To figure out why I changed the first index to below (put RECEIPT_ID before Units_No).
CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]
(
[SUBTYPE] ASC,
[RECEIPT_ID] ASC,
[Units_No] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
And now I see the Merge join on the RECEIPT_ID column. The query also runs in 170ms. Now obviously the optimizer think the RECEIPT_ID in both indexes are sorted so a merge join is used. But I don't understand why in the first case it doesn't think so?
Thanks,
Nian

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 1

nzhang6666 (8/23/2012)
Hi all,
Please note the real question is at the end.
I have following query, both pto and ph has about 30million rows. The query initially run very slow (3 mins). So I added two index on pto, ph respectively.
SELECT
MAX(ph.txn_date_time)
FROM
pto AS pto WITH (NOLOCK)
INNER JOIN ph AS ph WITH (NOLOCK) ON ph.receipt_id = pto.receipt_id
WHERE
pto.subtype = 'ff'
AND pto.Units_No > 0
AND ph.branch_id = 5
CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]
(
[SUBTYPE] ASC,
[Units_No] ASC,
[RECEIPT_ID] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
CREATE NONCLUSTERED INDEX [IX_ph_branchReceiptTxn] ON [dbo].[ph]
(
[BRANCH_ID] ASC,
[RECEIPT_ID] ASC,
[TXN_DATE_TIME] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
Now the query runs in 350ms. Great. The execution plan is also very simple, it uses the created index from the two tables and did a Hash join on the receipt_id column then a Stream Aggregate to do the MAX(ph.txn_date_time). So every column in the query is covered by the two added index.
The question is why it used a Hash join on the receipt_id column? I mean since RECEIPT_ID in both indexes are sorted the optimizer should have used a merge join. To figure out why I changed the first index to below (put RECEIPT_ID before Units_No).
CREATE NONCLUSTERED INDEX [IX_pto_subTypeUnitReceipt] ON [dbo].[pto]
(
[SUBTYPE] ASC,
[RECEIPT_ID] ASC,
[Units_No] ASC
)WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = ON, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [Indexes]
And now I see the Merge join on the RECEIPT_ID column. The query also runs in 170ms. Now obviously the optimizer think the RECEIPT_ID in both indexes are sorted so a merge join is used. But I don't understand why in the first case it doesn't think so?
Thanks,
Nian

Look at the data and compare the output from the following queries:

select top (100)

[SUBTYPE],

[RECEIPT_ID],

[Units_No]

from

pto

order by

[SUBTYPE],

[RECEIPT_ID],

[Units_No];

select top (100)

[SUBTYPE],

[RECEIPT_ID],

[Units_No]

from

pto

order by

[SUBTYPE],

[Units_No],

[RECEIPT_ID];

nzhang6666 SSCrazy Points: 2217 More actions · Answer 2

Hi Lynn,

Just compared the results and they are same.

I don't understand why you suggest doing this. I mean even the data is different but they are still ordered. As far as I know "A Merge Join occurs on tables where the join columns are presorted." is it?

Nian

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 3

nzhang6666 (8/23/2012)
Hi Lynn,
Just compared the results and they are same.
I don't understand why you suggest doing this. I mean even the data is different but they are still ordered. As far as I know "A Merge Join occurs on tables where the join columns are presorted." is it?
Nian

Perhaps not enough values were returned. Look at the ORDER BY's, each matches the one of the indexes you created. Come up with a small set of rapidly changing data for all three columns and see what each one would return.

Look at the order of the columns between the two indexes:

[SUBTYPE], [Units_No], [RECEIPT_ID]

[SUBTYPE], [RECEIPT_ID], [Units_No]

The data in the first could look like:

ff,1,101

ff,1,103

ff,2,104

ff,2,102

ff,3,102

The same data in the second:

ff,101,1

ff,102,2

ff,102,3

ff,103,1

ff,104,2

See a difference?

nzhang6666 SSCrazy Points: 2217 More actions · Answer 4

Hi Lynn,

I got the point. The receipt_id can not guaranteed to be ordered, if it's not the first column in the ordering columns. In this case the subtype is fixed, so receipt_id is guaranteed to be ordered when it is the second column.

Thank you.