How do I speed up this DISTINCT statement

Question

Post reply

How do I speed up this DISTINCT statement

Eric-230578

SSC Enthusiast

Points: 178
More actions
January 4, 2006 at 6:22 pm

#111132

Is there any way to speed up this query it takes almost 3 minutes to complete? I am running this on a dev server (2-way, 1GB RAM, all data files on 1 drive, 90GB database). I know....let's try to shoot a watermelon through a straw!!!
PS_KK_ACTIVITY_LOG = 12 million rows
PS_KK_SOURCE_HDR = 1.5 million rows
Query:
SELECT DISTINCT H.BUSINESS_UNIT ,A.LEDGER_GROUP FROM PS_KK_ACTIVITY_LOG A , PS_KK_SOURCE_HDR H WHERE H.KK_TRAN_ID = A.KK_TRAN_ID AND H.KK_TRAN_DT = A.KK_TRAN_DT
IO Statistics:
Table 'PS_KK_SOURCE_HDR'. Scan count 1, logical reads 7371, physical reads 2, read-ahead reads 7397.
Table 'PS_KK_ACTIVITY_LOG'. Scan count 1, logical reads 95857, physical reads 3, read-ahead reads 96598.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

PW-201837 SSC-Insane Points: 20805 More actions · Answer 1

PW-201837

SSC-Insane

Points: 20805

January 4, 2006 at 7:53 pm

#612763

What are the indexes on the 2 tables ?

Kenneth Wilhelmsson SSC-Dedicated Points: 30043 More actions · Answer 2

Not sure any indexes is going to do much good.

The query is a join without further filtering, which means you'll have to scan both tables anyway since the requested result is a distinct list of all units + groups..?

Wihtout the distinct I'd assume the result would be all 12 million rows from the childtable.

/Kenneth

Eric-230578 SSC Enthusiast Points: 178 More actions · Answer 3

Here are the indexes. I have run out of ideas to fix.

PS_KK_ACTIVITY_LOG:

nonclustered located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, KK_TRAN_LN

nonclustered located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, LEDGER

nonclustered located on PRIMARY - LEDGER, KK_TRAN_DT, BUSINESS_UNIT, FISCAL_YEAR, ACCOUNTING_PERIOD, ACCOUNT, BUDGET_REF

nonclustered located on PRIMARY - LEDGER, DEPTID, FUND_CODE, ACCOUNT, BUDGET_REF

nonclustered located on PRIMARY - LEDGER, PROJECT_ID, DEPTID, FUND_CODE, ACCOUNT, BUDGET_REF

nonclustered located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, KK_TRAN_LN, BALANCING_LINE, BUSINESS_UNIT, LEDGER_GROUP, BUDGET_REF

clustered, unique located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT, KK_TRAN_LN, REFERENCED_BUDGET, BALANCING_LINE, LEDGER_GROUP, LEDGER, FISCAL_YEAR, ACCOUNTING_PERIOD, SEQNBR, RVRSL_FLG

PS_KK_SOURCE_HDR:

nonclustered located on PRIMARY - KK_PROC_INSTANCE, KK_PROCESS_STATUS, KK_SOURCE_TRAN, SEQUENCE_NBR_9

nonclustered located on PRIMARY - KK_SOURCE_TRAN, BUSINESS_UNIT, REQ_ID

nonclustered located on PRIMARY - KK_SOURCE_TRAN, BUSINESS_UNIT, PO_ID

nonclustered located on PRIMARY - KK_SOURCE_TRAN, BUSINESS_UNIT, VOUCHER_ID

nonclustered located on PRIMARY - - KK_SOURCE_TRAN, BUSINESS_UNIT, JOURNAL_ID, JOURNAL_DATE

nonclustered located on PRIMARY - KK_PROC_INSTANCE, KK_PROCESS_STATUS, KK_SOURCE_TRAN, SEQUENCE_NBR_9, BUSINESS_UNIT, VOUCHER_ID, KK_TRAN_ID, KK_TRAN_DT

nonclustered located on PRIMARY - DT_TIMESTAMP, KK_TRAN_ID, KK_TRAN_DT

nonclustered located on PRIMARY - BUSINESS_UNIT, PO_ID, KK_TRAN_ID, KK_TRAN_DT

nonclustered located on PRIMARY - BUSINESS_UNIT, VOUCHER_ID, KK_TRAN_ID, KK_TRAN_DT

clustered, unique located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT

Eric-230578 SSC Enthusiast Points: 178 More actions · Answer 4

Well I re-ran a DBCC REINDEXDB on the two tables. Got the time down to 1 minute 15 seconds. Looking at perfmon I am CPU and DISK bound (queue lengths all are above 4).

PW-201837 SSC-Insane Points: 20805 More actions · Answer 5

We also need to see the query execution plan. Put this before the SQL and post the results:

Set Showplan_text on

go

>>The query is a join without further filtering, which means you'll have to scan both tables anyway

I disagree with this. You want to see scanning of the smaller PS_KK_SOURCE_HDR table, but for each attempt to join 1 of these rows to the 12 million details, you'd want an index or clustered-index seek.

I suspect the biggest issue here is the number of columns in the clustered index on PS_KK_ACTIVITY_LOG.

This makes the index huge, and don't forget, this also gets replicated in each non-clustered index of the table for use in bookmark lookups.

eg, run sp_spaceused on PS_KK_ACTIVITY_LOG. Note the results.

Now change the clustered index:

clustered, non-unique located on PRIMARY - KK_TRAN_ID, KK_TRAN_DT

Re-run your DBCC DbReindex and re-run sp_spaceused - how does the size usage change ? How does the query plan change ?

Eric-230578 SSC Enthusiast Points: 178 More actions · Answer 6

StmtText

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

SELECT DISTINCT H.BUSINESS_UNIT ,A.LEDGER_GROUP FROM PS_KK_ACTIVITY_LOG A ,

PS_KK_SOURCE_HDR H WHERE H.KK_TRAN_ID = A.KK_TRAN_ID AND H.KK_TRAN_DT

= A.KK_TRAN_DT

(1 row(s) affected)

StmtText

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

|--Hash Match(Aggregate, HASH[H].[BUSINESS_UNIT], [A].[LEDGER_GROUP]), RESIDUAL[H].[BUSINESS_UNIT]=[H].[BUSINESS_UNIT] AND [A].[LEDGER_GROUP]=[A].[LEDGER_GROUP]))

|--Merge Join(Inner Join, MERGE[H].[KK_TRAN_ID], [H].[KK_TRAN_DT])=([A].[KK_TRAN_ID], [A].[KK_TRAN_DT]), RESIDUAL[A].[KK_TRAN_ID]=[H].[KK_TRAN_ID] AND [A].[KK_TRAN_DT]=[H].[KK_TRAN_DT]))

|--Clustered Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_SOURCE_HDR].[PS_KK_SOURCE_HDR] AS [H]), ORDERED FORWARD)

|--Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_ACTIVITY_LOG].[PSBKK_ACTIVITY_LOG] AS [A]), ORDERED FORWARD)

(4 row(s) affected)

StmtText

------------------------

set statistics io off

(1 row(s) affected)

Eric-230578 SSC Enthusiast Points: 178 More actions · Answer 7

Looks like the index size grew.

1. clustered, unique

name rows reserved data index_size unused

--------------------------------------- ----------- ------------------ ------------------ ------------------ -----------

PS_KK_ACTIVITY_LOG 12556668 10376000 KB 4018136 KB 6357624 KB 240 KB

2. clustered, non-unique

name rows reserved data index_size unused

--------------------------------------- ----------- ------------------ ------------------ ------------------ -----------

PS_KK_ACTIVITY_LOG 12556668 10685552 KB 4018136 KB 6667176 KB 240 KB

Here is the showplan:

ELECT DISTINCT H.BUSINESS_UNIT ,A.LEDGER_GROUP FROM PS_KK_ACTIVITY_LOG A ,

PS_KK_SOURCE_HDR H WHERE H.KK_TRAN_ID = A.KK_TRAN_ID AND H.KK_TRAN_DT

= A.KK_TRAN_DT

(1 row(s) affected)

StmtText

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

|--Hash Match(Aggregate, HASH[H].[BUSINESS_UNIT], [A].[LEDGER_GROUP]), RESIDUAL[H].[BUSINESS_UNIT]=[H].[BUSINESS_UNIT] AND [A].[LEDGER_GROUP]=[A].[LEDGER_GROUP]))

|--Merge Join(Inner Join, MERGE[H].[KK_TRAN_ID], [H].[KK_TRAN_DT])=([A].[KK_TRAN_ID], [A].[KK_TRAN_DT]), RESIDUAL[A].[KK_TRAN_ID]=[H].[KK_TRAN_ID] AND [A].[KK_TRAN_DT]=[H].[KK_TRAN_DT]))

|--Clustered Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_SOURCE_HDR].[PS_KK_SOURCE_HDR] AS [H]), ORDERED FORWARD)

|--Index Scan(OBJECT[FXDUSR].[dbo].[PS_KK_ACTIVITY_LOG].[PSBKK_ACTIVITY_LOG] AS [A]), ORDERED FORWARD)

(4 row(s) affected)

PW-201837 SSC-Insane Points: 20805 More actions · Answer 8

I have a similar, but slightly smaller data set that I was able to test with. An Account table (500K rows) and transactions on each account (5 million rows). Selecting a Distinct column pair, with no WHERE clause that hits every record but summarises down to 3K rows, my query will always clustered index seek on the larger table and return in less than 10 seconds.

Depending on the relative selectivity of your columns, you could try changing the column order in the clustered index - are there multiple KK_TRAN_ID's per KK_TRAN_DT value or vice versa ?

You could also try creating a non-clustered covering index - since the only 3 columns needed from your larger table are KK_TRAN_ID, KK_TRAN_DT and LEDGER_GROUP, you could create a 3-column index on these 3 columns, and the query would be able to use the index only without needing bookmark lookups to the data pages.

Eric-230578 SSC Enthusiast Points: 178 More actions · Answer 9

I added the covering key same results. Besides the disk queue length is stuck at between 4 and 12. Until then....waiting on hardware. Thanks for your help.