Large OLTP table, Index common field, Slow INSERT performance

  • noeld (6/23/2008)


The cost shows as 14.7% higher (a clear win), but the page reads and the amount of data put through the CPUs are identical. The only difference in the plans is a seek versus a scan. The cost difference is probably due to the built-in mathematical values assigned to the seek/scan operators. Both queries also show 1 ms execution time.

This is NOT a fair comparison. TransactionID is "very" selective. Try using a time-based column and the scans for the "TOP" will be a lot less efficient.

Sorry Noel, again an incorrect statement. It would seem that you may be a bit unclear on how the engine returns data when TOP/ORDER BY and a clustered index on the ORDER BY column are in play. It will either seek or scan to the appropriate starting page and read data in page order until sufficient rows are returned. The starting value (if filtered) or the fact that no filter is present is essentially irrelevant. Even disregarding the OP's statement that "BIGINT is a high resolution timestamp" (which IMHO makes an identity on 113K rows a reasonable approximation of 20M+ rows with that design), a NON-specific value for a clustered index has essentially NO bearing on the cost and IO when TOP and ORDER BY are used.

    --Here the OrderDateKey represents a time series field that isn't very specific

--let's make a bigger table this time too

    select *

    into #frs

    from (

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales) as t

    go

    --(425985 row(s) affected)

    backup log tempdb with truncate_only

    go

    create clustered index idx1 on #frs (OrderDateKey)

    go

    backup log tempdb with truncate_only

    go

    select OrderDateKey, count(*)

    from #frs

    group by OrderDateKey

    793 has 26320 rows out of 425985 --> very non-specific

    1067 has 21014 rows and is the highest key

    select top 100 *

    from #frs

    order by OrderDateKey desc

    0.0038661 cost, CI scan

    logical reads 4

    select top 100 *

    from #frs

WHERE OrderDateKey = 1067 --highest value, so the same rows the unfiltered TOP returns first

    order by OrderDateKey desc

    0.0038628, CI seek

    logical reads 3

    select top 100 *

    from #frs

    WHERE OrderDateKey = 793 --non-specific value

    order by OrderDateKey desc

    0.0038635, CI seek

    logical reads 3

    select top 100 *

    from #frs

    WHERE OrderDateKey between 700 and 800

    order by OrderDateKey desc

    0.0038662, CI seek

    logical reads 3

NOTE: this is actually 0.0000001 MORE than the unfiltered query above, despite 1 fewer logical read

    select top 100 *

    from #frs

    WHERE OrderDateKey between 1030 and 1035

    order by OrderDateKey desc

    14 rows only

    0.0032988

    logical reads 3

even a very specific filter that results in < 100 total rows doesn't save anything on reads and only a tad on the cost

    drop table #frs

    go

    backup log tempdb with truncate_only

    go

1 ms execution time for all. The seek does save 25% of the IO, but that is ONE 8 KB page. I suppose you could call that "a lot less efficient", but I do not in the big picture of server activity. Also, the less specific your time values are, the less effective the range filter you proposed becomes: 1030 to 1035 has a total of 14 rows matching the filter, but 793 to 793 has 26320 rows matching the filter.
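For anyone who wants to reproduce the reads and timing numbers, a minimal sketch is to wrap the queries in SET STATISTICS IO and SET STATISTICS TIME (run it before the DROP TABLE #frs step above, so the table still exists in the session):

set statistics io on;
set statistics time on;
go
select top 100 * from #frs order by OrderDateKey desc;
select top 100 * from #frs where OrderDateKey between 700 and 800 order by OrderDateKey desc;
go
set statistics io off;
set statistics time off;
go
--the Messages tab shows "logical reads N" plus CPU and elapsed time per statement,
--which is where the figures quoted above come from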

    You could actually create a table with EVERY value the same for the clustered index and it would be the same result.

    select *

    into #frs

    from (

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, 1 as OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, 1 as OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, 1 as OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, 1 as OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, 1 as OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, 1 as OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales

    union all

    select SalesOrderNumber, OrderQuantity, SalesAmount, ProductKey, 1 as OrderDateKey

    from AdventureWorksDW.dbo.FactResellerSales) as t

    go

    --(425985 row(s) affected)

    backup log tempdb with truncate_only

    go

    create clustered index idx1 on #frs (OrderDateKey)

    go

    backup log tempdb with truncate_only

    go

    select top 100 *

    from #frs

    order by OrderDateKey desc

    0.0038661 cost, CI scan

    logical reads 4

    select top 100 *

    from #frs

    WHERE OrderDateKey = 1

    order by OrderDateKey desc

    0.0038661 cost, CI seek

    logical reads 3

    drop table #frs

    go

    backup log tempdb with truncate_only

    go

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

• I have to disagree with your statement "again". The fact that BIGINT is the datatype does not by any means imply that the values are going to be very different. The datatype just implies the resolution could be very high, but it has nothing to do with the "DENSITY" or distribution of the data. In fact, time-like columns tend not to be very high-resolution even when the datatype supports it.
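For what it's worth, the actual density is easy to check directly rather than guessing from the datatype. A minimal sketch against the AdventureWorksDW table used earlier in this thread (the statistics name st_OrderDateKey is just an illustrative name):

use AdventureWorksDW;
go
create statistics st_OrderDateKey on dbo.FactResellerSales (OrderDateKey);
go
dbcc show_statistics ('dbo.FactResellerSales', st_OrderDateKey) with density_vector;
go
--"All density" is 1 / (number of distinct values); the closer it is to 1,
--the less selective the column really is, whatever its datatype
drop statistics dbo.FactResellerSales.st_OrderDateKey;
go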

In the case that ALL values are the "same", your scans definitely get worse as the table grows. In addition, the "seek" could cut down not only on IO but also on CPU cost when the tables are very large.

I don't even want to go into partitioning and other strategies that a WHERE clause could benefit from and a TOP/ORDER BY combination cannot!

It is true that the distribution of the values will have an impact, and that is not questionable. We are simply comparing how effective WHERE vs. ORDER BY is, given the possibility to choose.


    * Noel

  • Well, I am at a loss here. I really REALLY thought I had given sufficient proof that if you use TOP/ORDER BY on a clustered column the distribution and specificity of the clustered column is essentially meaningless to the cost of the query. Additional rows will NOT increase the cost of the query other than the extra page read required to get to the starting point in the leaf for each additional level of depth of the clustered index (which actually increases VERY slowly once you get past the 3rd or 4th level).

Partitioning will likewise have no effect when the clustered key is the partitioning key. The optimizer will see the TOP and ORDER BY, go to the last (or first, depending on sort order) partition, navigate to the last or first page, and read sufficient pages to return the TOP N rows. Again, it doesn't matter if you have 100 or 100 quadrillion rows in the table, or 1 or 1000 partitions. And if you throw a WHERE clause in there, it will again go to the appropriate partition (this time based on the partitioning key), navigate to the appropriate page and read pages (the exact same number as without a WHERE clause, give or take one page, assuming all rows are of relatively equal length). The number of pages read will ALWAYS be 3, 4, 5, 6 or so depending on the total number of rows (which sets the index depth) and the row lengths (which determine how many pages need to be read once you get to the correct starting point). You will not see a case where a scan reads 100 pages while a seek reads 4, for example, unless there is something very funky going on with row lengths in different parts of the table.
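As a rough sanity check on the index depth and page counts driving those numbers, a minimal sketch (assuming the #frs table from the earlier script is still around in the session):

select index_depth, index_level, page_count, record_count
from sys.dm_db_index_physical_stats
     (db_id('tempdb'), object_id('tempdb..#frs'), null, null, 'DETAILED');
go
--a seek touches one page per level above the leaf (index_depth - 1) plus however
--many leaf pages it takes to return the TOP N rows, which is why the logical
--reads above hover around 3-4 no matter what the WHERE clause looks like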

Perhaps there is a failure to communicate here and we are for some reason comparing apples and pomegranates without being aware of it. In any case, I yield. 🙂

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

