June 17, 2008 at 7:14 pm
Hi All,
I'm trying to design an efficient OLTP database which can handle a high INSERT rate (at least 100 per second), but still be searched relatively efficiently.
My primary OLTP table looks like:
CREATE TABLE Transactions ( Time BIGINT NOT NULL, CustId VARCHAR(20), TermId VARCHAR(12), Amt FLOAT, ... )
There are roughly 800,000 unique customer ids, 100,000 unique terminal ids.
So for every customer transaction an entry goes into this table (BIGINT is a high resolution timestamp). The clustered index is defined on Time, which is normally increasing. As you'd expect, the INSERT performance on the table as it stands is very good.
However, there are two common search cases:
- Most recent transactions by customer - SELECT TOP 100 * from Transactions where CustId = '010101037' order by Time desc
- Most recent transactions at terminal - SELECT TOP 100 * from Transactions where TermId = '475794' order by Time desc
In order to satisfy these requests, I basically need an index on both TermId and CustId. As soon as I do this and the table grows past 20 million rows, INSERT performance consistently decreases.
I assume this is because of the increasing fragmentation on the TermId and CustId indexes?
Still, I'm convinced this must be an incredibly common type of application (e.g., customer service at a bank looking up recent customer transactions). Am I missing some approach that will improve performance?
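For illustration, the two nonclustered indexes in question would be something along these lines (index names are placeholders; the clustered key, Time, is automatically carried in every nonclustered index):
-- Sketch only: supporting indexes for the two search cases (names are placeholders).
CREATE NONCLUSTERED INDEX IX_Transactions_CustId ON Transactions (CustId);
CREATE NONCLUSTERED INDEX IX_Transactions_TermId ON Transactions (TermId);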
Thanks,
Shaun
June 19, 2008 at 3:24 am
If doing without these indexes is out of the question, maybe you'll have to rebuild/reorganize them more frequently.
I recommend the script from Lara Rubbelke http://blogs.digineer.com/blogs/larar/archive/2006/08/16/smart-index-defrag-reindex-for-a-consolidated-sql-server-2005-environment.aspx
Wilfred
The best things in life are the simple things
June 19, 2008 at 5:10 am
Just a quick thought. Are you setting a fill factor on your indexes or leaving them at 100%?
June 20, 2008 at 3:12 am
Also consider: 1) changing the recovery model, 2) using TABLOCK.
http://www.ITjobfeed.com
June 20, 2008 at 4:50 am
Are you aware that "TIME" is a reserved word and that there is a "TIME" datatype in SQL Server 2008? Please consider renaming the column. The same caution applies to "DATE".
Regarding your throughput problem, have you considered using partitioning?
For the partitioning design you need to decide whether to optimize for inserts or for selects. For the common search cases, do you really mean the last 100, or is the real requirement "all Transactions that have occurred within a historical period"? E.g., is this a more appropriate query:
SELECT *
from Transactions
where CustId = '010101037'
AND TIME >= (some value)
order by Time desc
If it is a historical period, then you need to decide the granularity of time and any "rounding". Granularity could be any of hours, days, months, years or centuries. Rounding could be exact (last 60 days) or have a cut-off (since the start of the previous month).
If it is within a historical period, then partitioning by Time ranges would seem to make the most sense, applying the same partitioning scheme to the secondary indexes.
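A rough sketch of what that could look like (purely illustrative - the boundary values just assume the BIGINT timestamp encodes something date-like, and partitioning requires Enterprise Edition):
-- Sketch only: range partitioning on the BIGINT Time column.
-- Boundary values are placeholders; real ones would be derived from the
-- actual epoch/granularity of the timestamp.
CREATE PARTITION FUNCTION pfTransactionsByTime (BIGINT)
AS RANGE RIGHT FOR VALUES (20080101000000000, 20080401000000000, 20080701000000000);
CREATE PARTITION SCHEME psTransactionsByTime
AS PARTITION pfTransactionsByTime ALL TO ([PRIMARY]);
-- Rebuild the clustered index (and therefore the table) on the scheme;
-- creating the secondary indexes on the same scheme keeps them aligned.
CREATE CLUSTERED INDEX CIX_Transactions_Time
ON Transactions ([Time]) ON psTransactionsByTime ([Time]);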
SQL = Scarcely Qualifies as a Language
June 20, 2008 at 8:24 am
My guess is that you need a lower fill factor on the NC indexes, coupled with routine maintenance to get back to that fill factor. This will minimize the page splits that occur during inserts, which are the likely culprit of the slowness. Also use the PAD_INDEX and SORT_IN_TEMPDB settings. If you have Enterprise Edition you can use the online index rebuild feature if you require 24/7 insert availability.
You can monitor the Access Methods: Page Splits/sec perfmon counter to see if page splits are occurring.
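For the rebuild itself, something along these lines (a sketch only - index names are placeholders, 80 is just a starting point to tune, and ONLINE = ON needs Enterprise Edition):
-- Sketch only: rebuild the NC indexes leaving free space on each page.
ALTER INDEX IX_Transactions_CustId ON Transactions
REBUILD WITH (FILLFACTOR = 80, PAD_INDEX = ON, SORT_IN_TEMPDB = ON, ONLINE = ON);
ALTER INDEX IX_Transactions_TermId ON Transactions
REBUILD WITH (FILLFACTOR = 80, PAD_INDEX = ON, SORT_IN_TEMPDB = ON, ONLINE = ON);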
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 20, 2008 at 12:09 pm
You must definitely "limit" the Time column with a "range"; once you do that you should be fine.
* Noel
June 20, 2008 at 1:23 pm
noeld (6/20/2008)
You must definitely "limit" the Time column with a "range"; once you do that you should be fine.
This is being done by the combination of TOP 100 and ORDER BY. Since the clustering key (Time) is carried in the NC index he will place on the filter column, it should be just spiffy quick to access the data.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 21, 2008 at 10:27 am
jvamvas (6/20/2008)
Also consider: 1) changing the recovery model, 2) using TABLOCK.
What????
Do the two queries in question need to be SELECT *?
Have you checked the execution plans of the select queries to ensure that the indexes are being used?
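As a quick sanity check (a sketch, SQL 2005 and up), the index usage DMV will also show whether the nonclustered indexes are getting seeks or only taking updates:
-- Sketch only: seek/scan/lookup vs. update counts for indexes on Transactions.
SELECT i.name, s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id  = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.Transactions');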
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
June 21, 2008 at 8:37 pm
Indexes are a good strategy; try to accommodate and use covering indexes to improve query performance...
Also look into table partitioning and multiple filegroups.
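On the covering index point: if the lookup screens can list the columns they need instead of SELECT *, something along these lines (a sketch - the INCLUDE list is only a guess at the required columns) avoids the bookmark lookups:
-- Sketch only: covering index for the "recent transactions by customer" query,
-- assuming it selects only Time, TermId and Amt rather than *.
CREATE NONCLUSTERED INDEX IX_Transactions_CustId_Covering
ON Transactions (CustId, [Time])
INCLUDE (TermId, Amt);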
Maninder
www.dbanation.com
June 23, 2008 at 9:30 am
TheSQLGuru (6/20/2008)
noeld (6/20/2008)
You must definitely "limit" the Time column with a "range"; once you do that you should be fine.
This is being done by the combination of TOP 100 and ORDER BY. Since the clustering key (Time) is carried in the NC index he will place on the filter column, it should be just spiffy quick to access the data.
I have to disagree. TOP 100 and ORDER BY in a very large table could cause a SCAN! A WHERE clause with upper and lower bounds will make it into a SEEK. Now that will really do the job.
* Noel
June 23, 2008 at 10:17 am
noeld (6/23/2008)
TheSQLGuru (6/20/2008)
noeld (6/20/2008)
You must definitely "limit" the Time column with a "range"; once you do that you should be fine.
This is being done by the combination of TOP 100 and ORDER BY. Since the clustering key (Time) is carried in the NC index he will place on the filter column, it should be just spiffy quick to access the data.
I have to disagree. TOP 100 and ORDER BY in a very large table could cause a SCAN! A WHERE clause with upper and lower bounds will make it into a SEEK. Now that will really do the job.
You can disagree all you want, but there are two problems:
1) Your requirement is not in line with the OP's need. How can you be sure of getting the necessary 100 rows with a specific WHERE clause value if you limit the time range?
2) A scan of the clustered index in this case can be a good thing, because the OP was using SELECT TOP 100 *, which would require a bookmark lookup to get the data even if a NC index seek on a where clause column were to occur. Consider AdventureWorks.Production.TransactionHistory. This table has clustered PK on TransactionID (which is int identity and is likely a fair representation of using bigint for time series) and a NC index on ProductID, so it is a good approximation of this issue.
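(To reproduce the logical reads and timing figures quoted under each query below, turn on the statistics output first; the cost figures come from the query plans.)
SET STATISTICS IO ON;   -- per-query logical reads
SET STATISTICS TIME ON; -- per-query CPU and elapsed time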
select top 100 *
from Production.TransactionHistory
where ProductID = 712 --2348 out of 113443 values
order by TransactionID desc
0.0357 cost, CI scan
logical reads 37
select top 10 * --using smaller value to better simulate the ratios given by the OP
from Production.TransactionHistory
where ProductID = 712 --2348 out of 113443 values
order by TransactionID desc
0.0065, CI scan
logical reads 13
--forcing index seek
select top 10 *
from Production.TransactionHistory with (index=IX_TransactionHistory_ProductID)
where ProductID = 712 --2348 out of 113443 values
order by TransactionID desc
0.036 cost (gets worse if you use top 100)
logical reads 37
Thus for larger-percentage values such as 712 (2348/113443), the optimizer predicts (correctly in this case) that a CI scan results in lower cost due to the lower number of pages necessary to hit 100 ProductID = 712 rows in descending TransactionID order. Certainly a situation could exist where ProductID values were not distributed (semi)uniformly and this plan could be less efficient than an index seek on ProductID.
--now for a more selective value
select top 100 *
from Production.TransactionHistory
where ProductID = 521 --only 200 out of 113443 values
order by TransactionID desc
0.312 cost, ProductID index seek, bookmark lookup
logical reads 317
--due to higher specificity, optimizer realizes it would take too many CI page reads to hit the required 100 rows with ProductID = 521
--force CI scan to prove this
select top 100 *
from Production.TransactionHistory WITH (index=0)
where ProductID = 521 --only 200 out of 113443 values
order by TransactionID desc
0.745 cost
logical reads 792 (entire table)
Perhaps someone else could do some testing with larger numbers of rows (IIRC the OP mentioned 20M+ with distinct numbers of values given for the two filter columns).
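If anyone wants to try it, a rough generator along these lines would build an approximation of the OP's table (a sketch only - the table/column names follow the OP, and the uniform id distribution is an assumption):
-- Sketch only: ~20M rows, ~800K customer ids, ~100K terminal ids, clustered on Time.
CREATE TABLE TransactionsTest
(
    [Time] BIGINT      NOT NULL,
    CustId VARCHAR(20) NOT NULL,
    TermId VARCHAR(12) NOT NULL,
    Amt    FLOAT       NULL
);
CREATE CLUSTERED INDEX CIX_TransactionsTest_Time ON TransactionsTest ([Time]);
-- Generate rows via a cross-joined number list; Time is just an increasing counter here.
;WITH N AS
(
    SELECT TOP (20000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b CROSS JOIN sys.all_objects c
)
INSERT INTO TransactionsTest ([Time], CustId, TermId, Amt)
SELECT n,
       CAST(n % 800000 AS VARCHAR(20)),
       CAST(n % 100000 AS VARCHAR(12)),
       n % 1000 * 0.01
FROM N;
CREATE NONCLUSTERED INDEX IX_TransactionsTest_CustId ON TransactionsTest (CustId);
CREATE NONCLUSTERED INDEX IX_TransactionsTest_TermId ON TransactionsTest (TermId);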
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 23, 2008 at 10:25 am
TheSQLGuru (6/23/2008)
[snip - full AdventureWorks analysis quoted in the post above]
All I am saying is that SELECT TOP 100 * ... ORDER BY without a WHERE CLAUSE is not a good thing especially when you cluster by TIME. Limiting the range (IF possible) would provide fast response. I do understand though that it may not be possible.
* Noel
June 23, 2008 at 11:17 am
All I am saying is that SELECT TOP 100 * ... ORDER BY without a WHERE CLAUSE is not a good thing especially when you cluster by TIME. Limiting the range (IF possible) would provide fast response. I do understand though that it may not be possible.
I disagree with this statement as well.
1) The OP DID have a where clause on both of his sample queries.
2) TOP 100 * ... ORDER BY is a VERY efficient plan as long as the field ORDER'd by is indexed (either clustered or nonclustered will suffice). That was the case for the OP as well but I believe that this is a generically valid statement.
3) Limiting the range of a CI value (versus having no WHERE clause at all) does provide some savings on query cost, but none on IO or actual server effort.
select top 100 *
from Production.TransactionHistory
order by TransactionID desc
0.0039, CI scan
logical reads 3
select top 100 *
from Production.TransactionHistory
--filter to produce the exact same result as the no-where-clause query above
where TransactionID between 213343 and 213442
order by TransactionID desc
0.0034
logical reads 3
The no-WHERE-clause query's cost shows as 14.7% higher (so the filtered version looks like a clear win), but the page reads and amount of data put through the CPUs are identical. The only difference in the plans is seek versus scan. The cost difference is probably due to the built-in mathematical values assigned to the seek/scan operators. Both queries show a 1ms execution time as well.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 23, 2008 at 11:26 am
The no-WHERE-clause query's cost shows as 14.7% higher (so the filtered version looks like a clear win), but the page reads and amount of data put through the CPUs are identical. The only difference in the plans is seek versus scan. The cost difference is probably due to the built-in mathematical values assigned to the seek/scan operators. Both queries show a 1ms execution time as well.
This is NOT a fair comparison. TransactionID is "very" selective. Try using a time-based column and the scans for the "TOP" will be a lot less efficient.
* Noel