June 6, 2003 at 11:17 am
I have a table with approximately 100 million rows (~30 million rows per day, storing up to 30 days).
A procedure runs every day that deletes any data older than 30 days.
Another procedure runs every day that inserts one day's worth of data (~30 million rows).
This table is queried with a join on the userID field and the Date field,
e.g. select * from t100MillionRows A
join t10MillionRows B on A.userID = B.userID and A.Date <= B.Date
Any suggestions as to how to optimally index this table, given that there is so much turnover of the data?
What would be an appropriate FILLFACTOR?
Can I take advantage of the fact that all rows are inserted in ascending Date order (not including time)?
Some additional information:
1)
The table takes ~8 gigs of space with no indexes. Space is not a big issue. I have nearly 200 gigs to work with.
2)
Since this is a daily process, is it worth indexing, or will the time it takes to create the indexes exceed the benefit they yield? (I only need to query the table three times.)
3)
The userID column consists of a 24-digit number. Since I join on this column, would it be better to replace that varchar field with 3 integer fields and then join on 3 fields?
4) The table structure:
create table t100MillionRows (
UserID varchar(24),
Date datetime,
ID1 int,
ID2 int,
ID3 int,
ID4 int,
word1 varchar(50))
Let me know if I left out any key information.
Thanks,
-J
June 6, 2003 at 11:45 am
1) Fillfactor: This depends on how often you rebuild the indexes. If you can rebuild the indexes right after the DELETE/INSERT, and there are no updates throughout the querying period, then set FILLFACTOR = 100, maximizing the number of index keys per extent retrieved on a read (see the sketch below).
2) Can the varchar(24) field be converted into a straight numeric type without difficulty? If it can, that might be your best choice of datatype for the UserID column.
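Something along these lines, as a rough sketch (this assumes SQL Server 2000 syntax; the ALTER COLUMN line is only an illustration and assumes every existing UserID value converts cleanly):

-- Rebuild every index on the table right after the daily DELETE/INSERT,
-- packing the pages completely since nothing is updated before the queries run.
DBCC DBREINDEX ('t100MillionRows', '', 100)

-- On the datatype: 24 digits will not fit in a bigint (19 digits max),
-- but decimal(24, 0) holds it in 13 bytes versus 24+ for the varchar
-- and joins on a single numeric key.
-- ALTER TABLE t100MillionRows ALTER COLUMN UserID decimal(24, 0)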
Edited by - jpipes on 06/06/2003 11:46:50 AM
June 6, 2003 at 12:28 pm
If I were you, I would seriously consider splitting the table.
By that I mean: if more than 60-70% of your data is not being
used on a regular basis, I would move it to another table.
This means you can create separate clustered indexes for both tables.
The table with all the old data will not see any data insertion, therefore index
maintenance is minimal. On the other hand, the original table might cause performance
issues due to the fact that indexes have to be maintained during data insertion.
But if data insertion is scheduled during non-peak hours, this issue might be irrelevant.
Now that the workload has been distributed, you can concentrate on index selection.
Based on your query, I suggest creating a clustered index on the userID and Date columns, using
a FILLFACTOR of 85-90 to compact your data pages. This will help your query a great deal,
but you have to constantly maintain your index and update the statistics at least every 3 days due to the high volume of data. Your index maintenance can be done by scheduling a nightly job.
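A rough sketch of what I mean (SQL Server 2000 syntax; the index name is made up):

-- Clustered index supporting the join on userID and Date,
-- leaving some free space per page for the daily inserts.
CREATE CLUSTERED INDEX IX_t100MillionRows_UserID_Date
ON t100MillionRows (UserID, Date)
WITH FILLFACTOR = 90

-- Nightly maintenance job: refresh statistics and rebuild the index.
UPDATE STATISTICS t100MillionRows WITH FULLSCAN
DBCC DBREINDEX ('t100MillionRows', 'IX_t100MillionRows_UserID_Date', 90)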
Remember this is just a suggestion and you need to experiment to find what works
best in your case. SQL's performance is directly related to how much memory and
processing power you can throw at it.
I hope this helps, and if you need more info drop me a line @ mickeytech@yahoo.com
MW
June 6, 2003 at 12:56 pm
It sounds like a high volume OLTP environment. The typical challenge with this type of environment is keeping the indexes optimized and dealing with the contention of all those inserts while you are trying to delete. If so, you may want to take advantage of a couple of techniques.
a) You might benefit from creating an additional surrogate key field such as an identity column so you have a much smaller key.
b) You might want to look into partitioning your data daily, with a view that performs a UNION ALL query for reports, etc. (rough sketch below). Read BOL on partitioned views. Going this route will require some sophistication in a nightly job to create new tables, drop old tables and modify the union view to accommodate these schema changes.
You will gain three primary advantages. First, re-indexing will only be performed once (on yesterday's table) on a given dataset. Second, there is no locking contention between the inserted and deleted data, because you are dropping a table rather than performing deletes on the table where transactions are being inserted. Third, you don't have all those deletes being recorded in the transaction log.
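Just as a rough sketch (SQL Server 2000 syntax; the table, view and constraint layout are made up for illustration), each day's load goes into its own table with a CHECK constraint on Date, and the view unions the retained days:

-- One table per day; the CHECK constraint lets the optimizer skip
-- tables that cannot match a Date filter.
CREATE TABLE tRows_20030606 (
UserID varchar(24) NOT NULL,
Date datetime NOT NULL CHECK (Date >= '20030606' AND Date < '20030607'),
ID1 int, ID2 int, ID3 int, ID4 int,
word1 varchar(50))

-- The nightly job recreates the view after adding the new day's table
-- and dropping the 31-day-old one.
CREATE VIEW vAllRows
AS
SELECT * FROM tRows_20030606
UNION ALL
SELECT * FROM tRows_20030605
-- ... one SELECT per retained day

If you also want the view to be updatable as a true partitioned view, the partitioning column has to be part of each member table's primary key, but for read-only reporting the CHECK constraints are what matter.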
We're recording 15MM transactions/day and this works well for us. Hope this helps.
Bill Bertovich
bbertovich@interchangeusa.com
June 9, 2003 at 10:50 am
Thanks very much for all the replies.
As per Bertovich's suggestion, I think I'm going to work towards storing each day's data in a separate table and using a partitioned view to access it.
The transaction log benefits are huge, and the userID/date primary key definitely lends itself to a partitioned view.
I hope it works out, I'll let you know if I run into any snags.
Thanks again.
-J