October 4, 2007 at 8:35 am
Hello,
Sorry if this is not the right forum for this, but I'm curious :).
We're evaluating the possibility of using partitioning in our SQL database. We have some tables that contain several million rows and everytime we need to perform COUNT() on them it's quite slow. In order to improve that, I would like a couple of clarifications regarding partitions in SQL 2005 because the BOL did not give much overview other than T-SQL statements 🙁
1) Can you partition by a bit, and inside that partition, have another partition by a date column (e.g. we have a deleted column that is a bit and would like to partition primarily by that, and then partition further by date when the deleted value is 1).
2) From what I understand, you must create your indexes on each partition as well otherwise you still browse all partitions when performing queries. Is that correct?
3) How do you implement partitioning for a structure that is already in place? Is there an easy/faster way to do it? I will need to basically partition my table and then re-create my indexes in each provided my hunch about question (2) is true.
Thanks a lot 🙂
Greg
October 4, 2007 at 8:57 am
Hi Greg,
1: You can only partition by a single column, and you cannot have subpartitions
2: indexes can be tpartitioned in different ways. You can align them with your table, but if you often join with other partitioned tables, you might want to consider partitioning you index
along differently. See http://msdn2.microsoft.com/en-us/library/ms187526.aspx
You do not need to create individual indexes on the partitions, if you create an index on the table it will create an aligned partitioned index.
3: recreating the idexes on the table is sufficient.
Regards,
Andras
October 4, 2007 at 8:59 am
Thanks a lot for the help and the amazingly quick response 🙂
October 4, 2007 at 9:10 am
Of course - you CAN partition based on a computed column, so you could backdoor your way into your partitioning scheme using the right computed column (like, say,
CASE when <deletedbit>=1 then '1' + convert(char,<deletedate>, 103) else '000000000' end
you wouldn't have multiple levels of partitions (no such thing), but you would have your partitions spread the way you wish.
Note: you must mark a computed column used in this way as PERSISTED.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 4, 2007 at 11:32 am
Interesting idea actually thanks! 🙂
I was wondering though. Let's assume I create my partition on this bit column. I first insert a new row with the deleted value to 0. It goes into one partition. Does that mean that it will automatically change partition the moment I do UPDATE table SET deleted=1 ?
This is basically just some metadata that is used when reading data to figure out if only certain parts of the table are relevant to the query?
Greg
October 5, 2007 at 1:41 am
The idea to use a persisted computed column is indeed nice. Concerning whether the row is moved to another partition if you update the column on which the computed column is based, the answer is yes. SQL Server will internally calculate the new computed expression, and it will change the update to a delete and insert, so the row will end up on the correct partition. This is of course a tiny bit more expensive than an inplace alter (both partition's indexes will be updated)
Regards,
Andras
October 5, 2007 at 5:27 am
Hi,
thanks for the answer. I've run some preliminary tests since we're trying to figure out if we're better off partitioning or using separate tables with triggers to move the data from one to the other. here are my results, do they make sense from a theoretical point of view?
Table: 1,900,000 records. about 30 columns, about 1000 only have the deleted flag to 0.
statement: SELECT * FROM table where deleted = 0 (there is a nonclustered index on deleted)
Without Partitioning: approx. 8 sec.
With Partitioning: approx. 6 sec.
With a separate table: approx. 4.5 sec
There are other joins in the real query. My point is just that the separate table is still more performant. I'm wondering if what this tells me is true, and that partitioning can only do so much but it will not be as good as a separate physical table, the main point being about manageability as well and having really only one table which avoids potential UNION etc... What I like about the 2 tables is that I can have an indexing strategy for one table and another one for the other that would be used for reporting etc...
Am I right or is there something completely off in the way I'm looking at this?
Thanks,
Greg
October 5, 2007 at 7:48 am
The main purpose of partitioning is to try to spead out the work across multiple disk sets threads and processors to speed things up. The best performance is when you manage to get SQL server to spread them across ALL four (separate disk controller, separate RAID group, separate threads, and separate processors). If you keep the partitions on the same disk sets - you will achieve only a marginal gain (like you observed). Of course - that optimal scenario is rather expensive since it involves more hardware, so it's not always worth it.
If the majority of your queries are to look only that the active stuff, and everything else is just there for archival purposes, maybe for the occasional report, then I'd say you're right, the extra overhead of partitioning probably won't be worth it: I'd pursue the two separate tables concept (active/archive table). Like you said - it's a little more work on the maintenance side (remembering to add new columns to both, moving data between active and inactive, etc...), but if the reality is that you don't run a phenomenal amount against both active and inactive - then you're just slowing down any processing on the actives by keeping the inactives in there.
The overhead of unioning two tables together, especially reasonably sized tables like yours (Partitioning was built with LARGE record sizes in mind - say 200M+ records) would likely be smaller than that in a partitioned environment. If you start unioning multiple tables, the table size keeps growing very quickly, or a lot access to a LOT of the data, then the thought about partition should probably resurface.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 5, 2007 at 8:42 am
I'm curious as to how you're doing the count(). Is it based on this bit column?
The reason I ask is that a select count(*) should complete very quickly, even with 10M rows. This used to go to sysindexes, which is very quick. Alternatively it can use anyindex, not needing to scan data. If you have a count(bit col), then that should still be fast. If you have less 0s, then you could select count(*), then select count(*) where bit = 0 and do the math.
Partitioning is often used to avoid scanning some data, like with dates (the classic). You might rarely access data older than 30 days, so you might use a 30 day partition for the newest data and another for everything older. If can spread the load as well if you always need to scan a lot of data.
October 5, 2007 at 8:52 am
Hi,
The COUNT() was one of the subqueries we use. We basically want to do something like this: SELECT COUNT(*) FROM table WHERE DELETED=0 AND ...
I mean that's of course not the only query that we use for the application, the COUNT() is here one of the cases that proved fairly slow. There are many other cases where we simply join on this table like JOIN ON A.ID=B.ID AND B.DELETED=0. These are 99.99% of the cases involved as a matter of fact.
The Count() was here for simplicity sake, but I totally understand what you mean.
I believe we will go with a mix of the architectures actually, like 2 tables, and in the table with the deleted data, a partition by a date value that we use in our reporting queries. This way we can try to have the best of both worlds. The maintenance will be a bit more complicated but simply partitioning by the BIT column would have not been very useful when we had to go in the partition where DELETED=1 since 99% of the data would be there... The other nice thing with that is that we can automatically implement a sliding window strategy to keep only 1 year worth of partitions of deleted data for example without too much manual maintenance. Chances are the partitions will be on the same disks unfortunately, this is an application that customers can install and they all have a variety of different setups in place, but we're looking into helping with that as well.
For those who have tried, how is the whole movign partitions etc... doable via scheduled jobs? Is there everything needed in T-SQL to script these kind of things?
Greg
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply